TiledTensor / TiledCUDA
TiledCUDA is a highly efficient kernel template library designed to elevate CUDA C’s level of abstraction for processing tiles.
MIT License
158 stars
10 forks
issues
fix(core): Bug fix for loading column-major register tile for tensor core gemm.
#60
haruhi55
closed
4 months ago
0
fix(unittest): bug fix for tensor core gemm.
#59
haruhi55
closed
4 months ago
0
Transfer data tile from global memory to registers.
#58
haruhi55
closed
4 months ago
0
Enhance the unit tests for storing Tensor Core's WMMA output tile.
#57
haruhi55
closed
2 months ago
0
Need for clean organization of register tile for Tensor Core output
#56
haruhi55
closed
3 months ago
0
pass unittest for tensor core gemm.
#55
haruhi55
closed
4 months ago
0
feat(core): store register tile to shared memory.
#54
haruhi55
closed
4 months ago
0
(feat): transfer data tile from shared memory to register using ldmatrix.
#53
haruhi55
closed
5 months ago
0
Wrap `ldmatrix` to implement shared-to-register copy.
#52
haruhi55
closed
5 months ago
0
Upgrade the C++ standard to C++20.
#51
haruhi55
closed
5 months ago
0
(feat): Add a straightforward implementation for tile iterator.
#50
haruhi55
closed
5 months ago
0
Add a straightforward implementation for TileIterator
#49
haruhi55
closed
5 months ago
0
(feat): add possible interfaces for register-level gemm.
#48
haruhi55
closed
5 months ago
0
🚧 Wrap wmma.
#47
haruhi55
closed
5 months ago
0
`TileShape` is insufficient to fully describe a copy plan.
#46
haruhi55
closed
5 months ago
0
Enable support for row-major and column-major shared memory tiles
#45
haruhi55
closed
5 months ago
0
Enhancing shared memory access for 2D warp organization
#44
haruhi55
closed
5 months ago
0
wrap wmma.
#43
haruhi55
closed
5 months ago
0
Fix bug and add unittest for loading data using ldmatrix.
#42
haruhi55
closed
6 months ago
2
Enable `cp.async` for transferring data from global memory to shared memory.
#41
haruhi55
closed
6 months ago
0
Enable `cp.async` when loading data from global memory to shared memory.
#40
haruhi55
closed
6 months ago
0
Update cutlass to 3.5.0.
#39
haruhi55
closed
6 months ago
0
Ensure consistency in the use of swizzled shared memory layout
#38
haruhi55
closed
3 months ago
0
The gemm kernel does not use swizzled shared memory layout.
#37
haruhi55
closed
6 months ago
0
Update cutlass version to 3.5.0
#36
KuangjuX
closed
6 months ago
0
Unittest for the copy_s2r macro kernel to ensure correctness.
#35
haruhi55
closed
6 months ago
0
Implement the macro kernel that stores data from register to shared memory.
#34
haruhi55
closed
4 months ago
0
copy data from shared memory into register
#33
haruhi55
closed
6 months ago
0
Implement a small fixed length device array.
#32
haruhi55
closed
6 months ago
0
A gemm implementation that exposes minimal programming concepts.
#31
haruhi55
closed
5 months ago
0
Implement gemm by directly issuing wmma under the macro kernel.
#30
haruhi55
closed
5 months ago
0
Implement data transfer between shared memory and register file using ldmatrix.
#29
haruhi55
closed
6 months ago
0
data transfer between shared memory and register.
#28
haruhi55
closed
6 months ago
0
Reduce the number of warnings in the build.
#27
haruhi55
closed
6 months ago
0
Add multi-staged pipelined GEMM.
#26
haruhi55
opened
6 months ago
0
feat(unittest): Implement basic unittest for transferring 2D data tiles between global and shared memory
#24
haruhi55
closed
7 months ago
0
Incorrect version of the googletest submodule.
#23
haruhi55
closed
7 months ago
0
fix(unittest): bug fix for python unittest.
#22
haruhi55
closed
7 months ago
0
feat(cmake): port googletest into the project
#21
haruhi55
closed
7 months ago
0
Distracting cmake warnings if `PYTHON_EXECUTABLE` is not set when building `torchlib`
#20
haruhi55
closed
7 months ago
0
feat(test): Add test for Lstm Cell kernel.
#19
KuangjuX
closed
7 months ago
0
Is it possible to remove the hard-coded path of cuda compiler in cmake?
#18
haruhi55
closed
7 months ago
2
Port gtest into this project.
#17
haruhi55
closed
7 months ago
0
Add flash attention based on b2b GEMM
#16
KuangjuX
closed
2 months ago
0
chore: Enable `-Werror` and fix all warnings.
#13
KuangjuX
closed
7 months ago
0
Enable `-Werror` and fix warnings.
#12
KuangjuX
closed
7 months ago
0
feat(kernel): Add Batched Gemm kernel.
#11
KuangjuX
closed
7 months ago
0
chore: fix kernel function parameter type.
#10
KuangjuX
closed
7 months ago
0
chore: Add Cuda Info functions.
#9
KuangjuX
closed
7 months ago
0
Add test for Lstm Cell.
#8
KuangjuX
closed
7 months ago
0