TiledTensor TiledCUDA issues

TiledCUDA is a highly efficient kernel template library designed to elevate CUDA C’s level of abstraction for processing tiles.

MIT License

158 stars, 10 forks

issues

fix(core): Bug fix for loading a column-major register tile for tensor core gemm.

#60 haruhi55 closed 4 months ago
0
fix(unittest): bug fix for tensor core gemm.

#59 haruhi55 closed 4 months ago
0
Transfer data tile from global memory to registers.

#58 haruhi55 closed 4 months ago
0
Enhance the unit tests for storing Tensor Core's WMMA output tile.

#57 haruhi55 closed 2 months ago
0
Need for clean organization of register tile for Tensor Core output

#56 haruhi55 closed 3 months ago
0
Pass unittest for tensor core gemm.

#55 haruhi55 closed 4 months ago
0
feat(core): store register tile to shared memory.

#54 haruhi55 closed 4 months ago
0
(feat): transfer data tile from shared memory to register using ldmatrix.

#53 haruhi55 closed 5 months ago
0
Wrap `ldmatrix` to implement shared-to-register copy.

#52 haruhi55 closed 5 months ago
0
Upgrade the C++ standard to C++20.

#51 haruhi55 closed 5 months ago
0
(feat): Add a straightforward implementation for tile iterator.

#50 haruhi55 closed 5 months ago
0
Add a straightforward implementation for TileIterator

#49 haruhi55 closed 5 months ago
0
(feat): add possible interfaces for register-level gemm.

#48 haruhi55 closed 5 months ago
0
🚧 Wrap wmma.

#47 haruhi55 closed 5 months ago
0
`TileShape` is insufficient to fully describe a copy plan.

#46 haruhi55 closed 5 months ago
0
Enable support for row-major and column-major shared memory tiles

#45 haruhi55 closed 5 months ago
0
Enhancing shared memory access for 2D warp organization

#44 haruhi55 closed 5 months ago
0
wrap wmma.

#43 haruhi55 closed 5 months ago
0
Fix bug and add unittest for loading data using ldmatrix.

#42 haruhi55 closed 6 months ago
2
Enable `cp.async` for transferring data from global memory to shared memory.

#41 haruhi55 closed 6 months ago
0
Enable `cp.async` when loading data from global memory to shared memory.

#40 haruhi55 closed 6 months ago
0
Update cutlass to 3.5.0.

#39 haruhi55 closed 6 months ago
0
Ensure consistency in the use of swizzled shared memory layout

#38 haruhi55 closed 3 months ago
0
The gemm kernel does not use swizzled shared memory layout.

#37 haruhi55 closed 6 months ago
0
Update cutlass version to 3.5.0

#36 KuangjuX closed 6 months ago
0
Add unittest for the copy_s2r macro kernel to ensure correctness.

#35 haruhi55 closed 6 months ago
0
Implement the macro kernel that stores data from register to shared memory.

#34 haruhi55 closed 4 months ago
0
copy data from shared memory into register

#33 haruhi55 closed 6 months ago
0
Implement a small fixed length device array.

#32 haruhi55 closed 6 months ago
0
A gemm implementation that exposes minimal programming concepts.

#31 haruhi55 closed 5 months ago
0
Implement gemm by directly issuing wmma under the macro kernel.

#30 haruhi55 closed 5 months ago
0
Implement data transfer between shared memory and register file using ldmatrix.

#29 haruhi55 closed 6 months ago
0
data transfer between shared memory and register.

#28 haruhi55 closed 6 months ago
0
Reduce the number of warnings in the build.

#27 haruhi55 closed 6 months ago
0
Add multi-staged pipelined GEMM.

#26 haruhi55 opened 6 months ago
0
feat(unittest): Implement basic unittest for transferring 2D data tiles between global and shared memory

#24 haruhi55 closed 7 months ago
0
Incorrect version for the googletest submodule.

#23 haruhi55 closed 7 months ago
0
fix(unittest): bug fix for python unittest.

#22 haruhi55 closed 7 months ago
0
feat(cmake): port googletest into the project

#21 haruhi55 closed 7 months ago
0
Distracting cmake warnings if `PYTHON_EXECUTABLE` is not set when building `torchlib`

#20 haruhi55 closed 7 months ago
0
feat(test): Add test for Lstm Cell kernel.

#19 KuangjuX closed 7 months ago
0
Is it possible to remove the hard-coded path of cuda compiler in cmake?

#18 haruhi55 closed 7 months ago
2
Port gtest into this project.

#17 haruhi55 closed 7 months ago
0
Add flash attention based on b2b GEMM

#16 KuangjuX closed 2 months ago
0
chore: Enable `-Werror` and fix all warnings.

#13 KuangjuX closed 7 months ago
0
Enable `-Werror` and fix warnings.

#12 KuangjuX closed 7 months ago
0
feat(kernel): Add Batched Gemm kernel.

#11 KuangjuX closed 7 months ago
0
chore: fix kernel function parameter type.

#10 KuangjuX closed 7 months ago
0
chore: Add Cuda Info functions.

#9 KuangjuX closed 7 months ago
0
Add test for Lstm Cell.

#8 KuangjuX closed 7 months ago
0
