TiledTensor/TiledCUDA
TiledCUDA is a highly efficient kernel template library designed to elevate CUDA C’s level of abstraction for processing tiles.
MIT License
157 stars
10 forks
issues
refactor(cell): Refactor global to shared Tile transfer on basis of `BaseTile`
#110
haruhi55
closed
3 months ago
0
feat(cell): Add related element-wise/unary/copy implementation for flash-attn(phase 1)
#109
KuangjuX
closed
3 months ago
0
feat(cell): Add a simple `Broadcast` implementation between `RegTile` and reduce tile.
#108
KuangjuX
closed
3 months ago
0
Broadcast the Reduce results into the `RegTile`.
#107
KuangjuX
closed
3 months ago
1
feat(cell): Add the implementation of the computations required for FlashAttention, besides the GEMM computation.
#106
KuangjuX
closed
3 months ago
0
feat(cell): Add a Column-Major reduce implementation.
#105
KuangjuX
closed
3 months ago
0
Re-design and Re-implement the swizzled shared memory layout.
#104
haruhi55
closed
3 months ago
1
refactor(layout): propagate swizzled shared memory layout by `TileIterator`.
#103
haruhi55
closed
3 months ago
0
feat(cell): Add a row major softmax implementation in a single warp.
#102
KuangjuX
closed
3 months ago
0
feat(cell): Add a Reg level reduce based on `RegTile`.
#101
KuangjuX
closed
3 months ago
0
A buggy implementation of the TileIterator.
#100
haruhi55
closed
2 weeks ago
2
Add Warp Reduce based on `RegTile`.
#99
KuangjuX
closed
3 months ago
1
feat(unittest): Add unittest to ensure the correctness of swizzled layout.
#98
haruhi55
closed
3 months ago
0
feat(cell): Hide CuTe's layout inside macro kernel's implementations.
#97
haruhi55
closed
3 months ago
0
feat(cell): wrap CuTe's layout and hide CuTe's layout inside macro kernel's implementation
#96
haruhi55
closed
3 months ago
0
fix(unittest): Update the GEMM unittest to use the new global to shared tile transfer.
#95
haruhi55
closed
3 months ago
0
feat(util): Add helper function for CUDA timer.
#94
haruhi55
closed
3 months ago
0
Refactor(README): Refactor the README to adapt to the current code implementation.
#93
KuangjuX
closed
3 months ago
1
fix(examples): Enhance the GEMM example to process large input matrices using `TileIterator`.
#92
haruhi55
closed
4 months ago
0
Refactor: Design the Swizzled Layout transformation and add a Warp-based Swizzled Thread Layout.
#91
KuangjuX
closed
3 months ago
1
Design the Swizzled Layout transformation and add a Warp-based Swizzled Thread Layout.
#90
KuangjuX
closed
3 months ago
0
Refactor(cell): Implement data tile from Global To Shared.
#89
KuangjuX
closed
4 months ago
0
feat(examples): Add the hello world example for gemm.
#88
haruhi55
closed
4 months ago
0
chore: rename `SharedTileIterator` into `TileIterator`
#87
haruhi55
closed
4 months ago
0
feat(tests): add unittest for gemm.
#86
haruhi55
closed
4 months ago
0
Rename `SharedTileIterator` to `TileIterator` to reduce redundancy
#85
haruhi55
closed
4 months ago
2
feat(cell): Add unit tests for GEMM.
#84
haruhi55
closed
4 months ago
1
refactor(cell): Refactor GEMM according to the changes in the RegTile definition.
#83
haruhi55
closed
4 months ago
0
Refactor: Implement data tile from Global to Shared based on `BaseTile`.
#82
KuangjuX
closed
4 months ago
0
chore: clean up namespace usages.
#81
haruhi55
closed
4 months ago
0
feat(cell): Implement storing tiles from register to global memory.
#80
KuangjuX
closed
4 months ago
0
feat(cell): Implement loading a column-major tile from global to register.
#79
KuangjuX
closed
4 months ago
0
chore: Add more Warp Reuse tests for G2Reg Loading.
#78
KuangjuX
closed
4 months ago
0
fix(cell): Fix the warp tile offset computation.
#77
haruhi55
closed
4 months ago
0
check diff.
#76
haruhi55
closed
4 months ago
0
Refactor(cell): Move Warp-related code into a new file.
#75
KuangjuX
closed
4 months ago
0
refactor(cell): Refactor Row-Major GlobalToReg Copy Plan.
#74
KuangjuX
closed
4 months ago
2
refactor(cell): Refactor shared to register loading using ldmatrix.
#73
haruhi55
closed
4 months ago
0
refactor(cell): delete the gemm unittest to facilitate the refactor.
#72
haruhi55
closed
4 months ago
0
add a new register tile implementation.
#71
haruhi55
closed
4 months ago
0
Add a simple BaseTile implementation.
#70
KuangjuX
closed
4 months ago
0
feat: Add a column-major load functor from global memory to reg.
#69
KuangjuX
closed
4 months ago
0
refactor(copy): refine the implementation of shared to register copy.
#68
haruhi55
closed
4 months ago
0
How to define the layout of register tile?
#67
haruhi55
closed
3 months ago
2
Provide a complete GEMM example.
#66
haruhi55
closed
2 months ago
2
Update CUDA Library search from `FindCUDA` to `FindCUDAToolkit`
#65
haruhi55
closed
4 months ago
1
feat: Add a Load data tile device function from global memory to register.
#64
KuangjuX
closed
4 months ago
0
Improve code quality by addressing compilation warnings
#63
haruhi55
closed
4 months ago
0
Download and build glog if it is not installed locally.
#62
haruhi55
closed
1 month ago
0
Requires a criterion and a definition for BaseTile to utilize the hardware capabilities.
#61
haruhi55
closed
4 months ago
1