TiledTensor/TiledCUDA
TiledCUDA is a highly efficient kernel template library designed to elevate CUDA C’s level of abstraction for processing tiles.
MIT License · 114 stars · 9 forks
issues
feat(cell): shared to global store for single precision floating point elements.
#138
haruhi55
opened
5 days ago
0
refactor(cell): refactor the register to shared storer.
#137
haruhi55
closed
5 days ago
0
fix(examples): small bug fix.
#136
haruhi55
closed
1 week ago
0
feat(examples): Add a python gemm example.
#135
haruhi55
closed
1 week ago
0
Make register to shared storer support for swizzled shared memory
#133
haruhi55
opened
1 week ago
0
chore: Add `TiledFlashAttention` to improve usage and fix `CMakeLists` to add all examples automatically.
#132
KuangjuX
closed
2 weeks ago
0
Add Latest News in README.
#131
KuangjuX
closed
2 weeks ago
0
feat(util): Add helper functions to print RegTile.
#130
KuangjuX
closed
2 weeks ago
0
Add a helper function to print the data processed by registers in a warp.
#129
KuangjuX
closed
2 weeks ago
0
feat(kernel): Add a pytorch bind for flashattention.
#128
KuangjuX
closed
2 weeks ago
0
Support for swizzled functions for a BaseTile with single precision floating point elements
#127
haruhi55
opened
3 weeks ago
0
fix(cell): bug fix for global to shared memory data loader.
#126
haruhi55
closed
3 weeks ago
0
feat(examples): Add the Host side implementation of FlashAttention and pass Single Traversal FlashAttention correctness in CUDA side.
#125
KuangjuX
closed
3 weeks ago
0
feat(examples): Improve the implementations of fused two gemms.
#124
haruhi55
closed
1 month ago
0
feat(examples): Add a simple FlashAttention Implementation.
#123
KuangjuX
closed
3 weeks ago
1
Add a simple FlashAttention based on Back2Back GEMM.
#122
KuangjuX
closed
1 week ago
0
Add a Simple Back2Back GEMM example based on `BaseTile`.
#121
KuangjuX
closed
1 month ago
0
Include the `examples` subdirectory in the root `CMakeLists`.
#120
KuangjuX
closed
2 weeks ago
0
(feat): Implement swizzled column-major shared memory layout.
#119
haruhi55
closed
1 month ago
0
Add a Back2Back GEMM based on `BaseTile`.
#118
KuangjuX
closed
1 month ago
0
Implement the SharedToGlobal storer for single precision floating point numbers
#117
haruhi55
opened
1 month ago
0
Add shared memory swizzling for column-major layout.
#116
haruhi55
closed
1 month ago
0
fix(cell): Make shared memory swizzling correctly work with TileIterator.
#115
haruhi55
closed
1 month ago
0
fix tile iterator with swizzled shared memory.
#114
haruhi55
closed
1 month ago
0
fix(unittest): fix the GEMM unittest.
#113
haruhi55
closed
1 month ago
0
fix(cell): bug fix for swizzled shared memory layout.
#112
haruhi55
closed
1 month ago
0
feat(cell): Add related element-wise/unary/copy implementation for flash-attn(phase 2)
#111
KuangjuX
closed
1 month ago
0
refactor(cell): Refactor global to shared Tile transfer on basis of `BaseTile`
#110
haruhi55
closed
1 month ago
0
feat(cell): Add related element-wise/unary/copy implementation for flash-attn(phase 1)
#109
KuangjuX
closed
1 month ago
0
feat(cell): Add a simple `Broadcast` implementation between `RegTile` and reduce tile.
#108
KuangjuX
closed
1 month ago
0
Broadcast the Reduce results into the `RegTile`.
#107
KuangjuX
closed
1 month ago
1
feat(cell): Add the implementation of the computations required for FlashAttention, besides the GEMM computation.
#106
KuangjuX
closed
1 month ago
0
feat(cell): Add a Column-Major reduce implementation.
#105
KuangjuX
closed
1 month ago
0
Re-design and Re-implement the swizzled shared memory layout.
#104
haruhi55
closed
1 month ago
1
refactor(layout): propagate swizzled shared memory layout by `TileIterator`.
#103
haruhi55
closed
1 month ago
0
feat(cell): Add a row major softmax implementation in a single warp.
#102
KuangjuX
closed
1 month ago
0
feat(cell): Add a Reg level reduce based on `RegTile`.
#101
KuangjuX
closed
1 month ago
0
A buggy implementation of the TileIterator.
#100
haruhi55
opened
1 month ago
2
Add Warp Reduce based on `RegTile`.
#99
KuangjuX
closed
1 month ago
1
feat(unittest): Add unittest to ensure the correctness of swizzled layout.
#98
haruhi55
closed
1 month ago
0
feat(cell): Hide CuTe's layout inside macro kernel's implementations.
#97
haruhi55
closed
1 month ago
0
feat(cell): wrap CuTe's layout and hide CuTe's layout inside macro kernel's implementation
#96
haruhi55
closed
1 month ago
0
fix(unittest): Update the GEMM unittest to use the new global to shared tile transfer.
#95
haruhi55
closed
1 month ago
0
feat(util): Add helper function for CUDA timer.
#94
haruhi55
closed
1 month ago
0
Refactor(README): Refactor the README to adapt to the current code implementation.
#93
KuangjuX
closed
1 month ago
1
fix(examples): Enhance the GEMM example to process large input matrices using `TileIterator`.
#92
haruhi55
closed
1 month ago
0
Refactor: Design the Swizzled Layout transformation and add a Warp-based Swizzled Thread Layout.
#91
KuangjuX
closed
1 month ago
1
Design the Swizzled Layout transformation and add a Warp-based Swizzled Thread Layout.
#90
KuangjuX
closed
1 month ago
0
Refactor(cell): Implement data tile from Global To Shared.
#89
KuangjuX
closed
1 month ago
0
feat(examples): Add the hello world example for gemm.
#88
haruhi55
closed
1 month ago
0