TiledTensor/TiledCUDA
TiledCUDA is a highly efficient kernel template library designed to elevate CUDA C’s level of abstraction for processing tiles.
MIT License
157 stars
10 forks
issues
improve copy.
#161
haruhi55
opened
4 days ago
0
Discussion: Some examples for consideration.
#160
KuangjuX
opened
4 days ago
0
fix(cell): Re-implement shared tile iterator and fixed all the unittests.
#159
haruhi55
closed
6 days ago
0
Document the Unique Swizzled Shared Memory Layout
#158
haruhi55
opened
1 week ago
0
fix(cell): delete unnecessary code.
#157
haruhi55
closed
1 week ago
0
Analyze performance with ncu.
#156
KuangjuX
opened
2 weeks ago
2
Bug: Some shapes fail to execute in fused gemm.
#155
KuangjuX
opened
2 weeks ago
0
feat(bench): Add benchmark between tiledcuda and cublas in fused gemm.
#154
KuangjuX
closed
2 weeks ago
0
Bug: Cleaning up legacy code causes compilation errors.
#153
KuangjuX
closed
1 week ago
0
refactor(cell): re-implement how data is stored on shared memory to avoid bank conflicts.
#152
haruhi55
closed
2 weeks ago
0
fix(examples): bug fix for gemm's timing.
#151
haruhi55
closed
1 month ago
0
Add an example of convolution
#150
GonChen
opened
1 month ago
0
fix(cell): store tiles to shared memory with no bank-conflicts.
#149
haruhi55
closed
1 month ago
0
chore: Minor code refinements.
#148
haruhi55
closed
1 month ago
0
feat(examples): An example of GEMM leveraging CUDA's three memory hierarchies.
#147
haruhi55
closed
1 month ago
0
fix(scripts): if glog is not installed locally, build it from source.
#146
haruhi55
closed
1 month ago
3
fix(cell): Reduce bank conflicts when accessing shared memory tiles with float data type
#145
haruhi55
closed
1 month ago
1
The `b2b_gemm` Example Fails Tests on A100
#144
haruhi55
opened
2 months ago
1
style(examples): Minor refinement to code organization by placing the kernel in a separate file.
#143
haruhi55
closed
2 months ago
0
Clean up legacy code.
#142
haruhi55
closed
2 weeks ago
0
Enhance the flash attention example to store data in shared memory using a swizzled layout
#141
haruhi55
closed
3 weeks ago
0
Enhance the b2b GEMM example to first store data in shared memory using a swizzled layout.
#140
haruhi55
closed
3 weeks ago
0
Add the example for a fully functional GEMM that utilizes all three levels of memory on a CUDA device.
#139
haruhi55
closed
1 month ago
0
feat(cell): shared to global store for single precision floating point elements.
#138
haruhi55
closed
2 months ago
0
refactor(cell): refactor the register to shared storer.
#137
haruhi55
closed
2 months ago
0
fix(examples): small bug fix.
#136
haruhi55
closed
2 months ago
0
feat(examples): Add a python gemm example.
#135
haruhi55
closed
2 months ago
0
Make the register-to-shared storer support swizzled shared memory
#133
haruhi55
closed
2 months ago
0
chore: Add `TiledFlashAttention` to improve usage and fix `CMakeLists` to add all examples automatically.
#132
KuangjuX
closed
2 months ago
0
Add Latest News to README.
#131
KuangjuX
closed
2 months ago
0
feat(util): Add helper functions to print RegTile.
#130
KuangjuX
closed
2 months ago
0
Add a helper function to print the data processed by registers in a warp.
#129
KuangjuX
closed
2 months ago
0
feat(kernel): Add a pytorch bind for flashattention.
#128
KuangjuX
closed
2 months ago
0
Support for swizzled functions for a BaseTile with single precision floating point elements
#127
haruhi55
closed
2 months ago
0
fix(cell): bug fix for global to shared memory data loader.
#126
haruhi55
closed
3 months ago
0
feat(examples): Add the host-side implementation of FlashAttention and pass the Single Traversal FlashAttention correctness test on the CUDA side.
#125
KuangjuX
closed
3 months ago
0
feat(examples): Improve the implementations of fused two gemms.
#124
haruhi55
closed
3 months ago
0
feat(examples): Add a simple FlashAttention Implementation.
#123
KuangjuX
closed
3 months ago
1
Add a simple FlashAttention based on Back2Back GEMM.
#122
KuangjuX
closed
2 months ago
0
Add a Simple Back2Back GEMM example based on `BaseTile`.
#121
KuangjuX
closed
3 months ago
0
Include the `examples` subdirectory in the root `CMakeLists`.
#120
KuangjuX
closed
2 months ago
0
(feat): Implement swizzled column-major shared memory layout.
#119
haruhi55
closed
3 months ago
0
Add a Back2Back GEMM based on `BaseTile`.
#118
KuangjuX
closed
3 months ago
0
Implement the SharedToGlobal storer for single precision floating point numbers
#117
haruhi55
closed
2 months ago
0
Add shared memory swizzling for column-major layout.
#116
haruhi55
closed
3 months ago
0
fix(cell): Make shared memory swizzling correctly work with TileIterator.
#115
haruhi55
closed
3 months ago
0
fix tile iterator with swizzled shared memory.
#114
haruhi55
closed
3 months ago
0
fix(unittest): fix the GEMM unittest.
#113
haruhi55
closed
3 months ago
0
fix(cell): bug fix for swizzled shared memory layout.
#112
haruhi55
closed
3 months ago
0
feat(cell): Add related element-wise/unary/copy implementation for flash-attn(phase 2)
#111
KuangjuX
closed
3 months ago
0