TiledTensor TiledCUDA issues

TiledTensor / TiledCUDA

TiledCUDA is a highly efficient kernel template library designed to elevate CUDA C’s level of abstraction for processing tiles.

MIT License

157 stars, 10 forks

issues

improve copy.

#161 haruhi55 opened 3 days ago
0
Discussion: Some examples for consideration.

#160 KuangjuX opened 4 days ago
0
fix(cell): Re-implement shared tile iterator and fixed all the unittests.

#159 haruhi55 closed 6 days ago
0
Document the Unique Swizzled Shared Memory Layout

#158 haruhi55 opened 1 week ago
0
fix(cell): delete unnecessary codes.

#157 haruhi55 closed 1 week ago
0
Analyze performance with ncu.

#156 KuangjuX opened 2 weeks ago
2
Bug: Some shapes fail to execute in fused gemm.

#155 KuangjuX opened 2 weeks ago
0
feat(bench): Add benchmark between tiledcuda and cublas in fused gemm.

#154 KuangjuX closed 2 weeks ago
0
Bug: Cleaning up legacy code causes compilation errors.

#153 KuangjuX closed 1 week ago
0
refactor(cell): re-implement how data is stored on shared memory to avoid bank conflicts.

#152 haruhi55 closed 2 weeks ago
0
fix(examples): bug fix for gemm's timing.

#151 haruhi55 closed 1 month ago
0
Add an example of convolution

#150 GonChen opened 1 month ago
0
fix(cell): store tiles to shared memory with no bank-conflicts.

#149 haruhi55 closed 1 month ago
0
chore: Minor code refinements.

#148 haruhi55 closed 1 month ago
0
feat(examples): An example of GEMM leveraging CUDA's three memory hierarchies.

#147 haruhi55 closed 1 month ago
0
fix(scripts): if glog is not installed locally, build it from source.

#146 haruhi55 closed 1 month ago
3
fix(cell): Reduce bank conflicts when accessing shared memory tiles with float data type

#145 haruhi55 closed 1 month ago
1
The `b2b_gemm` Example Fails Tests on A100

#144 haruhi55 opened 2 months ago
1
style(examples): Minor refinement to code organization by placing the kernel in a separate file.

#143 haruhi55 closed 2 months ago
0
Clean up legacy code.

#142 haruhi55 closed 2 weeks ago
0
Enhance the flash attention example to store data in shared memory using a swizzled layout

#141 haruhi55 closed 3 weeks ago
0
Enhance the b2b GEMM example to first store data in shared memory using a swizzled layout.

#140 haruhi55 closed 3 weeks ago
0
Add the example for a fully functional GEMM that utilizes all three levels of memory on a CUDA device.

#139 haruhi55 closed 1 month ago
0
feat(cell): shared to global store for single precision floating point elements.

#138 haruhi55 closed 2 months ago
0
refactor(cell): refactor the register to shared storer.

#137 haruhi55 closed 2 months ago
0
fix(examples): small bug fix.

#136 haruhi55 closed 2 months ago
0
feat(examples): Add a python gemm example.

#135 haruhi55 closed 2 months ago
0
Make register to shared storer support for swizzled shared memory

#133 haruhi55 closed 2 months ago
0
chore: Add `TiledFlashAttention` to improve usage and fix `CMakeLists` to add all examples automatically.

#132 KuangjuX closed 2 months ago
0
Add Latest News in README.

#131 KuangjuX closed 2 months ago
0
feat(util): Add helper functions to print RegTile.

#130 KuangjuX closed 2 months ago
0
Add a helper function to print the data processed by registers in a warp.

#129 KuangjuX closed 2 months ago
0
feat(kernel): Add a pytorch bind for flashattention.

#128 KuangjuX closed 2 months ago
0
Support for swizzled functions for a BaseTile with single precision floating point elements

#127 haruhi55 closed 2 months ago
0
fix(cell): bug fix for global to shared memory data loader.

#126 haruhi55 closed 3 months ago
0
feat(examples): Add the host-side implementation of FlashAttention and pass Single Traversal FlashAttention correctness on the CUDA side.

#125 KuangjuX closed 3 months ago
0
feat(examples): Improve the implementations of fused two gemms.

#124 haruhi55 closed 3 months ago
0
feat(examples): Add a simple FlashAttention Implementation.

#123 KuangjuX closed 3 months ago
1
Add a simple FlashAttention based on Back2Back GEMM.

#122 KuangjuX closed 2 months ago
0
Add a Simple Back2Back GEMM example based on `BaseTile`.

#121 KuangjuX closed 3 months ago
0
Include the `examples` subdirectory in the root `CMakeLists`.

#120 KuangjuX closed 2 months ago
0
(feat): Implement swizzled column-major shared memory layout.

#119 haruhi55 closed 3 months ago
0
Add a Back2Back GEMM based on `BaseTile`.

#118 KuangjuX closed 3 months ago
0
Implement the SharedToGlobal storer for single precision floating point numbers

#117 haruhi55 closed 2 months ago
0
Add shared memory swizzling for column-major layout.

#116 haruhi55 closed 3 months ago
0
fix(cell): Make shared memory swizzling correctly work with TileIterator.

#115 haruhi55 closed 3 months ago
0
fix tile iterator with swizzled shared memory.

#114 haruhi55 closed 3 months ago
0
fix(unittest): fix the GEMM unittest.

#113 haruhi55 closed 3 months ago
0
fix(cell): bug fix for swizzled shared memory layout.

#112 haruhi55 closed 3 months ago
0
feat(cell): Add related element-wise/unary/copy implementation for flash-attn(phase 2)

#111 KuangjuX closed 3 months ago
0