TiledCUDA is a highly efficient kernel template library designed to elevate CUDA C’s level of abstraction for processing tiles.
feat(cell): Add related element-wise/unary/copy implementation for flash-attn(phase 2) #111
Closed
KuangjuX closed 3 months ago
This PR may not be fully correct yet; further checks and tests will follow in subsequent PRs.