TiledTensor / TiledCUDA

TiledCUDA is a highly efficient kernel template library designed to elevate CUDA C’s level of abstraction for processing tiles.
MIT License

feat(cell): Add related element-wise/unary/copy implementation for flash-attn(phase 2) #111

Closed by KuangjuX 3 months ago

KuangjuX commented 3 months ago

This PR may not be fully correct yet; further checks and tests will follow in subsequent PRs.