-
Consider implementing the Liger Kernels, which have been shown to yield large memory savings.
- RoPE: 3X speedup with ~3X peak memory reduction.
- SwiGLU: 1.5X peak memory reduction
- Cross Entropy: >4X…
-
**Is your feature request related to a problem? Please describe.**
No problem, just the addition of an interesting backbone to the timm library.
**Describe the solution you'd like**
Addition of Uni…
-
### Description
We do not have support for fp32 accumulation in the sdpa family of kernels. This becomes a problem when the number of chunks gets large and the pcc diverges from ground truth. For models that …
-
### Describe the enhancement requested
Is there a larger plan to start adding compute kernels for the binary view types? I see dedicated issues like https://github.com/apache/arrow/issues/43010 but I…
-
## Description
Consider adding an additional FusedCrossEntropyLoss kernel to the FOAK set of kernels, given the additional improvement seen using it in earlier tests (see Background below).
Considerati…
-
Hi, thank you for the great work and effort.
Current kernels seem to support only dimensions of 7B models with hidden dimension 4096.
How can I extend it for larger models like Llama-30B or 65B?
It …
-
Using tiny tiles with bfp8 format output, the matmul gives a first row of all 0s (16 0s).
The dest register has the correct value, implying there is a problem with the packing.
**To Reproduce**
Branch : …
-
In ROCm compilers as of early 2024, the compiler force-inlines *everything*.
While generally nice, this can be problematic for very large kernels at both compile time and runtime, if we actually want to…
-
Some programs have issues when running on kernels with larger page sizes.
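
As a minimal sketch of the general pattern behind such issues (not taken from the issue itself), the snippet below queries the page size from the running kernel rather than assuming 4096 bytes. The helper names `runtime_page_size` and `round_up_to_page` are hypothetical and only for illustration; `os.sysconf("SC_PAGE_SIZE")` is the standard Python call on Unix-like systems.

```python
import os


def runtime_page_size() -> int:
    """Return the page size reported by the running kernel (Unix-like systems)."""
    return os.sysconf("SC_PAGE_SIZE")


def round_up_to_page(nbytes: int, page_size: int) -> int:
    """Round nbytes up to a whole number of pages of the given size."""
    # Ceiling division, then scale back up to bytes.
    return -(-nbytes // page_size) * page_size


if __name__ == "__main__":
    page = runtime_page_size()
    # The same 5000-byte request needs a different allocation size
    # depending on the page size the kernel was built with.
    print("page size:", page)
    print("5000 bytes rounds up to:", round_up_to_page(5000, page))
```

A program that hardcodes a 4 KiB page at build time would compute the wrong alignment on a 16 KiB or 64 KiB kernel, which is why the value is read at runtime in this sketch.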
### jemalloc
One common case is programs (especially Rust programs) that use jemalloc. This was fixed in #48194 for the j…
-
I really love this project and the accompanying blogpost, so thanks! I've reimplemented some of the inference techniques to speed up an implementation of Whisper that I am using. I had a few questions…