-
## Description
Consider adding an additional FusedCrossEntropyLoss kernel to the FOAK set of kernels, given the additional improvement seen when using it in earlier tests (see Background below).
Considerati…
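For readers unfamiliar with what such a kernel fuses, a minimal sketch of the underlying computation follows: cross-entropy evaluated directly from logits in one pass (log-sum-exp plus negative log-likelihood), without materializing the softmax output. This is illustrative only; the actual FOAK kernel performs this fused on the GPU, and the function name here is hypothetical.

```python
import math

def fused_cross_entropy(logits, target):
    """Cross-entropy from raw logits in a single pass.

    `logits` is a list of floats, `target` the index of the true class.
    A fused kernel computes this without writing out the softmax; this
    plain-Python version just shows the math being fused.
    """
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[target]  # -log softmax(logits)[target]

loss = fused_cross_entropy([2.0, 0.5, -1.0], target=0)
```

Fusing these steps avoids one full read/write of the logits tensor, which is where the observed speedup comes from.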
-
Currently CAGRA supports PQ compression with `pq_len=2` and `pq_len=4`. A larger compression ratio can be achieved if we allow larger `pq_len` values, e.g. 8 and 16.
`pq_len` is a [template paramete…
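The compression-ratio arithmetic can be sketched as below. This assumes fp32 input vectors and 8-bit PQ codebooks (one byte per encoded subvector), which is a common default; the function name and defaults are illustrative, not CAGRA API.

```python
def pq_compression_ratio(dim, pq_len, bytes_per_code=1, dtype_bytes=4):
    """Ratio of raw vector size to PQ-encoded size.

    Each `pq_len`-dimensional subvector is replaced by a single code
    (1 byte for 8-bit codebooks). Assumes dim % pq_len == 0.
    """
    n_subspaces = dim // pq_len
    return (dim * dtype_bytes) / (n_subspaces * bytes_per_code)

# For 128-dim fp32 vectors: pq_len=2 -> 8x, 4 -> 16x, 8 -> 32x, 16 -> 64x
for pq_len in (2, 4, 8, 16):
    print(pq_len, pq_compression_ratio(128, pq_len))
```

Doubling `pq_len` halves the number of codes per vector, so each supported doubling directly doubles the compression ratio.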
-
Some programs have issues when running on kernels with larger page sizes.
### jemalloc
One common case is programs (especially Rust programs) that use jemalloc. This was fixed in #48194 for the j…
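For context on why this breaks: jemalloc's page size is fixed at build time (via `--with-lg-page`), and a binary whose jemalloc was built for smaller pages than the running kernel uses will fail at startup. A small sketch of checking the runtime page size, assuming a POSIX system:

```python
import os

# Runtime page size of the kernel this process is running on.
# jemalloc's page size is a compile-time constant (--with-lg-page);
# if it is smaller than this value, jemalloc-linked binaries abort
# at startup rather than misbehave silently.
page_size = os.sysconf("SC_PAGESIZE")
lg_page = page_size.bit_length() - 1  # e.g. 12 for 4 KiB, 16 for 64 KiB
print(page_size, lg_page)
```

Distributions targeting 64 KiB-page kernels therefore need jemalloc built with a matching (or larger) `--with-lg-page` value.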
-
As of early 2024, the ROCm compiler force-inlines *everything*.
While generally nice, this can be problematic for very large kernels at both compile time and runtime, if we actually want to…
-
I really love this project and the accompanying blogpost, so thanks! I've reimplemented some of the inference techniques to speed up an implementation of Whisper that I am using. I had a few questions…
-
Here is my understanding of the existing state of things and what I think we should be doing to make our lower-bit kernels more performant at both small and larger batch sizes. I'm making this an RFC …
-
**What is your question?**
Hello!
I’ve been exploring the Cutlass examples for GEMM and Convolution and noticed the use of double buffering.
https://developer.nvidia.com/blog/cutlass-linear-algebra-…
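To make the double-buffering pattern concrete, here is a sequential Python stand-in for it: while tile *k* is being consumed, tile *k+1* is staged into the other of two ping-pong buffers. In CUTLASS the "load" is an asynchronous global-to-shared-memory copy that overlaps with the math; this sketch only models the buffer choreography, and all names are illustrative.

```python
def double_buffered_sum(tiles):
    """Process tiles through two staging buffers (ping-pong).

    Loads tile k+1 into one buffer while "computing" on tile k in the
    other, mirroring how a GPU pipeline overlaps copies with math.
    """
    if not tiles:
        return 0
    bufs = [None, None]
    bufs[0] = list(tiles[0])  # prologue: prefetch the first tile
    total, cur = 0, 0
    for k in range(len(tiles)):
        nxt = cur ^ 1
        if k + 1 < len(tiles):
            bufs[nxt] = list(tiles[k + 1])  # "async" load of next tile
        total += sum(bufs[cur])             # compute on current tile
        cur = nxt                           # swap buffers
    return total

print(double_buffered_sum([[1, 2], [3, 4], [5, 6]]))  # 21
```

The prologue load before the main loop is the key structural feature: it keeps the compute stage one tile behind the load stage, so neither ever waits on an empty buffer.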
-
When employing the pocoMC package for Bayesian inference runs using tellurium for modeling, we have encountered issues with parallelization. Using multiprocess(ing), we noticed a very large discrepancy …
-
### 🚀 The feature, motivation and pitch
I propose implementing int8 quantization support for vLLM, focusing initially on the KV cache. This feature will allow users to run larger models or increase b…
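The core arithmetic of the proposed scheme can be sketched as symmetric per-tensor int8 quantization, which is the approach commonly used for KV-cache compression: a single scale maps the tensor into [-127, 127] codes. This is a plain-Python illustration under that assumption, not vLLM's actual kernel code.

```python
def quantize_int8(values):
    """Symmetric per-tensor int8 quantization.

    scale = max|x| / 127; codes are clamped to [-127, 127]. Halves KV
    memory vs fp16 (quarters it vs fp32) at the cost of rounding error.
    """
    scale = max(abs(v) for v in values) / 127.0 or 1.0  # guard all-zero input
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale

def dequantize_int8(q, scale):
    """Recover approximate floats from int8 codes."""
    return [c * scale for c in q]

vals = [0.02, -1.5, 0.75, 3.0]
q, s = quantize_int8(vals)
approx = dequantize_int8(q, s)
```

The per-element error is bounded by half a quantization step (`scale / 2`), which is the accuracy/memory trade-off the proposal would need to evaluate on real KV tensors.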
-
## Description
We need to create an AWS architecture that meets our requirements for SPICE processing.
## Requirements
- Nail down which kernels will be delivered from MOC (MOC -> POC -> SDC)
- Low …