-
As turbine LLM (and the different quantization schemes implemented and lowered as part of it) ramps up, the CPU backend needs to be able to compile and execute sample kernels in a reasonable time frame. Se…
-
Currently RAPIDS libraries support static linkage to cudart via a CMake flag `CUDA_STATIC_RUNTIME`. This flag is leveraged by wheel builds and by the Spark-RAPIDS JNI (specifically for cudf), but it i…
vyasr updated
6 months ago
-
### Summary
Last year, we released [pytorch-labs/torchao](https://github.com/pytorch-labs/ao) to provide acceleration of Generative AI models using native PyTorch techniques. Torchao added support …
-
## 🐛 Bug
This is a lengthy issue/post detailing my observations with our distributed and bucketing performance. Some of these are actionable items and some are just observations to be aware of.
…
-
Hello, Kernel Developers!
I hope that somebody can help. :)
I'm trying to flash the kernel I built for ishtar (Xiaomi 13 Ultra).
I have fixed a few errors that I got with the Xiaomi GitHub r…
-
Currently, the screen will freeze if a user runs a large cutde calculation on the GPU that also drives their monitors. This is avoided by chunking the calculation in `disp_blocks` and `disp_aca`. Chun…
-
It is not clear to me what the recommended way is to combine multiple computation kernels into one larger operation and reuse the code. For example, if I have blur and gradient implemented separately…
-
### Describe the enhancement requested
Hello. While implementing join operation support for the Dictionary type, I encountered the following message.
I am attempting to support the Dictionary ty…
-
I am currently testing CP2K on the new CSCS machines with GH200 chips. In most cases, DBCSR behaves well (e.g. with the `benchmarks/QS/H2O-XXX.inp` tests). However, when large block sizes are involved…
-
This design review encompasses the ongoing work in the branch https://github.com/ammarhakim/gkylzero/tree/gk-g0-app-gpu-hack2024 for the GPU hackathon.
The main restructuring of the code is a targ…