-
I really like the simplicity of TK and think it could be broadly applicable to kernel authoring beyond attention. Has there been any benchmarking done of pure GEMM operations? If so, an example would …
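As a point of reference for the question above, here is a minimal, hypothetical sketch of how a pure GEMM can be timed and reported in GFLOP/s. It uses NumPy rather than TK (so it measures the host BLAS, not a TK kernel), and the matrix size and iteration count are arbitrary:

```python
import time
import numpy as np

# Hypothetical sketch: time a pure GEMM (C = A @ B) and report GFLOP/s.
def bench_gemm(n=512, iters=5):
    a = np.random.rand(n, n).astype(np.float32)
    b = np.random.rand(n, n).astype(np.float32)
    a @ b  # warm-up so one-time setup cost is excluded
    t0 = time.perf_counter()
    for _ in range(iters):
        a @ b
    dt = (time.perf_counter() - t0) / iters
    flops = 2 * n ** 3  # one multiply and one add per inner-product term
    return flops / dt / 1e9

print(f"{bench_gemm():.1f} GFLOP/s")
```

The same warm-up/average-over-iterations structure applies when benchmarking a GPU GEMM, with the addition of device synchronization around the timed region.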
-
### 🐛 Describe the bug
Compiling the flash attention CUDA kernels consumes a very large amount of RAM. For example, on my machine, compiling `aten/src/ATen/native/transformers/cuda/flash_attn/kernels/flash_bwd_h…
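One minimal, hypothetical way to quantify the memory usage on Linux is to run the compile as a child process and read the peak resident set size reported for children; the command below is a stand-in, not the actual nvcc invocation:

```python
import resource
import subprocess
import sys

# Hypothetical sketch (Unix-only): run a command as a child process and
# return the peak RSS (ru_maxrss, KiB on Linux) across all terminated children.
def peak_child_rss_kib(cmd):
    subprocess.run(cmd, check=True)
    return resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss

# Stand-in command; substitute the real compiler command line to measure it.
print(peak_child_rss_kib([sys.executable, "-c", "pass"]))
```

Equivalently, `/usr/bin/time -v <compile command>` prints "Maximum resident set size" directly.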
-
### Summary of Problem
The following code produces the error "gpu-nvidia.c:292: Error calling CUDA function: an illegal memory access was encountered".
```chapel
const D = {0..
-
**Output of 'strings libarm_compute.so | grep arm_compute_version':**
arm_compute_version=v23.11 Build options: {'Werror': '0', 'debug': '0', 'neon': '1', 'opencl': '0', 'embed_kernels': '0', 'os…
-
### System Info
Ubuntu 20.04
tensorrt 10.0.1
tensorrt-cu12 10.0.1
tensorrt-cu12-bindings 10.0.1
tensorrt-cu12-libs 10.0.1
tensorrt-llm 0.10.…
-
## Enhancement
@trilinos/ifpack2 @csiefer2 @srajama1 @vqd8a
Ifpack2's native implementation of RILUK depends on UVM when compiled for CUDA. Some of the associated unit tests have been modified to …
-
### Summary
Last year, we released [pytorch-labs/torchao](https://github.com/pytorch-labs/ao) to provide acceleration of Generative AI models using native PyTorch techniques. Torchao added support …
-
Steps to reproduce this issue:
1. Install Clear Linux (I'm currently on 27320)
2. `sudo swupd bundle-add kernel-iot-lts2018`
3. Check kernels available (note: my system was recently updated so I ha…
-
Expected release date: Mar 15th, 2024
# General
1. [x] Support general page table layout (@yzh119)
2. [ ] sm70/75 compatibility (@yzh119)
3. [ ] performance: using fp16 as intermediate data ty…
-
### 🐛 Describe the bug
I gathered all 10K Triton kernels generated by Inductor using a stack of PRs ( https://github.com/pytorch/pytorch/pull/120048 ). After deduping identical kernels used by different …
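A minimal sketch of the kind of dedup step described above, assuming kernels are compared by exact source text; the kernel names and bodies below are made up for illustration:

```python
import hashlib

# Hypothetical sketch: keep one representative per distinct kernel source,
# keyed by a content hash of the generated code.
def dedupe_kernels(sources):
    representatives = {}
    for name, src in sorted(sources.items()):
        key = hashlib.sha256(src.encode()).hexdigest()
        # First kernel seen with this source wins; later duplicates are dropped.
        representatives.setdefault(key, name)
    return sorted(representatives.values())

# Made-up example: k1 and k2 share identical source, so only one survives.
kernels = {
    "triton_k1": "def kernel(x): return x + 1",
    "triton_k2": "def kernel(x): return x + 1",
    "triton_k3": "def kernel(x): return x * 2",
}
print(dedupe_kernels(kernels))  # → ['triton_k1', 'triton_k3']
```

Hashing the full source is exact-match dedup only; kernels that differ in whitespace or autotuning metadata would need normalization first.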