-
It would be nice to have FFT-based convolution supported in mlx. FFT-based convolution performs much better for large images / arrays and kernels. The FFT building blocks are already support…
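For illustration, here is a minimal sketch of the FFT-based convolution being requested. It uses NumPy rather than mlx, and `fft_convolve_1d` is a hypothetical helper name, not an existing API in either library. Both inputs are zero-padded to the full output length so that the circular convolution computed through the FFT matches the linear one.

```python
import numpy as np

def fft_convolve_1d(signal, kernel):
    """Full linear convolution of two real 1-D arrays via FFT (illustrative helper)."""
    # Length of the full linear convolution output.
    n = len(signal) + len(kernel) - 1
    # rfft zero-pads each input to length n; multiplying the spectra
    # corresponds to convolving the padded inputs in the time domain.
    spectrum = np.fft.rfft(signal, n) * np.fft.rfft(kernel, n)
    # Transform back to a real signal of length n.
    return np.fft.irfft(spectrum, n)
```

The result should agree with `np.convolve(signal, kernel)` up to floating-point error. The cost is O(n log n) rather than O(n·k) for a direct convolution, which is where the advantage for large kernels comes from.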
-
I'm currently using cutlass in my project. I found that some cases constrain the layout, for example requiring input matrix A and output matrix C to be row-major. These kinds of assumptions limit the feasibi…
-
### Feature request
Integrate the Liger (LinkedIn GPU Efficient Runtime) Kernel into the HuggingFace Trainer, so that users can decide whether to enable the kernel with a simple flag.
### Motivation
Liger (Linkedi…
-
PR https://github.com/intel/intel-xpu-backend-for-triton/pull/1654 introduced automatic use of large GRF mode.
Could we make the `cout` in these lines be triggered by a debug-only flag…
-
### 🐛 Describe the bug
-- I understand I still have to respond to my PR on kernel creation speed (sorry about that!) - but I found another problem when trying to convolve 96kHz 2 sec long room impuls…
-
**Describe the proposal**
We should automatically add `-mllvm -amdgpu-function-calls=true` to the compiler flags when `-DAMREX_GPU_BACKEND=HIP` (AMD GPUs). This works around compiler bugs for large G…
-
## 🐛 Bug
When using `Conv1d` with a large kernel size (1024 for instance) on GPU, the cuDNN implementation is very slow and gets slower as I increase the kernel size. I thought it was using FFT but…
-
As part of the drive towards implementing the micro-kernel parquet decoding strategy, we would like to start centralizing the core parquet decoding loop into a generic templated implementation that ca…
-
### 🚀 The feature, motivation and pitch
MSCCL++ redefines inter-GPU communication interfaces, offering a highly efficient and customizable communication stack tailored for distributed GPU application…
-
We have seen patterns where small `cudaMemcpyAsync` calls collide with large `cudaMemcpyAsync` calls being handled by the copy engine. Importantly, the small copy is in a different stream than the large copy. In…