large-kernels Search Results

1000+ results for large-kernels

ml-explore/mlx #811

[Feature] Support fft based convolution

It would be nice to have FFT-based convolution supported in mlx. FFT-based convolution shows much better performance for large images / arrays and kernels. The FFT building blocks are already support…

adonath updated 3 weeks ago
10 comments

NVIDIA/cutlass #1780

[FEA] transpose in epilogue/prologue

I'm now using CUTLASS in my project. I found that some cases place constraints on the layout, for example requiring input matrix A and output matrix C to be row-major. These kinds of assumptions limit the feasibi…

xiaonans updated 1 week ago
5 comments

huggingface/transformers #32861

Integrate Liger (Linkedin GPU Efficient Runtime) Kernel to H…

Feature request: Integrate the Liger (LinkedIn GPU Efficient Runtime) Kernel into the HuggingFace Trainer, so that users can decide whether to enable the kernel with a simple flag. Motivation: Liger (Linkedi…

JasonZhu1313 updated 3 days ago
4 comments

intel/intel-xpu-backend-for-triton #2251

[Feature Improvement] Change large GRF warnings to trigger b…

PR https://github.com/intel/intel-xpu-backend-for-triton/pull/1654 introduced automatic use of large GRF mode. Could we make the `cout` in these lines be triggered by a debug-only flag…

Stonepia updated 5 days ago
1 comment

pytorch/audio #2469

large resampling kernels slow ALSO on the forward pass

🐛 Describe the bug: I understand I still have to respond to my PR on kernel creation speed (sorry about that!), but I found another problem when trying to convolve 96 kHz, 2-second-long room impuls…

xvdp updated 2 years ago
1 comment

quokka-astro/quokka #569

set `-mllvm -amdgpu-function-calls=true` for HIP builds

**Describe the proposal** We should automatically add `-mllvm -amdgpu-function-calls=true` to the compiler flags when `-DAMREX_GPU_BACKEND=HIP` (AMD GPUs). This works around compiler bugs for large G…

BenWibking updated 1 month ago
4 comments

pytorch/pytorch #21462

Slow convolution with large kernels, should be using FFT

🐛 Bug: When using `Conv1d` with a large kernel size (1024, for instance) on GPU, the cuDNN implementation is very slow and gets slower as I increase the kernel size. I thought it was using FFT but…

adefossez updated 3 years ago
12 comments

rapidsai/cudf #14953

[FEA] Implement a templated parquet decoding kernel suitable…

As part of the drive towards implementing the micro-kernel parquet decoding strategy, we would like to start centralizing the core parquet decoding loop into a generic templated implementation that ca…

nvdbaranec updated 1 week ago
3 comments

pytorch/pytorch #135095

Add MSCCL++ as a communication backend for PyTorch

🚀 The feature, motivation and pitch: MSCCL++ redefines inter-GPU communication interfaces, offering a highly efficient and customizable communication stack tailored for distributed GPU application…

lerrorgk updated 5 days ago
5 comments

rapidsai/cudf #15620

[FEA] Use SMs to submit small copies to prevent serializatio…

We have seen patterns where small `cudaMemcpyAsync` calls collide with large `cudaMemcpyAsync` calls being handled by the copy engine. Importantly, the small copy is in a different stream than the large copy. In…

abellina updated 3 months ago
5 comments
