-
Hi all! Using global partitioning fails with the dynamo backend, and I couldn't pinpoint why (I tried various compilation parameters).
How to Reproduce:
System:
```
Cuda Driver Version: …
-
They're designed to reduce the overhead in the exact scenario used by GeNN:
> Loop over timesteps
> …
> shortKernel1
> shortKernel2
> …
> shortKernelN
See https://devblogs…
-
Hi there! Thank you for your amazing work on implementing the faster components for transformer-based models! I've found that you have multiple GPU kernels in an encoder or decoder. Have you ever trie…
-
### Describe the issue
A simple model with GEMM(DQ(Q(input0)), DQ(Q(input1))) quantizing FP32 -> FP8E4M3 fails to run using the CPU EP. It runs when using the CUDA EP.
An identical model using…
-
Platforms: linux, slow
This test was disabled because it is failing in CI. See [recent examples](https://hud.pytorch.org/flakytest?name=test_cuda_event_created_outside_of_graph_dynamic_shapes&suite=D…
-
Model: Qwen-14B-Chat (QWen2)
Dataset: https://huggingface.co/datasets/Hello-SimpleAI/HC3-Chinese/blob/main/open_qa.jsonl
Environment: 2 A30 GPUs
Issue 1:
Error: can't init model correctly. Disab…
-
## 🚀 Feature
Allow users to specify regions where CUDA memory allocations are satisfied from a private pool.
## Motivation
CUDA graph capture is our main motivation. But it seems like a handy…
-
### 🐛 Describe the bug
After quantizing a ResNet-18 model with PyTorch 2 Export Post Training Quantization, it is not possible to export the model.
```python
import torch
from torchvision.model…
-
torch.compile can accelerate small batch sizes for llama-3 8B. However, it is sometimes slower for large batch sizes or with tensor parallelism. We use this issue to track the performance and potential fix…
-
Great job!
We found that Quest is implemented on a previous version of flashinfer, and some common features are not currently supported:
* bsz > 1
* GQA
* CUDA graph
Is there any plan to update t…