-
Hi,
I found that the `unpad_input` function makes CUDA graph capture fail when a `key_attention_mask` is provided.
https://github.com/HazyResearch/flash-attention/blob/72ad03eaa661f6bf3a14c855316c27fbab4f…
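The root cause is that unpadding produces tensors whose length depends on the mask's *values*, and CUDA graph capture only supports fixed shapes and addresses. Below is a minimal NumPy sketch of the idea (a toy stand-in, not flash-attention's actual `unpad_input` implementation):

```python
import numpy as np

def unpad_sketch(hidden, mask):
    """Toy stand-in for unpad_input: keep only unmasked tokens.

    The output length equals mask.sum(), i.e. it depends on the values
    in the mask, not just its shape -- exactly the kind of
    data-dependent shape that CUDA graph capture cannot record.
    """
    idx = np.nonzero(mask.reshape(-1))[0]  # data-dependent index set
    return hidden.reshape(-1, hidden.shape[-1])[idx], idx

hidden = np.arange(24, dtype=np.float32).reshape(2, 3, 4)  # (batch, seq, dim)
mask = np.array([[1, 1, 0], [1, 0, 0]])                    # 3 valid tokens
unpadded, idx = unpad_sketch(hidden, mask)
print(unpadded.shape)  # (3, 4) -- the shape changes whenever the mask changes
```

A captured graph replays fixed kernel launches on fixed buffers, so any op whose output size is decided at runtime by tensor contents breaks capture.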
-
I want to use CUDA instead of the CPU to speed up tag inference.
My machine runs Ubuntu 22.04.3 LTS (GNU/Linux 6.5.0-35-generic x86_64) with CUDA 12.2.
I learned from https://onnxruntime.ai/docs/…
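For reference, this is the kind of provider selection the ONNX Runtime docs describe; a minimal sketch (session creation is commented out because it needs `onnxruntime-gpu` installed and a real model file; `model.onnx` is a placeholder):

```python
# Select the CUDA execution provider, falling back to CPU if CUDA
# initialization fails. Requires the onnxruntime-gpu package and
# CUDA 12.x libraries on the library path.
providers = [
    ("CUDAExecutionProvider", {"device_id": 0}),
    "CPUExecutionProvider",  # fallback
]

# import onnxruntime as ort
# sess = ort.InferenceSession("model.onnx", providers=providers)
# print(sess.get_providers())  # confirm CUDAExecutionProvider is active
```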
-
### Bug Description
**Description**:
I've encountered an issue where the `example_soft_body` example in `example.sim` remains in its initial state and does not move when `sim_substep` is set to an o…
-
### Checklist
- [x] 1. I have searched related issues but cannot get the expected help.
- [x] 2. The bug has not been fixed in the latest version.
- [x] 3. Please note that if the bug-related iss…
-
The autotuner is currently tied closely to the CUDA backend: it accepts a number of CUDA-specific parameters and passes them to `do_bench` or `do_bench_cudagraph`, both of which call into many CUDA-spe…
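One way to decouple the two would be to push the device specifics behind a callback the backend supplies. A hypothetical sketch (the names here are ours, not the existing `do_bench` API):

```python
import time

def do_bench_generic(fn, warmup=3, rep=10, synchronize=lambda: None):
    """Hypothetical backend-agnostic timer.

    All device specifics live in the `synchronize` callable supplied by
    the backend (a CUDA backend would pass torch.cuda.synchronize; a CPU
    backend can use the no-op default).
    """
    for _ in range(warmup):
        fn()
    synchronize()
    start = time.perf_counter()
    for _ in range(rep):
        fn()
    synchronize()
    return (time.perf_counter() - start) / rep * 1e3  # ms per call

# On CPU the default no-op synchronize suffices:
ms = do_bench_generic(lambda: sum(range(1000)))
```

The autotuner would then only ever see the generic entry point, and each backend registers its own synchronization (and, where supported, graph-based timing) behind it.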
-
### 🐛 Describe the bug
Running the model for training with CUDA graphs enabled
```
TORCHINDUCTOR_MAX_AUTOTUNE=1 time python benchmarks/dynamo/huggingface.py --accuracy --no-translation-validation -…
-
### 🐛 Describe the bug
I get a `BackendCompilerFailed` error when trying to compile FlexAttention with a block mask.
Here is a minimal example that reproduces the error:
```
import torch…
-
### Prerequisites
- [X] I am running the latest code. Mention the version if possible as well.
- [X] I carefully followed the [README.md](https://github.com/ggerganov/llama.cpp/blob/master/README.md)…
-
### Describe the issue
I'm using onnx-tensorrt.
When I enable `trt_cuda_graph_enable` like this:
![image](https://github.com/microsoft/onnxruntime/assets/67405690/0f239de5-f995-43df-aa8a-805674…
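In code form, the setting shown in the screenshot corresponds to a TensorRT provider option roughly like the following sketch (session creation commented out since it needs `onnxruntime-gpu` and a model; `model.onnx` is a placeholder):

```python
# Enable CUDA graphs in the TensorRT execution provider via provider
# options; the option name is taken from the setting discussed above.
providers = [
    ("TensorrtExecutionProvider", {"trt_cuda_graph_enable": True}),
    "CUDAExecutionProvider",  # fallback for unsupported subgraphs
]

# import onnxruntime as ort
# sess = ort.InferenceSession("model.onnx", providers=providers)
```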
-
### Feature request
In my experiments, I cannot get PyTorch CUDA graphs to work with HF `generate`. CUDA graphs work fine when calling the forward pass of a model, but either due to static input/output s…
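For context, the usual way around CUDA graphs' static input/output requirement is the static-buffer pattern: preallocate fixed tensors, copy each new batch into them in place, and replay the captured graph. A CPU-only NumPy illustration of the idea (the real thing would use `torch.cuda.CUDAGraph` capture and replay):

```python
import numpy as np

# Static-buffer pattern: graph replay requires every tensor address to
# stay fixed, so inputs are copied *into* preallocated buffers rather
# than passed as freshly allocated tensors.
static_in = np.zeros(4, dtype=np.float32)   # captured input buffer
static_out = np.zeros(4, dtype=np.float32)  # captured output buffer

def replay():
    # Stand-in for graph.replay(): reads and writes only the static buffers.
    np.multiply(static_in, 2.0, out=static_out)

for batch in ([1, 2, 3, 4], [5, 6, 7, 8]):
    static_in[:] = batch        # in-place copy; addresses unchanged
    replay()
    print(static_out.tolist())  # [2.0, 4.0, 6.0, 8.0] then [10.0, 12.0, 14.0, 16.0]
```

The difficulty with `generate` is that sequence length grows step by step, so making every iteration fit one set of fixed-shape buffers is exactly the hard part.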