-
I added CUDA Graphs: https://developer.nvidia.com/blog/cuda-graphs/
So you can now add the following to any cfg-file:
```
[net]
use_cuda_graph = 1
```
and Detection will be +20% faster on GPU (**starting from…
-
Running the default example doesn't work:
```text
Namespace(verbose=True, batch_size_for_cuda_graph=1, chat_template='', model='.\\example-models\\phi2-int4-directml')
Loading model...
Model loa…
```
-
### 🚀 The feature, motivation and pitch
I noticed that inductor has registered the PrivateUse1 backend, but the cudagraph implementation is hard-coded for CUDA, e.g. https://github.com/…
-
### Describe the issue
I have built onnxruntime-gpu 1.4.0 following . Both `import onnxruntime` and `onnxruntime.get_device()` behave normally, and `onnxruntime.InferenceSession()` seems ok…
-
### Checklist
- [X] 1. I have searched related issues but cannot get the expected help.
- [X] 2. The bug has not been fixed in the latest version.
- [X] 3. Please note that if the bug-related iss…
-
### 🐛 Describe the bug
When I `torch.compile` code that takes an argument of type `nn.Module`, a recompilation is triggered on every call with a different instance. I expected it to recompile on…
-
Hi there! Thank you for your amazing work on implementing faster components for transformer-based models! I've noticed that you launch multiple GPU kernels in an encoder or decoder. Have you ever trie…
-
CUDA graphs are designed to reduce launch overhead in exactly the scenario GeNN uses:
> Loop over timesteps
> …
> shortKernel1
> shortKernel2
> …
> shortKernelN
See https://devblogs…
-
### 🚀 The feature, motivation and pitch
vLLM only enables CUDA graphs for decoding-only batches (mainly because it did not see a big perf improvement when the batched token length is > 256). This behavior is pres…
-
### Question
I want to add a graph to the observation space by:
```python
# Create NetworkX graph
G = nx.Graph()
# Add nodes (C-alpha atoms) with 320-dimensional zero embeddings
for _, ro…
```
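The snippet above is cut off, but a minimal self-contained sketch of the idea (a NetworkX graph whose nodes carry fixed-size zero embeddings) could look like the following; the residue count and chain topology here are hypothetical stand-ins, not taken from the original code.

```python
import networkx as nx
import numpy as np

# Each node (e.g. a C-alpha atom) carries a fixed-size zero embedding.
EMBED_DIM = 320
NUM_RESIDUES = 5  # hypothetical; stands in for the rows iterated over above

G = nx.Graph()
for i in range(NUM_RESIDUES):
    G.add_node(i, embedding=np.zeros(EMBED_DIM, dtype=np.float32))

# Connect consecutive residues along the chain (a simple hypothetical topology).
for i in range(NUM_RESIDUES - 1):
    G.add_edge(i, i + 1)

# Stack node features into one array, e.g. to hand to an observation space.
features = np.stack([G.nodes[n]["embedding"] for n in G.nodes])
print(features.shape)  # (5, 320)
```

From here, `features` (together with an edge list derived from `G.edges`) is the kind of flattened form a gym-style observation space typically expects.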