-
**Describe the bug**
When I load a GPT-2 model with onnxruntime-gpu, many warnings appear, showing that some nodes will be computed on the CPU.
Is this expected, or did I do something wrong when c…
-
This issue will be used to track compilation failures for migraphx models on CPU and GPU. Compile failures for each model should have a link to an issue with a smaller reproducer in the notes column.
…
-
Hi again @golsun,
I've been working with DialogRPT using DialoGPT-large for dialog generation and have hit some performance issues that aren't present when using just DialoGPT-large. Round trip res…
-
Hello Team,
I am trying to execute the GPT-2 model (link given below) on a Mali G710 GPU. During execution I get the error below:
./ExecuteNetwork -c GpuAcc -f onnx-binary -d /mnt/dropbox/Mobi…
-
## Description
The release notes at https://github.com/dmlc/gluon-nlp/releases/tag/v0.8.1 say that BERT int8 quantization is presented in the blog post
https://medium.com/apache-mxnet/optimization-for-bert-inference-per…
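For context, the core idea behind int8 post-training quantization is mapping float32 values onto 8-bit integers via a per-tensor scale. A minimal, library-free sketch (illustrative only; real BERT int8 inference relies on fused MKL-DNN/oneDNN kernels, not this code):

```python
# Symmetric per-tensor int8 quantization sketch (illustrative only;
# production int8 BERT uses fused MKL-DNN/oneDNN kernels).

def quantize_int8(values):
    """Map float values to int8 using a symmetric per-tensor scale."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale

def dequantize_int8(q, scale):
    """Recover approximate float values from int8 codes."""
    return [x * scale for x in q]

weights = [0.5, -1.27, 0.031, 1.27]
q, scale = quantize_int8(weights)
restored = dequantize_int8(q, scale)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(q, scale, max_err)
```

The quantization error is bounded by half the scale per element, which is why int8 works well for weight tensors with a moderate dynamic range.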
-
### Describe the issue
I was intrigued by @tianleiwu's [excellent blog post](https://medium.com/microsoftazure/accelerate-your-nlp-pipelines-using-hugging-face-transformers-and-onnx-runtime-244357…
-
## Summary of Contributions (9th Feb)
1) **Increase the number of models in TorchBench that work with Dynamo as a tracer:** These passing rates are now comparable to those from torch.compile using I…
-
**Is your feature request related to a problem? Please describe.**
Defaulting to a high number of GPU layers doesn't always work. For instance, big models can overflow the card's memory and constrain the us…
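A back-of-the-envelope check for how many layers actually fit in VRAM illustrates the problem. A hedged sketch (the per-layer size and overhead figures are made-up placeholders, not measured values for any real model):

```python
def max_gpu_layers(vram_bytes, layer_bytes, overhead_bytes, total_layers):
    """Estimate how many transformer layers fit in VRAM.

    All sizes are placeholders: layer_bytes is the (model-specific)
    memory cost of one offloaded layer; overhead_bytes covers the
    KV cache and scratch buffers.
    """
    usable = vram_bytes - overhead_bytes
    if usable <= 0:
        return 0
    return min(total_layers, usable // layer_bytes)

# Hypothetical 7B-class model (32 layers, ~160 MiB/layer) on an 8 GiB card.
GiB = 1024 ** 3
print(max_gpu_layers(8 * GiB, 160 * 1024 ** 2, 1 * GiB, 32))
```

A default derived from such an estimate, rather than a fixed high layer count, would avoid overflowing smaller cards.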
-
- [ ] [Measuring inference speed metrics for hosted and local LLM](https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/client/src/c%2B%2B/perf_analyzer/genai-perf/README.html)…
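The metrics such tools report (time to first token, inter-token latency, throughput) all fall out of raw token arrival timestamps. A minimal sketch of that derivation (the names and numbers are illustrative, not genai-perf's implementation):

```python
def speed_metrics(request_start, token_times):
    """Derive common LLM latency metrics from per-token arrival times.

    request_start: wall-clock time the request was sent (seconds).
    token_times:   wall-clock arrival time of each generated token.
    """
    ttft = token_times[0] - request_start        # time to first token
    total = token_times[-1] - request_start      # end-to-end latency
    n = len(token_times)
    # Mean gap between consecutive tokens after the first one.
    itl = (token_times[-1] - token_times[0]) / (n - 1) if n > 1 else 0.0
    throughput = n / total                       # tokens per second
    return {"ttft_s": ttft, "itl_s": itl, "tok_per_s": throughput}

# Illustrative timestamps: first token after 0.5 s, then one every 50 ms.
m = speed_metrics(0.0, [0.5 + 0.05 * i for i in range(10)])
print(m)
```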
-
Thanks for open-sourcing the code!
This approach is very interesting, but I'm curious about the impact on performance (inference speed).
**Is there any benchmark showing the impact on performan…
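In the absence of a published benchmark, a quick wall-clock comparison with and without the change would answer this. A hedged sketch of such a harness (`dummy_step` is a stand-in workload, not this repo's API):

```python
import time

def benchmark(fn, *args, warmup=2, iters=5):
    """Time fn(*args): run warmup calls first, then report mean seconds/call."""
    for _ in range(warmup):
        fn(*args)
    start = time.perf_counter()
    for _ in range(iters):
        fn(*args)
    return (time.perf_counter() - start) / iters

# Stand-in workload; in practice, substitute the model's generate() call.
def dummy_step(n):
    return sum(i * i for i in range(n))

mean_s = benchmark(dummy_step, 10_000)
print(f"mean latency: {mean_s * 1e3:.3f} ms/call")
```

Running the same harness on the baseline and the modified model gives the per-call slowdown directly.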