-
@npuichigo I am trying to use [Triton Inference Server with TensorRT-LLM backend](https://nvidia.github.io/TensorRT-LLM/quick-start-guide.html#deploy-with-triton-inference-server) with [openweb-ui](ht…
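A minimal sketch of the request shape the TensorRT-LLM ensemble's generate endpoint expects, used to sanity-check the server before pointing a UI at it; the URL, model name, and field names are assumptions based on the default `tensorrtllm_backend` ensemble config and may differ per deployment.

```python
import requests

# Assumed endpoint of the default "ensemble" model exposed by the
# TensorRT-LLM backend's generate API; adjust host and model name as needed.
TRITON_URL = "http://localhost:8000/v2/models/ensemble/generate"

payload = {
    "text_input": "What is machine learning?",  # field names follow the default ensemble config
    "max_tokens": 64,
    "bad_words": "",
    "stop_words": "",
}

resp = requests.post(TRITON_URL, json=payload, timeout=60)
resp.raise_for_status()
print(resp.json()["text_output"])
```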
-
The Triton Inference Server supports TensorRT models, and our Triton Serving Runtime [indicates this](https://github.com/kserve/modelmesh-serving/blob/main/config/runtimes/triton-2.x.yaml#L28).
…
-
### System Info
**Hardware:**
- CPU architecture: x86_64
- CPU memory size:
- L1d cache: 2 MiB
- L1i cache: 2 MiB
- L2 cache: 64 MiB
- L3 cache: 256 MiB
- GPU name: NVIDIA A100 80GB PCIe
…
-
Hi there,
I have been fine-tuning Whisper models using Hugging Face. Then, to convert the model to TensorRT-LLM format, I use an HF script that converts the model from its HF format to the original …
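For reference, a minimal sketch of how the fine-tuned checkpoint is loaded and re-saved on the Hugging Face side before running the conversion script; the checkpoint paths below are placeholders.

```python
import torch
from transformers import WhisperForConditionalGeneration, WhisperProcessor

# Placeholder path to the fine-tuned Hugging Face Whisper checkpoint.
ckpt = "path/to/finetuned-whisper"

model = WhisperForConditionalGeneration.from_pretrained(ckpt, torch_dtype=torch.float16)
processor = WhisperProcessor.from_pretrained(ckpt)

# Re-save so the conversion script sees a complete save_pretrained()
# directory (config, tokenizer/processor files, and weights together).
model.save_pretrained("whisper-finetuned-hf")
processor.save_pretrained("whisper-finetuned-hf")
```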
-
### System Info
- docker image: nvcr.io/nvidia/tritonserver:24.05-trtllm-python-py3
- tensorrt_llm: 0.9.0
### Who can help?
@kaiyux @byshiue
### Information
- [ ] The official example scripts…
-
**Description**
I am testing tritonserver on the example models fetched using this script:
https://github.com/triton-inference-server/server/blob/main/docs/examples/fetch_models.sh
triton serve…
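As a sanity check before sending inference requests, the server can be queried with the Python HTTP client; `densenet_onnx` is one of the models that fetch_models.sh downloads.

```python
import tritonclient.http as httpclient

# Assumes tritonserver is running locally with the example model repository.
client = httpclient.InferenceServerClient(url="localhost:8000")

print("server ready:", client.is_server_ready())
print("model ready:", client.is_model_ready("densenet_onnx"))
print(client.get_model_metadata("densenet_onnx"))
```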
-
### System Info
I have searched the repo here and the main server repo but don't see any information on either a) support for Safetensors (many models on HF are saved that way) or b) whether th…
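For background, this is what is meant by Safetensors checkpoints: many HF models ship only `.safetensors` shards, which can be inspected like this (the path is a placeholder).

```python
from safetensors.torch import load_file

# Placeholder path to one shard of a Hugging Face checkpoint.
state_dict = load_file("model.safetensors")

for name, tensor in state_dict.items():
    print(name, tuple(tensor.shape), tensor.dtype)
```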
-
Looking at the release of TensorRT 9.1.0, I am very happy to see the integration of openai-triton with TensorRT plugins.
However [one limitation of this integration is that python must be availabl…
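To make the limitation concrete, here is a toy openai-triton kernel of the kind the plugin integration targets; because the kernel is JIT-compiled from its Python source at runtime, the interpreter and the `triton` package have to be present wherever the plugin executes. The kernel below is only an illustrative sketch, not taken from the TensorRT integration itself.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def scale_kernel(x_ptr, out_ptr, scale, n_elements, BLOCK: tl.constexpr):
    # Each program instance handles one BLOCK-sized slice of the tensor.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x * scale, mask=mask)

x = torch.randn(4096, device="cuda")
out = torch.empty_like(x)
grid = (triton.cdiv(x.numel(), 1024),)
scale_kernel[grid](x, out, 2.0, x.numel(), BLOCK=1024)
```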
-
**Is your feature request related to a problem? Please describe.**
Currently, the fastest way to run Computer Vision models for inference is to use a TensorRT-optimised model. It is widely a…
-
**Environments:**
- os: ubuntu server 22.04 LTS
- gpu: H100*2
- docker-ce: 5:27.1.2
- nvidia-container-toolkit: 1.16.1
- image: styler00dollar/vsgan_tensorrt:latest (08/15/2024)
- commit: …