-
Hello,
I am currently experiencing an issue with the `triton-inference-server/tensorrt_backend` while trying to run a Baichuan model.
### Description
I have set `gpt_model_type=inflight_fused…
-
### System Info
GPU: NVIDIA T4 * 4
Driver Version: 550.54.15
CUDA: 12.4
Image: nvcr.io/nvidia/tritonserver:24.07-trtllm-python-py3
TensorRT-LLM version: 0.11.0
### Who can help?
No response…
-
### System Info
- CPU architecture: x86_64
- GPU: A100-80GB
- CUDA version: 11
- TensorRT-LLM version: 0.9.0
- Triton server version: 2.46.0
- Model: Llama3-7b
### Who can help?
_No respo…
-
### System Info
- Architecture: x86_64
- OS: Ubuntu 22.04
- GPU: NVIDIA GeForce RTX 4090
- GPU memory: 2x 24 GB
- CPU max MHz: 5000.0000
- Driver Version: 535.183.01
- CUDA Version: 12.2
- Conta…
-
Hello, I want to deploy a quantized llama-3-8b model using tritonserver. I followed the steps below to do this (a client sketch for querying the result follows the list):
1. Create a container with the nvcr.io/nvidia/tritonserver:24.06-trtllm-python-py3 base image.
3.…
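Once the steps above are done and the server is running, querying the deployed model can look like the following minimal sketch. It assumes the default tensorrtllm_backend ensemble with `text_input`, `max_tokens`, and `text_output` tensors; the model name and tensor names may differ in your model repository.
```python
# Minimal sketch: query a TensorRT-LLM model served by Triton over HTTP.
# The model name "ensemble" and the tensor names are assumptions based on
# the default tensorrtllm_backend ensemble and may need to be adapted.
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

prompt = np.array([["What is Triton Inference Server?"]], dtype=object)
max_tokens = np.array([[64]], dtype=np.int32)

inputs = [
    httpclient.InferInput("text_input", [1, 1], "BYTES"),
    httpclient.InferInput("max_tokens", [1, 1], "INT32"),
]
inputs[0].set_data_from_numpy(prompt)
inputs[1].set_data_from_numpy(max_tokens)

outputs = [httpclient.InferRequestedOutput("text_output")]

result = client.infer(model_name="ensemble", inputs=inputs, outputs=outputs)
print(result.as_numpy("text_output"))
```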
-
Can this be done by leveraging the onnxruntime work we already have as a backend?
As a preliminary step, learn to add a CUDA backend,
then change it to MIGraphX/ROCm.
See [https://github.com…
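To make the execution-provider swap concrete, here is a minimal sketch at the ONNX Runtime Python level (which is what the onnxruntime backend drives underneath). The model path and input name are placeholders, and the MIGraphX/ROCm providers are only available in an onnxruntime build that includes them.
```python
# Minimal sketch: run the same ONNX model on different execution providers.
# "model.onnx" and the input name "input" are placeholders; provider
# availability depends on how onnxruntime was built.
import numpy as np
import onnxruntime as ort

model_path = "model.onnx"
feed = {"input": np.random.rand(1, 3, 224, 224).astype(np.float32)}

# Step 1: run with the CUDA execution provider.
cuda_sess = ort.InferenceSession(model_path, providers=["CUDAExecutionProvider"])
print(cuda_sess.run(None, feed)[0].shape)

# Step 2: switch to MIGraphX/ROCm by changing only the provider list.
amd_sess = ort.InferenceSession(
    model_path,
    providers=["MIGraphXExecutionProvider", "ROCMExecutionProvider"],
)
print(amd_sess.run(None, feed)[0].shape)
```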
-
**Description**
A clear and concise description of what the bug is.
![output_image](https://github.com/user-attachments/assets/bed4e808-a3e0-4225-96c4-04ae69c65a15)
**Triton Information**
…
-
I have a BERT model that I am trying to deploy with Triton Inference Server using the TensorRT-LLM backend, but I am getting errors:
- Docker Image: 24.03
- TensorRT-LLM: v0.8.0
Error:
+-------+-…
-
@Rasantis hey!
Absolutely, YOLOv8 is designed with efficiency in mind and supports processing multiple video streams in real time, including RTSP streams. For handling 20+ cameras, implementation…
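For reference, a minimal per-camera streaming sketch with the ultralytics API could look like the following; the RTSP URLs are placeholders, and scaling to 20+ cameras typically also involves batching or multiple processes/GPUs, which this sketch does not cover.
```python
# Minimal sketch: one YOLOv8 inference loop per RTSP stream, each in its
# own thread. The RTSP URLs are placeholders.
from threading import Thread

from ultralytics import YOLO


def run_stream(url: str) -> None:
    model = YOLO("yolov8n.pt")  # one model instance per thread
    # stream=True yields results frame by frame instead of accumulating them
    for result in model(url, stream=True):
        print(f"{url}: {len(result.boxes)} objects in current frame")


urls = ["rtsp://camera-1/stream", "rtsp://camera-2/stream"]  # placeholders
threads = [Thread(target=run_stream, args=(u,), daemon=True) for u in urls]
for t in threads:
    t.start()
for t in threads:
    t.join()
```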
-
## Description
I have two different modules that I converted to TRT. When I run them serially, the inference-only cost time is:
```
//10 times
do_infer >> cost 400.60 msec. //warm-up
do_infer >> cost 42.22 …