-
**Is your feature request related to a problem? Please describe.**
Yes, currently Triton Inference Server doesn't provide per-request inference time in the HTTP/gRPC response. This makes real-time pe…
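(Not part of the request: until such a field exists, per-request latency has to be measured on the client, with the server's aggregate per-model statistics as a cross-check. A minimal sketch, assuming a hypothetical model `my_model` with a single FP32 input `INPUT0` of shape [1, 3]:)
```
import time

import numpy as np
import tritonclient.http as httpclient

# Placeholder model/tensor names; adjust to your deployment.
client = httpclient.InferenceServerClient(url="localhost:8000")

inp = httpclient.InferInput("INPUT0", [1, 3], "FP32")
inp.set_data_from_numpy(np.zeros((1, 3), dtype=np.float32))

# Client-side round-trip time for one request.
start = time.perf_counter()
result = client.infer(model_name="my_model", inputs=[inp])
print(f"round trip: {(time.perf_counter() - start) * 1000:.2f} ms")

# Aggregate (not per-request) queue/compute durations reported by the server.
stats = client.get_inference_statistics(model_name="my_model")
print(stats)
```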
-
**Description**
Two commands:
### Run with GPU
```
docker run \
-d \
--name \
--gpus device=0 \
--entrypoint /opt/tritonserver/bin/tritonserver \
-p $PORT:8000 \
-t :…
```
-
**Description**
When I followed the official guidance to convert the ONNX model to TensorRT format and started the Triton Server, I encountered the following error:
![image](https://github.com/trit…
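(For reference, and not taken from this issue: a common way to produce a TensorRT engine from an ONNX model is `trtexec`; the paths, the FP16 flag, and the model-repository layout below are illustrative assumptions.)
```
# Convert the ONNX model to a serialized TensorRT engine (plan file).
# The TensorRT version used here must match the one bundled in the
# Triton container that will load the engine.
trtexec --onnx=model.onnx --saveEngine=model.plan --fp16

# Place the engine where Triton expects it:
#   model_repository/<model_name>/1/model.plan
```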
-
### Description
```shell
Host: Linux amd64
GPU: RTX 3060
container version: 22.12
GPT model converted from Megatron (model files and configs are from the GPT guide)
Dockerfile:
----
ARG TRITON_SE…
```
-
Hi,
### **Is there any way to correct the above-mentioned examples while transcribing through whisper-triton?**
The model is not able to transcribe a few words properly even though they are spelled normally.
For …
-
### System Info
Environment:
2x NVIDIA A100 with NVLink
TensorRT-LLM backend version v0.8.0
LLAMA2 engine built with paged_kv_cache and tp_size 2, world size 2
x86_64 arch
### Who can hel…
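(Not from the report: for a tp_size 2 / world size 2 engine, the usual way to serve it is to start one Triton rank per GPU via the launch helper in the tensorrtllm_backend repository; the script path and flag names below are assumptions based on the v0.8.0-era layout.)
```
# Hedged sketch: launch two MPI ranks of Triton for a TP=2 engine.
# --world_size must match tp_size * pp_size used when building the engine;
# the model repository path is a placeholder.
python3 tensorrtllm_backend/scripts/launch_triton_server.py \
    --world_size 2 \
    --model_repo /path/to/triton_model_repo
```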
-
Hi,
I am trying to use MMPose in the NVIDIA Triton server, but it does not support PyTorch models directly; it supports TorchScript, ONNX, and a few other formats. So, I have converted the MMPose MobileNetV2 model to…
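(Not part of the question: an ONNX export is usually served by dropping it into a Triton model repository with a minimal `config.pbtxt`. The model name, tensor names, and shapes below are placeholders, not the actual MMPose export's values.)
```
# model_repository/mmpose_mobilenetv2/config.pbtxt  (illustrative layout)
name: "mmpose_mobilenetv2"
platform: "onnxruntime_onnx"
max_batch_size: 8
input [
  {
    name: "input"            # placeholder: use the exported ONNX input name
    data_type: TYPE_FP32
    dims: [ 3, 256, 192 ]    # placeholder: a typical top-down pose input size
  }
]
output [
  {
    name: "output"           # placeholder: use the exported ONNX output name
    data_type: TYPE_FP32
    dims: [ -1, -1, -1 ]
  }
]
```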
-
Hi, is there any guide on how to deploy a YOLOv4 TAO model in Triton Inference Server? I have trained a YOLOv4 model on custom data via the TAO Toolkit and am looking for a guide on how to deploy this model wi…
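(Not from the question: the usual route is to convert the exported `.etlt` file into a TensorRT engine with `tao-converter` and serve the resulting plan from a Triton model repository. The key, input dimensions, and paths below are illustrative assumptions, and the exact flags needed depend on the exported YOLOv4 model.)
```
# Hedged sketch: build a TensorRT engine from the TAO export.
# -k is the key used during TAO training/export; -d gives the input dims.
tao-converter -k $KEY \
    -d 3,384,1248 \
    -e model_repository/yolov4_tao/1/model.plan \
    -t fp16 \
    yolov4_custom.etlt

# The engine is then served with platform: "tensorrt_plan" in config.pbtxt.
```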
-
I tested `tritonclient:2.43.0` on Ubuntu:22.04 with `grpcio:1.62.1` and was confronted with a memory leak. Example for reproduction:
```
import asyncio
from tritonclient.grpc.aio import Inferen…
```
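(The reproduction above is cut off; a minimal sketch of the same pattern, assuming the truncated import is `InferenceServerClient` and a gRPC endpoint at `localhost:8001`, repeatedly creating and closing the async client:)
```
import asyncio

from tritonclient.grpc.aio import InferenceServerClient


async def main():
    # Repeatedly create, use, and close the async client; in the reported
    # setup, process memory keeps growing across iterations.
    for _ in range(10_000):
        client = InferenceServerClient(url="localhost:8001")
        await client.is_server_live()
        await client.close()


asyncio.run(main())
```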
-
I currently have an LLM engine built on TensorRT-LLM. I am trying to evaluate different setups and the gains from each type.
I was trying to deploy the Llama model on a multi-GPU setup, whereby between the 4 GPUs, I would hav…
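(Not from the post: for reference, a tensor-parallel engine spanning 4 GPUs is typically produced before deployment by converting the checkpoint with the desired tp_size and then building the engine; the paths and flag names below are assumptions based on recent TensorRT-LLM Llama examples.)
```
# Hedged sketch: build a TP=4 Llama engine (one rank per GPU at serve time).
python3 examples/llama/convert_checkpoint.py \
    --model_dir /path/to/llama_hf \
    --output_dir /tmp/llama_ckpt_tp4 \
    --dtype float16 \
    --tp_size 4

trtllm-build \
    --checkpoint_dir /tmp/llama_ckpt_tp4 \
    --output_dir /tmp/llama_engine_tp4 \
    --gemm_plugin float16
```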