-
Is multi-node supported in Triton Inference Server?
I built LLaMA-7B with tensorrtllm_backend and launched Triton Inference Server.
I have 4 GPUs, but Triton Inference Server loads only 1 GPU.
imag…
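If the goal is to spread the model over the 4 GPUs with tensor parallelism, a minimal sketch of the usual flow looks like the following; paths and sizes are placeholders, and exact flags depend on the TensorRT-LLM / tensorrtllm_backend version in use:
```shell
# Convert the HF checkpoint with 4-way tensor parallelism (placeholder paths).
python3 examples/llama/convert_checkpoint.py \
    --model_dir ./llama-7b-hf \
    --output_dir ./ckpt_tp4 \
    --dtype float16 \
    --tp_size 4

# Build the engine (one rank per GPU).
trtllm-build --checkpoint_dir ./ckpt_tp4 \
    --output_dir ./engines/llama-7b-tp4 \
    --gemm_plugin float16

# Launch Triton with an MPI world size matching tp_size so all 4 GPUs are used.
python3 scripts/launch_triton_server.py \
    --world_size 4 \
    --model_repo ./triton_model_repo
```
With `--world_size 1`, the server will only ever place the model on a single GPU, which matches the behavior described above.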
-
### System Info
- tensorrtllm_backend built using Dockerfile.trt_llm_backend
- TensorRT-LLM main branch (0.13.0.dev20240813000)
- 8xH100 SXM
- Driver Version: 535.129.03
- CUDA Version: 12.5
…
-
The engine works fine when I run offline inference with TRT-LLM from Python.
But when I use Triton to run it, it complains as follows.
Why is this? The Triton server uses more memory than TRT-LLM of…
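One common reason the Triton deployment appears to use more memory than offline TRT-LLM is that the tensorrt_llm backend pre-allocates a large fraction of free GPU memory for the KV cache by default. A sketch of lowering that fraction with the bundled template tool (the path is a placeholder, and parameter availability depends on the backend version):
```shell
# Reduce how much of the remaining free GPU memory the backend reserves for
# the KV cache (the default is close to 0.9 in recent versions).
python3 tools/fill_template.py -i triton_model_repo/tensorrt_llm/config.pbtxt \
    "kv_cache_free_gpu_mem_fraction:0.5"
```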
-
Hi there,
I have been fine-tuning Whisper models using Hugging Face. To convert the model to TensorRT-LLM format, I use an HF script that converts the models from their HF format to the original …
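For reference, the overall flow is usually two steps: map the fine-tuned HF weights back to the original OpenAI-style checkpoint layout, then build the TensorRT-LLM encoder/decoder engines from it. The sketch below is illustrative only; the script name and flags are assumptions and vary across TensorRT-LLM releases:
```shell
# Hypothetical script name and paths, for illustration only.
# 1) Convert the fine-tuned HF Whisper checkpoint to the original OpenAI layout.
python3 convert_hf_to_openai.py \
    --model_dir ./my-finetuned-whisper-hf \
    --output_path ./assets/my-finetuned-whisper.pt

# 2) Build the encoder/decoder engines following examples/whisper in the
#    TensorRT-LLM repo for the release you are using.
```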
-
**Description**
I was using Triton Server nvcr.io/nvidia/tritonserver:24.04-py3 on my local machine with Windows 10 via a Docker container. I installed the latest NVIDIA driver 555.85, and the Docker containe…
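For context, the usual way to start that container looks roughly like the following; the model-repository path and ports are placeholders, and GPU passthrough requires Docker Desktop's WSL2 backend with GPU support enabled:
```shell
# Start Triton with GPU access and a local model repository mounted in.
docker run --rm --gpus=all \
  -p 8000:8000 -p 8001:8001 -p 8002:8002 \
  -v /c/models:/models \
  nvcr.io/nvidia/tritonserver:24.04-py3 \
  tritonserver --model-repository=/models
```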
-
**Routine checks**
[//]: # (Delete the space inside the brackets and fill in an x)
+ [ ] I have confirmed there is no similar existing issue
+ [ ] I have confirmed I have upgraded to the latest version
+ [ ] I have read the project README in full and confirmed that the current version cannot meet my needs
+ [ ] I understand and am willing to follow up on this issue, helping with testing and providing feedback
+ [ ] I understand and accept the above, and understand that the maintainers have limited time; **issues that do not follow the rules may be…
-
### Branch/Tag/Commit
v5.2
### Docker Image Version
22.08-py3
### GPU name
V100
### CUDA Driver
none
### Reproduced Steps
```shell
use the fastertransformer triton backend …
-
### System Info
Hi,
I generated the TensorRT-LLM engine for a LLaMA-based model and see that the performance is much worse than vLLM.
I did the following:
- compile the model with TensorRT-LLM c…
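When comparing against vLLM, the build flags matter a lot: in-flight batching needs the attention plugin and paged KV cache enabled at build time, otherwise throughput typically falls well short. A sketch of a throughput-oriented build (paths and sizes are placeholders; flag names differ slightly across TensorRT-LLM versions):
```shell
# Enable the plugins and paged KV cache that in-flight batching relies on.
trtllm-build --checkpoint_dir ./ckpt \
    --output_dir ./engines/llama \
    --gemm_plugin float16 \
    --gpt_attention_plugin float16 \
    --paged_kv_cache enable \
    --max_batch_size 64
```
It is also worth confirming that the benchmark drives the server with concurrent requests, since a single-stream comparison will not exercise in-flight batching at all.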
-
[MLIR LSP server](https://mlir.llvm.org/docs/Tools/MLIRLSP/) is a tool that lets IDEs understand `.mlir` files of various dialects. By integrating with `mlir-lsp`-related tooling, we can make the IDE aware of t…
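As a rough sketch (build directories are placeholders), the server is built as part of an MLIR-enabled LLVM checkout and the editor is then pointed at the resulting binary:
```shell
# Build the LSP server target from an MLIR-enabled LLVM build.
cmake -G Ninja -S llvm -B build -DLLVM_ENABLE_PROJECTS=mlir -DCMAKE_BUILD_TYPE=Release
ninja -C build mlir-lsp-server

# Example integration: the VS Code MLIR extension reads the server location
# from its mlir.server_path setting, pointed at build/bin/mlir-lsp-server.
```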
-
**Description**
I'm using a simple client inference class based on the client example. My TensorRT inference with batch size 10 takes 150 ms, while my Triton setup with the TensorRT backend took 1100 ms. This is my client:…
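Without the full client it is hard to say where the 1100 ms goes; a common first step is to take the custom client out of the loop with Triton's perf_analyzer (the model name and batch size below are placeholders):
```shell
# Measure server-side latency/throughput directly, bypassing the custom client,
# to see whether the overhead comes from the server or from the client code.
perf_analyzer -m my_trt_model -b 10 --concurrency-range 1:4
```
If perf_analyzer reports latency close to the standalone TensorRT numbers, the gap is most likely in the client (e.g. per-request connection setup or unbatched requests) rather than in the backend.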