-
A few options to explore:
1. NVIDIA NeMo, TensorRT-LLM, Triton
- NeMo
Run [this Generative AI example](https://github.com/NVIDIA/GenerativeAIExamples/tree/main/models/Gemma) to build LoRA wi…
-
Triton provides an extension to the standard gRPC inference API for streaming (`inference.GRPCInferenceService/ModelStreamInfer`); this extension is required to use the vLLM backend with Triton.
However …
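For concreteness, here is a minimal sketch of that streaming path using the `tritonclient.grpc` Python client. It assumes a decoupled vLLM model named `vllm_model` with a `text_input` input and a `text_output` output; those names are placeholders, so adjust them to your model repository.
```python
import queue
from functools import partial

import numpy as np
import tritonclient.grpc as grpcclient

def callback(results, result, error):
    # Responses (or errors) from the stream arrive on this callback.
    results.put(error if error is not None else result)

results = queue.Queue()
client = grpcclient.InferenceServerClient(url="localhost:8001")

text = np.array(["Tell me a joke"], dtype=object)
inp = grpcclient.InferInput("text_input", [1], "BYTES")
inp.set_data_from_numpy(text)

# start_stream opens the bidirectional ModelStreamInfer RPC; every
# async_stream_infer request is multiplexed onto that one stream.
client.start_stream(callback=partial(callback, results))
client.async_stream_infer(model_name="vllm_model", inputs=[inp])

# A decoupled model may send many responses per request; drain one here.
first = results.get()
if not isinstance(first, Exception):
    print(first.as_numpy("text_output"))
client.stop_stream()
```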
-
**Description**
I am using the Triton Inference Server with a TensorRT backend, Sequence Batching with the Oldest scheduling strategy, and Implicit State Management. I would like to find the most efficient method …
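For reference, a sketch of the relevant `config.pbtxt` fragment for that combination; the tensor names, dims, and the `max_candidate_sequences` value below are placeholders, not a definitive configuration:
```
sequence_batching {
  oldest {
    max_candidate_sequences: 4
  }
  state [
    {
      input_name: "INPUT_STATE"    # state tensor fed back into the model
      output_name: "OUTPUT_STATE"  # state tensor produced each step
      data_type: TYPE_FP32
      dims: [ -1 ]
      initial_state: {
        data_type: TYPE_FP32
        dims: [ 1 ]
        zero_data: true            # Triton zero-initializes the first step
        name: "initial state"
      }
    }
  ]
}
```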
-
We are using Triton Inference Server for model inference and are currently facing throughput bottlenecks with LLM inference. I saw in a public video that NVIDIA has optimized LLM serving by supporting `In…
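If it helps, in the TensorRT-LLM backend this is typically switched on in the `tensorrt_llm` model's `config.pbtxt`; the parameter below follows the tensorrtllm_backend examples, but accepted values depend on the backend version, so treat it as a sketch:
```
parameters: {
  key: "gpt_model_type"
  # "inflight_fused_batching" enables in-flight (continuous) batching;
  # "V1" falls back to the static batching path.
  value: { string_value: "inflight_fused_batching" }
}
```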
-
Hi Team,
Any updates on Inflight Batching support in Triton via the Python client?
Thanks!
-
Hi experts,
I'm running a 1.3B model on Windows with a 16 GB V100 using the environment below, but hit an issue I couldn't find any clue about. Could you please help check it?
TensorRT-LLM version: tag v0.10.0…
-
### System Info
GPU: NVIDIA A10G
CUDA version: 12.3
Driver version: 535.183.01
TensorRT-LLM: v0.8.0
Image: nvcr.io/nvidia/tritonserver:24.02-trtllm-python-py3 (was used to build the TensorRT engine an…
-
The `n` option behaves differently from OpenAI's: when I use `n`, it switches to beam search.
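For comparison, OpenAI's API defines `n` as the number of independently sampled completions rather than a beam width; a minimal illustration with the official Python client (the model name is illustrative):
```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

resp = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative model name
    messages=[{"role": "user", "content": "Say hi"}],
    n=3,                  # three independent samples, not beam search
    temperature=0.8,
)
for choice in resp.choices:
    print(choice.message.content)
```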
-
### System Info
Arch: x86-64
GPU: RTX 3070
Docker image: nvcr.io/nvidia/tritonserver:24.01-trtllm-python-py3
TensorRT-LLM backend tag: 0.7.2
TensorRT-LLM tag: 0.7.1 (80bc07510ac4ddf13c0d76ad2…
-
**Description**
r23.04
```
I0718 11:39:24.385839 1 server.cc:653]
| Model | Version | Status …