-
**Describe the bug**
I want to deploy a TensorRT engine with triton-inference-server, but it can't load the TRT model.
**To Reproduce**
I've converted the TRT engine file from an mmdet model with the doc…
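A quick way to see why the engine was rejected is to ask the server itself: the repository index reports a per-model `state` and `reason`. Below is a minimal Python sketch, assuming `tritonclient[http]` is installed, the server runs on the default HTTP port, and the model name `mmdet_trt` is a placeholder.
```python
# Minimal sketch: query Triton for each model's load state and failure reason.
# Assumes a local server on localhost:8000; "mmdet_trt" is a hypothetical name.
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

print("server live :", client.is_server_live())
print("server ready:", client.is_server_ready())

# The repository index lists every model with its state and, when loading
# failed, the reason reported by the backend.
for entry in client.get_model_repository_index():
    print(entry.get("name"), entry.get("state"), entry.get("reason", ""))

print("model ready :", client.is_model_ready("mmdet_trt"))
```
A TensorRT plan is tied to the TensorRT version (and GPU) it was built with, so a mismatch between the version used for conversion and the one inside the Triton container is a common thing for the `reason` field to report.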
-
**Description**
I'm trying to serve an embedding model [FastText] in triton-server using Python as its backend. The only external dependency is the fasttext module, which in turn depends on numpy. …
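For reference, a Python-backend model for this kind of embedding workload boils down to a single `model.py`. The sketch below is a minimal, hypothetical version: the tensor names (`TEXT`, `EMBEDDING`), model path, and shapes are assumptions, not taken from the issue, and need to match the model's `config.pbtxt`.
```python
# model.py -- minimal sketch of a Triton Python backend serving FastText
# sentence embeddings. Tensor names ("TEXT", "EMBEDDING") and the model path
# are placeholders; they must match the model's config.pbtxt.
import numpy as np
import fasttext
import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    def initialize(self, args):
        # Load the FastText model once per model instance.
        self.model = fasttext.load_model("/models/fasttext/1/model.bin")

    def execute(self, requests):
        responses = []
        for request in requests:
            # Input is a batch of UTF-8 strings (TYPE_STRING / BYTES in Triton).
            texts = pb_utils.get_input_tensor_by_name(request, "TEXT").as_numpy()
            vectors = np.stack([
                self.model.get_sentence_vector(t.decode("utf-8"))
                for t in texts.reshape(-1)
            ]).astype(np.float32)
            out = pb_utils.Tensor("EMBEDDING", vectors)
            responses.append(pb_utils.InferenceResponse(output_tensors=[out]))
        return responses

    def finalize(self):
        self.model = None
```
Since fasttext is not part of the stock backend environment, it is typically provided either in a custom image built on the Triton container or through a packed conda environment referenced by the Python backend's `EXECUTION_ENV_PATH` parameter in `config.pbtxt`.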
-
**Is your feature request related to a problem? Please describe.**
I’m facing an issue when deploying large models in Kubernetes, especially when the pod’s ephemeral storage is limited. Triton Infere…
-
A few options to explore:
1. NVIDIA NeMo, TensorRT-LLM, Triton
   - NeMo
     Run [this Generative AI example](https://github.com/NVIDIA/GenerativeAIExamples/tree/main/models/Gemma) to build LoRA wi…
-
The server seems to be OK, based on the following log:
```
I1212 03:29:51.067415 37860 server.cc:674]
+----------------+---------+--------+
| Model          | Version | Status |
+----------------+---…
```
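If the table shows the model as READY but requests still fail, a client-side smoke test usually narrows things down. A sketch follows; the model name, tensor names, dtype, and shape are placeholders (the log above is truncated), so take the real ones from the model metadata.
```python
# Smoke test against a model the server reports as READY.
# "my_model", "INPUT__0", "OUTPUT__0", and the shape are placeholders;
# the metadata call prints the real input/output names and shapes.
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")
print(client.get_model_metadata("my_model"))

data = np.random.rand(1, 3, 224, 224).astype(np.float32)
inp = httpclient.InferInput("INPUT__0", list(data.shape), "FP32")
inp.set_data_from_numpy(data)

result = client.infer("my_model", inputs=[inp])
print(result.as_numpy("OUTPUT__0").shape)
```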
-
Does the newest version support Qwen2-VL? In particular, the mrope param needs to be sent to the LLM.
-
## Description
When requesting token metrics from an endpoint running an LMI container using a vLLM engine, **non-zero** values are returned for tokenThroughput, totalTokens, and tokenPerRequest (**as…
-
**Description**
When deploying an ONNX model using the Triton Inference Server's ONNX Runtime backend, the inference performance on the CPU is noticeably slower than running the same model usi…
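One way to separate backend overhead from the model itself is to time the same `.onnx` file directly with onnxruntime on the CPU and compare against Triton's numbers. A rough sketch, assuming a hypothetical model path, thread count, and input shape:
```python
# Rough CPU baseline: run the same .onnx file directly with onnxruntime and
# time it, for comparison with the Triton ONNX Runtime backend. The model
# path, thread count, and input shape are placeholders.
import time
import numpy as np
import onnxruntime as ort

opts = ort.SessionOptions()
opts.intra_op_num_threads = 8  # match whatever Triton's backend is configured to use

session = ort.InferenceSession("model.onnx", sess_options=opts,
                               providers=["CPUExecutionProvider"])
input_name = session.get_inputs()[0].name
data = np.random.rand(1, 3, 224, 224).astype(np.float32)

# Warm up, then measure average latency over repeated runs.
for _ in range(10):
    session.run(None, {input_name: data})

runs = 100
start = time.perf_counter()
for _ in range(runs):
    session.run(None, {input_name: data})
print(f"avg latency: {(time.perf_counter() - start) / runs * 1000:.2f} ms")
```
If the standalone run is much faster, the gap often comes down to thread configuration: the onnxruntime backend exposes intra/inter-op thread-count parameters in `config.pbtxt`, and instance-group settings also affect CPU contention.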
-
**Description**
Hi, I have set up Triton version 2.47 for Windows, along with the ONNX Runtime backend, based on the assets for Triton 2.47 that are mentioned in this URL: https://github.com/triton-infer…
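A quick way to confirm the Windows server and backend actually came up is to hit the standard KServe v2 HTTP endpoints that Triton exposes. A small sketch, assuming the default port 8000 and a placeholder model name:
```python
# Check that tritonserver.exe is live/ready and that the ONNX model loaded.
# Uses Triton's standard KServe v2 HTTP endpoints on the default port 8000;
# "my_onnx_model" is a placeholder.
import requests

BASE = "http://localhost:8000"

print("live :", requests.get(f"{BASE}/v2/health/live").status_code)   # 200 = live
print("ready:", requests.get(f"{BASE}/v2/health/ready").status_code)  # 200 = ready
print("model:", requests.get(f"{BASE}/v2/models/my_onnx_model/ready").status_code)

# Server metadata reports the Triton version and enabled extensions.
print(requests.get(f"{BASE}/v2").json())
```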
-
Tracking the second round of issues submitted to [triton-inference-server](https://github.com/triton-inference-server/server):
- [ ] https://github.com/triton-inference-server/server/issues/2018: Con…