-
**Description**
I implemented multi-instance inference across 4 A100 GPUs by following [this](https://triton-inference-server.github.io/pytriton/latest/binding_models/#multi-instance-model-inferenc…
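For context, a minimal sketch of the pattern that page describes, assuming the `pytriton` package: passing one inference callable per GPU to `infer_func` is how PyTriton spreads instances across devices. The model, tensor names, and the doubling "model" below are placeholders, not the poster's code:

```python
import numpy as np
import torch
from pytriton.decorators import batch
from pytriton.model_config import ModelConfig, Tensor
from pytriton.triton import Triton

class _InferFn:
    """One callable per GPU; PyTriton load-balances requests across them."""

    def __init__(self, device: str):
        self.device = torch.device(device)

    @batch
    def __call__(self, INPUT: np.ndarray):
        # Placeholder computation pinned to this instance's device.
        x = torch.from_numpy(INPUT).to(self.device)
        return {"OUTPUT": (x * 2).cpu().numpy()}

with Triton() as triton:
    triton.bind(
        model_name="MultiInstance",
        # A list of callables creates one model instance per entry,
        # here one per GPU 0..3.
        infer_func=[_InferFn(f"cuda:{i}") for i in range(4)],
        inputs=[Tensor(name="INPUT", dtype=np.float32, shape=(-1,))],
        outputs=[Tensor(name="OUTPUT", dtype=np.float32, shape=(-1,))],
        config=ModelConfig(max_batch_size=64),
    )
    triton.serve()
```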
-
https://developer.nvidia.com/nvidia-triton-inference-server
-
### Search before asking
- [X] I have searched the YOLOv8 [issues](https://github.com/ultralytics/ultralytics/issues) and [discussions](https://github.com/ultralytics/ultralytics/discussions) and fou…
-
I need to use locally deployed LLMs for evaluation within my current setup. While setting up LLM monitoring with Phoenix, I require evaluations over the traces, but I am only able to find [evaluation llm…
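In case it helps, a sketch of one common workaround, assuming the local LLM exposes an OpenAI-compatible API (e.g., via vLLM or a llama.cpp server): point Phoenix's eval model at that endpoint. The endpoint URL and model name are placeholders, and parameter names may vary across `phoenix` versions:

```python
import pandas as pd
from phoenix.evals import OpenAIModel, llm_classify
from phoenix.evals import RAG_RELEVANCY_PROMPT_TEMPLATE

# Assumption: a locally deployed LLM serving an OpenAI-compatible API.
eval_model = OpenAIModel(
    model="local-model",                  # placeholder model name
    base_url="http://localhost:8000/v1",  # placeholder local endpoint
    api_key="not-needed",                 # local servers often ignore the key
)

# traces_df is a hypothetical dataframe of exported trace spans.
traces_df = pd.DataFrame({"input": ["..."], "reference": ["..."]})
results = llm_classify(
    dataframe=traces_df,
    model=eval_model,
    template=RAG_RELEVANCY_PROMPT_TEMPLATE,
    rails=["relevant", "irrelevant"],
)
```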
-
Hey,
I tried to run ColBERT model inference via Triton server on a multi-GPU instance.
GPU 0 works fine. However, the other GPU devices (1, 2, 3, etc.) crash when execution reaches this line:
```
D_pac…
```
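Not the poster's code, but a sketch of the usual first thing to check when only `cuda:0` works, assuming PyTorch: any tensor created with a bare `.cuda()` or an implicit default device lands on device 0, which crashes kernels running on other GPUs. The function and variable names here are illustrative:

```python
import torch

def pack_on(device_id: int, D: torch.Tensor) -> torch.Tensor:
    """Move inputs explicitly to this instance's GPU before any indexing."""
    device = torch.device(f"cuda:{device_id}")
    D = D.to(device)  # not D.cuda(), which defaults to cuda:0
    # The boolean mask must live on the same device as D, or boolean
    # indexing fails on every GPU other than the default one.
    mask = torch.ones(D.shape[:2], dtype=torch.bool, device=device)
    return D[mask]  # packed (N, dim) view, all on one device
```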
-
## Describe the bug
I cannot expose Triton metrics in the deployment: I put the ports description in the Pod.v1 spec and use the Triton implementation, but the metrics ports are not recognized.
Triton serv…
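As a quick check from inside the pod (not a fix for the spec itself), a sketch probing Triton's default endpoints: if `/metrics` answers locally but not through the Service, the problem is the port mapping rather than Triton:

```python
import requests

# Triton's default ports: 8000 HTTP, 8001 gRPC, 8002 Prometheus metrics.
endpoints = {
    "health":  "http://localhost:8000/v2/health/ready",
    "metrics": "http://localhost:8002/metrics",
}
for name, url in endpoints.items():
    try:
        r = requests.get(url, timeout=2)
        print(f"{name}: HTTP {r.status_code}")
    except requests.ConnectionError:
        print(f"{name}: connection refused (no listener on that port)")
```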
-
**Description**
We are encountering an issue with the Triton Inference Server's in-process Python API where the metrics port (default: 8002) does not open. This results in a 'connection refused' er…
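A sketch of the usual explanation, assuming the `tritonserver` in-process package: the in-process API runs only the server core, and the network frontends (including metrics) have to be started separately. The `tritonfrontend` class and option names below are assumptions from one release and may differ in yours:

```python
import tritonserver

# The in-process API starts the server core only; by itself nothing
# listens on 8000/8001/8002, hence 'connection refused' on the metrics port.
server = tritonserver.Server(model_repository="/models")
server.start(wait_until_ready=True)

# Assumption: the companion `tritonfrontend` package provides the
# network endpoints; without starting one, port 8002 never opens.
from tritonfrontend import Metrics  # assumed frontend class

metrics = Metrics(server, Metrics.Options(port=8002))
metrics.start()
```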
-
Hello,
I've been deploying my VQA (Visual Question Answering) model using Triton Server and utilizing the `perf_analyzer` tool for benchmarking. However, using random data for the VQA model leads to unde…
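For what it's worth, `perf_analyzer` accepts real inputs via `--input-data <file>.json`. A sketch generating such a file, where the input names `IMAGE` and `QUESTION` are placeholders for whatever the model's config.pbtxt declares:

```python
import base64
import json

with open("sample.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

# perf_analyzer JSON format: top-level "data" is a list of request
# payloads, each mapping input names to values; {"b64": ...} under
# "content" carries raw binary data.
payload = {
    "data": [
        {
            "IMAGE": {"content": {"b64": image_b64}, "shape": [1]},
            "QUESTION": ["What is in the picture?"],  # BYTES input
        }
    ]
}
with open("real_inputs.json", "w") as f:
    json.dump(payload, f)

# Then: perf_analyzer -m vqa_model --input-data real_inputs.json
```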
-
### System Info
- CPU architecture: x86_64
- CPU/Host memory size: 32 GB
- GPU name: L4 (g2-standard-8, GCP)
- GPU memory size: 24 GB
- TensorRT-LLM branch or tag (e.g., main, v0.10.0)
- Nvi…
-
Hello.
I am writing to inquire about the PyTorch version used in the Triton Inference Server 24.01 release.
Upon reviewing the documentation, I noticed that Triton 24.01 includes PyTorch version…
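Independent of the documentation, the bundled version can be confirmed directly inside the 24.01 container, assuming the image ships the Python `torch` package:

```python
# Run inside nvcr.io/nvidia/tritonserver:24.01-py3 (or the matching
# pytorch:24.01-py3 NGC image) to confirm the shipped versions.
import torch

print(torch.__version__)   # PyTorch build bundled with the release
print(torch.version.cuda)  # CUDA toolkit version it was built against
```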