-
Hi team, QQ: does `lightseq` support the following:
- Convert HuggingFace BERT/RoBERTa models to `int8` precision directly
- If yes, can the converted model be exported to ONNX format directly?
- …
-
I'm trying to run inference with a Mistral 7B model on Triton; however, I run into issues when I try to launch the server from my image. I suspect it's an issue with some MPI and Triton shared libr…
-
**Description**
I am deploying a YOLOv8 model for object detection using Triton with the ONNX backend on Kubernetes. I have experienced significant CPU throttling in the sidecar container ("queue-proxy")…
-
When I launch a multi-GPU Triton server with
`python scripts/launch_triton_server.py --world_size 4 --model_repo /path/to/model/repo`
I get a "port in use" error:
21 09:27:15.346696872 166 chttp2_s…
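Since the chttp2 message comes from gRPC's transport, the conflicting port is likely the gRPC one. A quick, hypothetical way to check whether Triton's default ports are already bound before relaunching is a small socket probe (8000/8001/8002 are Triton's default HTTP/gRPC/metrics ports; adjust if you override them):

```python
# Hedged sketch: probe Triton's default ports to see which are already bound.
import socket

def port_in_use(port, host="0.0.0.0"):
    """Return True if something is already listening on the given port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        # SO_REUSEADDR avoids false positives from sockets lingering in TIME_WAIT.
        s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        try:
            s.bind((host, port))
            return False
        except OSError:
            return True

for port in (8000, 8001, 8002):  # default HTTP, gRPC, metrics ports
    print(port, "in use" if port_in_use(port) else "free")
```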
-
### Your current environment
```text
podman --version
podman version 5.2.3
uname -a
Linux noelo-work 6.10.12-200.fc40.x86_64 #1 SMP PREEMPT_DYNAMIC Mon Sep 30 21:38:25 UTC 2024 x86_64 GNU/L…
```
-
Triton provides an extension to the standard gRPC inference API for streaming (`inference.GRPCInferenceService/ModelStreamInfer`); this extension is required to use the vLLM backend with Triton.
However …
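For reference, a minimal sketch of driving that streaming endpoint from the Python gRPC client is below. The model name and the `text_input`/`text_output`/`stream` tensor names are assumptions based on the vLLM backend's sample model and may differ in your repository:

```python
# Hedged sketch: use the Python gRPC client's streaming API, which is carried
# over ModelStreamInfer. Tensor and model names are assumptions.
import queue
import numpy as np
import tritonclient.grpc as grpcclient

responses = queue.Queue()

def on_response(result, error):
    # Each streamed response (or error) is delivered here as it arrives.
    responses.put(error if error is not None else result)

client = grpcclient.InferenceServerClient(url="localhost:8001")

text = grpcclient.InferInput("text_input", [1], "BYTES")
text.set_data_from_numpy(np.array(["What is Triton?"], dtype=object))
stream_flag = grpcclient.InferInput("stream", [1], "BOOL")
stream_flag.set_data_from_numpy(np.array([True]))

client.start_stream(callback=on_response)  # opens the ModelStreamInfer stream
client.async_stream_infer(
    model_name="vllm_model",
    inputs=[text, stream_flag],
    outputs=[grpcclient.InferRequestedOutput("text_output")],
)
client.stop_stream()  # waits for pending responses, then closes the stream

while not responses.empty():
    item = responses.get()
    if isinstance(item, Exception):
        raise item
    print(item.as_numpy("text_output"))
```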
-
### System Info
- Ubuntu 20.04
- NVIDIA A100
### Who can help?
@kaiyux
### Information
- [X] The official example scripts
- [ ] My own modified scripts
### Tasks
- [ ] An officially supported …
-
**Description**
There is abnormal system memory usage when GPU metrics are enabled.
With GPU metrics enabled:
command: `tritonserver --model-repository=/models`
**after a long time waiting**
![185854](…
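To quantify the growth over time, one possible approach is to sample the server's resident memory alongside the metrics endpoint. This is a rough sketch only: the metrics port 8002 is Triton's default, and the psutil-based PID lookup and `nv_gpu` metric prefix are assumptions:

```python
# Hedged sketch: periodically record tritonserver's RSS while GPU metrics are
# enabled, to confirm whether memory keeps growing over a long run.
import time
import psutil
import requests

def find_tritonserver():
    for proc in psutil.process_iter(["name"]):
        if proc.info["name"] == "tritonserver":
            return proc
    raise RuntimeError("tritonserver process not found")

server = find_tritonserver()
for _ in range(60):  # sample once a minute for an hour
    rss_mb = server.memory_info().rss / 1e6
    metrics = requests.get("http://localhost:8002/metrics", timeout=5).text
    gpu_lines = [l for l in metrics.splitlines() if l.startswith("nv_gpu")]
    print(f"rss={rss_mb:.0f} MB, gpu metric lines={len(gpu_lines)}")
    time.sleep(60)
```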
-
I used a fine-tuned llama2 model and built it with tensorrt-llm using awq-int4, tp_size=4, max_input_length=8000, max_output_length=8000.
The model runs perfectly under tensorrt-llm.
When I use Trito…
-
**Description**
I deployed Triton Inference Server on Kubernetes (GKE). To balance the load, I created a Load Balancer Service. As a client, I'm using the Python HTTP client. I was expecting all the …