-
Figure 1: the error message; Figure 2: the error location, which is in the official WhisperEncoding source code.
trt-llm version: 0.14.0.dev2024091700
Using the nvcr.io/nvidia/tritonserver:24.07-py3 image directly.
-
**Issue Description:**
During a graceful shutdown of Triton Server, we've observed the following behavior (see the sketch after this list):
- Triton Server is hosting both Model A and Model B.
- Model B can make calls to Model…
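For concreteness, inter-model calls like this are usually made through Python-backend BLS. Below is a minimal sketch of that pattern; the model and tensor names (`model_a`, `INPUT`, `OUTPUT`) are placeholders, not the actual deployment:

```python
import triton_python_backend_utils as pb_utils

class TritonPythonModel:
    def execute(self, requests):
        responses = []
        for request in requests:
            in_tensor = pb_utils.get_input_tensor_by_name(request, "INPUT")
            # Model B forwards the request to Model A via a BLS call.
            bls_request = pb_utils.InferenceRequest(
                model_name="model_a",
                requested_output_names=["OUTPUT"],
                inputs=[in_tensor],
            )
            bls_response = bls_request.exec()  # blocks until Model A responds
            if bls_response.has_error():
                raise pb_utils.TritonModelException(bls_response.error().message())
            out = pb_utils.get_output_tensor_by_name(bls_response, "OUTPUT")
            responses.append(pb_utils.InferenceResponse(output_tensors=[out]))
        return responses
```

During a graceful shutdown, a blocking `exec()` like this can presumably still be in flight while the callee model is being unloaded, which appears to be the interaction the report describes.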
-
**Description**
Triton uses over 100% of physical memory and freezes the server when using a decoupled DALI model with a long video input (a minimal streaming-client sketch follows this snippet).
**Triton Information**
Docker `nvcr.io/nvidia/tritonserv…
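For reference, a decoupled model has to be driven through the gRPC streaming client API. Here is a minimal sketch; the model name `dali_video` and input tensor `INPUT` are assumptions, not the actual pipeline:

```python
import queue
import numpy as np
import tritonclient.grpc as grpcclient

results = queue.Queue()

def callback(result, error):
    # A decoupled model may emit many responses for a single request.
    results.put(error if error is not None else result)

client = grpcclient.InferenceServerClient("localhost:8001")
client.start_stream(callback=callback)

inp = grpcclient.InferInput("INPUT", [1], "BYTES")
inp.set_data_from_numpy(np.array([b"long_video.mp4"], dtype=np.object_))
client.async_stream_infer(model_name="dali_video", inputs=[inp])

client.stop_stream()  # flushes the stream; responses are queued in `results`
while not results.empty():
    print(results.get())
```

With a long video, every intermediate response buffered by the server adds to host memory use, which may be relevant to the blow-up described above.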
-
I'm trying to use Triton to deploy baichuan2-13B for inference at bf16 precision. The tritonserver starts successfully, but it crashes while processing a client request (an example request is sketched below).
- Use TensorRT-LLM v0…
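For reference, the request path that triggers the crash is an ordinary client inference call. A minimal sketch follows; the model and tensor names (`ensemble`, `text_input`, `max_tokens`, `text_output`) follow the usual tensorrtllm_backend ensemble layout and are assumptions here:

```python
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient("localhost:8000")

# String inputs are sent as BYTES tensors backed by numpy object arrays.
prompt = np.array([["What is machine learning?"]], dtype=np.object_)
inp = httpclient.InferInput("text_input", prompt.shape, "BYTES")
inp.set_data_from_numpy(prompt)

max_tokens = np.array([[64]], dtype=np.int32)
tok = httpclient.InferInput("max_tokens", max_tokens.shape, "INT32")
tok.set_data_from_numpy(max_tokens)

result = client.infer("ensemble", inputs=[inp, tok])
print(result.as_numpy("text_output"))
```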
-
There is a potential for the TRTIS detection component's async thread to wait forever for a response from the server. This is a [known issue](https://github.com/NVIDIA/triton-inference-server/pull/176…
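If the caller can be changed, one common mitigation is a client-side timeout so the thread fails fast instead of blocking indefinitely. A minimal sketch with the current tritonclient gRPC API (model and tensor names are placeholders):

```python
import numpy as np
import tritonclient.grpc as grpcclient
from tritonclient.utils import InferenceServerException

client = grpcclient.InferenceServerClient("localhost:8001")
inp = grpcclient.InferInput("INPUT", [1, 3], "FP32")
inp.set_data_from_numpy(np.zeros((1, 3), dtype=np.float32))
try:
    # client_timeout is in seconds; the call raises instead of waiting forever.
    result = client.infer("detector", inputs=[inp], client_timeout=5.0)
except InferenceServerException as e:
    print(f"request timed out or failed: {e}")
```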
-
### Proposal to improve performance
_No response_
### Report of performance regression
_No response_
### Misc discussion on performance
To reproduce vLLM's performance benchmark, please…
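While the official steps are cut off above, a self-contained offline throughput measurement with vLLM's Python API might look like the sketch below; the model and prompts are arbitrary placeholders, not the official benchmark setup:

```python
import time
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")  # placeholder model
prompts = ["Hello, my name is"] * 32
params = SamplingParams(max_tokens=128)

start = time.perf_counter()
outputs = llm.generate(prompts, params)
elapsed = time.perf_counter() - start

total_tokens = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"{total_tokens / elapsed:.1f} generated tokens/s")
```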
-
**Description**
According to the Framework matrix (https://docs.nvidia.com/deeplearning/frameworks/support-matrix/index.html#framework-matrix-2024), 24.05 is supposed to support TensorRT 10.0.6.1. Th…
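A quick way to confirm what the container actually ships is to query the TensorRT Python bindings inside the 24.05 image:

```python
# Run inside the 24.05 container and compare the result against the
# support-matrix entry (10.0.6.1).
import tensorrt as trt
print(trt.__version__)
```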
-
I was recently deploying Hugging Face models on the Triton Inference Server, which helped me increase GPU utilization and serve multiple models on a single GPU.
I was not able to find good r…
-
From req doc:
**OOTB support for NVidia Triton Inference Server**
- We are going with OpenVINO for now, as Triton currently cannot be built due to maintenance concerns.
Acceptance criteria:
- Scope…