-
**Is your feature request related to a problem? Please describe.**
The goal of this feature is to simplify Feast integration for model serving platforms. Feast feature servers have custom HTTP/gRPC i…
-
I would like to use this as a Python backend within `triton-inference-server` in order to bring my production parameters into better alignment with training / validation.
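To make this concrete, here is a rough sketch of the kind of `model.py` I have in mind for the Python backend; the Feast repo path, feature refs, entity key, and tensor names below are made up for illustration:
```
# model.py -- sketch of a Triton Python backend wrapping a Feast feature store.
import numpy as np
from feast import FeatureStore
import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    def initialize(self, args):
        # Assumes the Feast repo is shipped alongside the model (hypothetical path).
        self.store = FeatureStore(repo_path="/models/feast_repo")
        self.features = [
            "driver_hourly_stats:conv_rate",   # hypothetical feature refs
            "driver_hourly_stats:acc_rate",
        ]

    def execute(self, requests):
        responses = []
        for request in requests:
            # Hypothetical entity-key input tensor named "driver_id".
            ids = pb_utils.get_input_tensor_by_name(request, "driver_id").as_numpy()
            rows = [{"driver_id": int(i)} for i in ids.flatten()]
            feats = self.store.get_online_features(
                features=self.features, entity_rows=rows
            ).to_dict()
            # Stack the returned feature columns into one (batch, n_features) tensor.
            out = np.array(
                [feats[f.split(":")[-1]] for f in self.features], dtype=np.float32
            ).T
            responses.append(
                pb_utils.InferenceResponse(
                    output_tensors=[pb_utils.Tensor("features", out)]
                )
            )
        return responses
```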
Are there plans…
-
whisper.cpp ships with a [server](https://github.com/ggerganov/whisper.cpp/tree/master/examples/server). Isn't using that faster than loading the model again for each request?
Doing this should be …
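For reference, posting audio to the bundled server from Python could look roughly like this; the `/inference` endpoint and form fields follow the server example's README, so adjust them if your build differs:
```
import requests

# Send one request to an already-running whisper.cpp server instead of
# reloading the model for every transcription.
with open("sample.wav", "rb") as f:
    resp = requests.post(
        "http://127.0.0.1:8080/inference",
        files={"file": f},
        data={"temperature": "0.0", "response_format": "json"},
    )
resp.raise_for_status()
print(resp.json())
```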
-
### Problem Statement
I can see why Gab was confused here: #1329
@vansangpfiev
Can we use the groupings @dan-homebrew / I originally suggested? 🙏
Current
- Not super accurate bc `chat` shou…
-
### System Info
When using Qwen2, running inference with the engine through the run.py script produces normal output. However, when using Triton for inference, some characters appear garbled, and the out…
-
Hi,
I noticed there is no Slack, Discord, or IRC channel for TensorRT - a channel could offload some future tickets by letting people discuss things there - so I created one.
I hope it's ok to advertise …
-
Timeouts on the inference service should not result in a 503. The REST suppressed logger reported this.
Interestingly, the timeout is 10s; isn't the default 30s?
```
"error.stack_trace": "o…
-
**Description**
We have an ensemble of 2 models chained together (description of models below).
Calling only the "preprocessing" model yields a max throughput of 21500 QPS @ 6 CPU cores usage
Cal…
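For context, the comparison boils down to calling the "preprocessing" model alone versus calling the ensemble that chains both models. A rough client sketch of the two calls (tensor names, shapes, dtypes, and the ensemble's model name are placeholders):
```
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# Placeholder input tensor; the real name/shape/dtype come from the model config.
payload = np.random.rand(1, 3, 224, 224).astype(np.float32)
inp = httpclient.InferInput("INPUT__0", list(payload.shape), "FP32")
inp.set_data_from_numpy(payload)

# Preprocessing model alone (~21500 QPS @ 6 CPU cores in our test).
client.infer("preprocessing", inputs=[inp])

# Full ensemble: preprocessing chained into the second model.
client.infer("ensemble", inputs=[inp])
```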
-
**Description**
The Triton Inference Server is deployed on a CPU-only device.
There are about 32 models (onnxruntime).
The Triton Inference Server goes down during long load testing. It stops …
-
### System Info
- Ubuntu 20.04
- NVIDIA A100
### Who can help?
@kaiyux
### Information
- [X] The official example scripts
- [ ] My own modified scripts
### Tasks
- [ ] An officially supported …