-
It would be helpful to have an option to specify which GPU to use when running inference on a machine with multiple GPUs. In my case, I am running multiple MONAILabel servers, each with its own dedica…
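Until a first-class option exists, a common workaround is to restrict GPU visibility per process with `CUDA_VISIBLE_DEVICES`. A minimal sketch, assuming each MONAILabel server runs as its own process; the app path, dataset path, and ports below are placeholders:

```python
# Sketch of a workaround: pin each MONAILabel server process to one GPU by
# setting CUDA_VISIBLE_DEVICES in its environment before launch.
# The app path, dataset path, and ports are placeholders.
import os
import subprocess

servers = [
    {"gpu": "0", "port": "8000"},
    {"gpu": "1", "port": "8001"},
]

for s in servers:
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=s["gpu"])
    subprocess.Popen(
        [
            "monailabel", "start_server",
            "--app", "apps/radiology",       # placeholder app
            "--studies", "datasets/spleen",  # placeholder dataset
            "--port", s["port"],
        ],
        env=env,
    )
```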
-
Hello,
Thank you for creating [openai_server.py](https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/apps/openai_server.py). It has been very helpful in avoiding the need to use vLLM or other O…
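For reference, a minimal client sketch against an OpenAI-compatible endpoint like the one that script exposes; the base URL, API key, and model name are assumptions that depend on how the server was launched:

```python
# Minimal sketch of a client for an OpenAI-compatible server such as
# examples/apps/openai_server.py. Base URL, API key, and model name are
# placeholders and depend on how the server was started.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

resp = client.chat.completions.create(
    model="my-trtllm-model",  # placeholder model name
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=64,
)
print(resp.choices[0].message.content)
```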
-
Hello, I followed the latest instructions when installing on a server using CUDA 12.1. However, when I run python inference.py, I encounter an error as shown in the image. Please help.
I'm using Ubun…
-
Same issue as reported here: #161
I just downloaded it and attempted to run it without parameters.
```LOG
2024-11-10T21:05:18.220716Z INFO blue_candle: Starting Blue Candle object detection service
2…
```
-
## Description
Platform containers reach 100% CPU usage and become unresponsive.
This causes the liveness probe to fail and the container to restart.
## Environment
1. OS (where OpenCTI server runs): Ubuntu 22.04 LT…
-
whisper.cpp ships with a [server](https://github.com/ggerganov/whisper.cpp/tree/master/examples/server). Isn't using that faster than loading the model again for each request?
Doing this should be …
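For context, a rough sketch of sending requests to one long-running whisper.cpp server instead of reloading the model per request; the model path, port, endpoint name, and form fields are assumptions based on the example's README and may differ between versions:

```python
# Rough sketch: talk to a long-running whisper.cpp server rather than
# loading the model again for each request. Assumes the server example was
# started roughly as:
#   ./server -m models/ggml-base.en.bin --port 8080
# Endpoint name and form fields follow the example's README and may vary.
import requests

with open("audio.wav", "rb") as f:
    resp = requests.post(
        "http://127.0.0.1:8080/inference",
        files={"file": f},
        data={"response_format": "json"},
    )
print(resp.json())
```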
-
### System Info
AWS EC2 instance: g6e.48xlarge
TensorRT-LLM v0.13.0
Triton Inference Server v2.50.0
Nvidia `24.09-py3-min` used as the base image for the Docker template
### Who can help?
@xuanzic
### In…
-
Right now, in experiments I have been running, there is a significant bottleneck in retrieving and saving results during parallel batch inference. This severely limits throughput, as each worke…
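One pattern that usually helps with this kind of bottleneck, sketched below under the assumption that results are independent records, is to hand completed results to a background writer so workers never block on I/O. All function names here are illustrative placeholders, not part of any existing API:

```python
# Illustrative sketch: decouple result saving from inference so workers
# never block on I/O. run_inference/save_result are placeholders for
# whatever the real pipeline does.
import queue
import threading

def run_inference(batch):
    return batch                       # placeholder

def save_result(result):
    pass                               # placeholder: write to disk/DB/object store

results_q = queue.Queue(maxsize=1024)  # bounded so memory stays in check

def writer():
    while True:
        result = results_q.get()
        if result is None:             # sentinel: all batches processed
            break
        save_result(result)

writer_thread = threading.Thread(target=writer)
writer_thread.start()

for batch in ([1, 2], [3, 4]):         # stand-in for the real batch iterator
    results_q.put(run_inference(batch))

results_q.put(None)                    # tell the writer to stop
writer_thread.join()
```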
-
**Is your feature request related to a problem? Please describe.**
The goal of this feature is to simplify Feast integration for model serving platforms. Feast feature servers have custom HTTP/gRPC i…
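For context, a minimal sketch of how a serving platform currently fetches online features from a Feast HTTP feature server; the feature refs, entity keys, and port are assumptions:

```python
# Minimal sketch of fetching online features from a Feast HTTP feature
# server. Feature refs, entity keys, and the port are placeholders.
import requests

payload = {
    "features": ["driver_hourly_stats:conv_rate"],  # placeholder feature ref
    "entities": {"driver_id": [1001, 1002]},        # placeholder entity keys
}

resp = requests.post(
    "http://localhost:6566/get-online-features",    # assumed default `feast serve` port
    json=payload,
)
print(resp.json())
```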