-
My GPU Config
TensorRT Engine Build Command
python3 build.py --model_dir /opt/llms/llama-7b \
    --dtype float16 \
    --remove_i…
-
The question is: how do you free the memory?
https://github.com/triton-inference-server/onnxruntime_backend/issues/103
When the model is deployed on a single GPU, I can specify real-time release of…
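For reference, when Triton runs in explicit model-control mode, unloading a model is the call that asks the backend to release whatever it allocated at load time (whether the ONNX Runtime backend actually returns the GPU memory afterwards is what the linked issue is about). A minimal sketch, assuming an HTTP endpoint on localhost:8000 and a placeholder model name:

```python
# Minimal sketch: load/unload through the repository API, assuming the
# server was started with --model-control-mode=explicit on localhost:8000.
# "my_model" is a placeholder name.
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# Unloading asks Triton (and the backend) to release the resources that
# were allocated when the model was loaded.
client.unload_model("my_model")

# The model can be loaded again later on demand.
client.load_model("my_model")
```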
-
**Description**
I've loaded a model via the `v2/repository/models/simple/load` endpoint.
But when querying the `v2/repository/index` endpoint I get `[]` as a response.
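The same sequence can be reproduced with the Python client; a minimal sketch, assuming explicit model-control mode and an HTTP endpoint on localhost:8000:

```python
# Minimal sketch: load the model, then list the repository index.
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

client.load_model("simple")

# Each index entry should report the model's name, version, and state.
print(client.get_model_repository_index())
print(client.is_model_ready("simple"))
```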
**Triton Information**
What ver…
-
**Is your feature request related to a problem? Please describe.**
This issue is similar to the one mentioned here: https://github.com/triton-inference-server/server/issues/7287. I'd like to file an …
-
**Description**
When starting Triton Server with tracing and with a generic model (e.g., `identity_model_fp32` from the Python backend example), the server crashes with signal 11 after handling a f…
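For context, the client side of such a reproduction would look roughly like the sketch below; the tensor names INPUT0/OUTPUT0 and the input shape are assumptions, not taken from the report.

```python
# Minimal sketch of repeatedly calling the identity model over HTTP.
# Tensor names (INPUT0/OUTPUT0) and the shape are assumptions.
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

data = np.random.rand(1, 16).astype(np.float32)
inp = httpclient.InferInput("INPUT0", list(data.shape), "FP32")
inp.set_data_from_numpy(data)

# The crash reportedly appears only after some requests have been
# handled, so the request is issued in a loop.
for _ in range(100):
    result = client.infer("identity_model_fp32", inputs=[inp])
    out = result.as_numpy("OUTPUT0")
```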
-
We discussed interacting with image models (both for predictions and for embeddings) via an API rather than directly from Python.
* Simplify adding pipeline stages (and replacing pipeline frameworks…
-
## Bug Description
I'm trying to serve a Torch-TensorRT optimized model with NVIDIA Triton Server, based on the provided tutorial:
https://pytorch.org/TensorRT/tutorials/serving_torch_tensorrt_with_t…
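A workflow along these lines boils down to compiling the model with Torch-TensorRT and saving the result as TorchScript into the Triton model repository; a minimal sketch, where the ResNet-50 model, input shape, FP16 precision, and repository path are illustrative assumptions:

```python
# Minimal sketch: compile a model with Torch-TensorRT and save it as
# TorchScript for Triton's PyTorch backend. The ResNet-50 model, input
# shape, FP16 precision, and repository path are illustrative assumptions.
import torch
import torch_tensorrt
import torchvision.models as models

model = models.resnet50(weights=None).eval().cuda()

# Trace first so the TorchScript compilation path is used.
example = torch.randn(1, 3, 224, 224).cuda()
scripted = torch.jit.trace(model, example)

trt_model = torch_tensorrt.compile(
    scripted,
    inputs=[torch_tensorrt.Input((1, 3, 224, 224))],
    enabled_precisions={torch.float16},
)

# Triton's PyTorch backend expects <repo>/<model_name>/<version>/model.pt.
torch.jit.save(trt_model, "model_repository/resnet50_trt/1/model.pt")
```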
-
### Checklist
- [ ] 1. I have searched related issues but cannot get the expected help.
- [ ] 2. The bug has not been fixed in the latest version.
- [ ] 3. Please note that if the bug-related iss…
-
**Description**
If I load 2 models (a transformer model and an inference model), the GPU memory used is about 3 GiB.
```
PID USER DEV TYPE GPU GPU MEM CPU HOST MEM Command
2207044 coreai 0 C…
-
#### Description
I am currently working on deploying the Seamless M4T model for text-to-text translation on a Triton server. I have successfully exported the `text.encoder` to ONNX and traced it …
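For orientation, exporting an encoder submodule to ONNX typically follows a pattern like the sketch below; the way the encoder and its dummy inputs are obtained, the tensor names, and the opset version are placeholders rather than the exact values used for `text.encoder`.

```python
# Minimal sketch of exporting an encoder submodule to ONNX. The encoder
# handle, input names, shapes, and opset version are placeholders.
import torch

def export_encoder(encoder: torch.nn.Module, out_path: str = "text_encoder.onnx"):
    encoder.eval()
    # Dummy token IDs; batch and sequence dimensions are marked dynamic.
    dummy_ids = torch.ones(1, 16, dtype=torch.long)
    torch.onnx.export(
        encoder,
        (dummy_ids,),
        out_path,
        input_names=["input_ids"],
        output_names=["encoder_output"],
        dynamic_axes={
            "input_ids": {0: "batch", 1: "sequence"},
            "encoder_output": {0: "batch", 1: "sequence"},
        },
        opset_version=17,
    )
```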