-
### System Info
- CPU architecture: x86_64
- CPU/Host memory size: 32 GB
- GPU name: L4 (GCP g2-standard-8)
- GPU memory size: 24 GB
- TensorRT-LLM branch or tag (e.g., main, v0.10.0)
- Nvi…
-
I was recently deploying Hugging Face models on the Triton Inference Server, which helped me increase my GPU utilization and serve multiple models using a single GPU.
Was not able to find good r…
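Serving several models on one GPU usually comes down to a Triton model repository with one directory per model, plus an `instance_group` in each model's `config.pbtxt`. A minimal sketch of one such config (the model name, backend, and tensor names here are hypothetical, not from the original post):

```
# config.pbtxt for a hypothetical "distilbert" model exported to ONNX.
# Other models in the same repository are loaded alongside it on GPU 0.
name: "distilbert"
backend: "onnxruntime"
max_batch_size: 8
input [
  {
    name: "input_ids"
    data_type: TYPE_INT64
    dims: [ -1 ]
  }
]
output [
  {
    name: "logits"
    data_type: TYPE_FP32
    dims: [ 2 ]
  }
]
# Two concurrent instances of this model share GPU 0.
instance_group [
  {
    count: 2
    kind: KIND_GPU
    gpus: [ 0 ]
  }
]
```

With several such model directories in one repository, a single `tritonserver --model-repository=...` process serves all of them from the same GPU.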
-
We use Triton Inference Server for online inference. Can the DeepRec processor be used in Triton Inference Server?
-
Hello
I got the Docker image 0.6.0 and just tried to run the two demo commands:
1. docker run -it --rm --gpus all \
-v $PWD:/project ghcr.io/els-rd/transformer-deploy:0.6.0 \
bash -c "cd /project && \
…
-
I have an ensemble model.
Model 1's output is 66 cropped images; model 1 is a Python backend model. I manually resized/padded them into 3 batches with shapes
(30, 3, 48, 320), (30, 3, 48, 976), (6, 3, 48, 1280)
(I …
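The grouping step described above can be sketched in plain Python: bucket each crop by the smallest target width that fits it, pad its width up to that bucket, and batch per bucket. The target widths and crop sizes below are assumptions for illustration, not the poster's actual pipeline:

```python
# Sketch: group variable-width crops into a few fixed-width batches.
# Each crop shape is (channels, height, width); widths are hypothetical.
TARGET_WIDTHS = [320, 976, 1280]  # padded widths, smallest first

def bucket_for(width):
    """Return the smallest target width that fits the crop."""
    for t in TARGET_WIDTHS:
        if width <= t:
            return t
    return TARGET_WIDTHS[-1]  # wider crops clamp to the last bucket

def make_batches(crop_shapes):
    """Map target width -> list of crop shapes padded to that width."""
    batches = {t: [] for t in TARGET_WIDTHS}
    for c, h, w in crop_shapes:
        t = bucket_for(w)
        batches[t].append((c, h, t))  # width padded up to the bucket
    return batches

# 66 hypothetical crops: 30 narrow, 30 medium, 6 wide
crops = [(3, 48, 300)] * 30 + [(3, 48, 900)] * 30 + [(3, 48, 1200)] * 6
batches = make_batches(crops)
print({t: len(v) for t, v in batches.items()})  # {320: 30, 976: 30, 1280: 6}
```

Grouping by padded width keeps each batch a dense tensor, which is what the downstream model in the ensemble expects.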
-
From the req doc:
**OOTB support for NVIDIA Triton Inference Server**
- We are going with OpenVINO right now, as Triton cannot be built at the moment due to maintenance concerns.
Acceptance criteria:
- Scope…
-
**Is your feature request related to a problem? Please describe.**
1. We would like to try parallel model execution on iGPU+DLA devices. Is it possible to run triton-inference-server on a V3NP or Ori…
-
**Description**
I am trying to build a Triton Docker image following https://github.com/triton-inference-server/server/blob/r23.07/docs/customization_guide/build.md#building-with-docker
Using …
-
I am testing with basic models. The model takes an input and returns the same output with the same datatype.
Inference is happening:
2024-08-20 09:35:15,923 - INFO - array_final: array([[103]], dtype=uint8)
a…
-
So far the latest publicly available Triton Inference Server with the Paddle backend is `paddlepaddle/triton_paddle:21.10`, and there have been lots of bug fixes since then. I'm experiencing an increasing amount…