-
Here's the overall architecture of Triton:
![image](https://user-images.githubusercontent.com/166481/82379259-74854500-99db-11ea-9928-99370fb74d34.png)
In scope:
- Triton server
- Client SDKs …
-
Hi there! I'm serving TensorRT-LLM models from Python and I'm wondering what the recommended approach is for serving multiple models at once. I've tried / considered:
- `GenerationS…
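A minimal client-side sketch of one of the approaches under consideration: a single tritonserver hosting several TensorRT-LLM models in one model repository, addressed by name over gRPC. The model names are hypothetical, and the tensor names (`text_input`, `max_tokens`, `text_output`) follow the tensorrtllm_backend ensemble convention; all of these must match each model's config.pbtxt.
```python
# Sketch, not a confirmed recipe: two TensorRT-LLM models ("model_a" and
# "model_b" are hypothetical names) deployed side by side in one model
# repository and queried by model name from one gRPC client.
import numpy as np
import tritonclient.grpc as grpcclient

client = grpcclient.InferenceServerClient(url="localhost:8001")

def generate(model_name: str, prompt: str, max_tokens: int = 64) -> np.ndarray:
    # Tensor names and shapes depend on your config.pbtxt; verify before use.
    text = grpcclient.InferInput("text_input", [1], "BYTES")
    text.set_data_from_numpy(np.array([prompt.encode()], dtype=np.object_))
    tokens = grpcclient.InferInput("max_tokens", [1], "INT32")
    tokens.set_data_from_numpy(np.array([max_tokens], dtype=np.int32))
    result = client.infer(model_name=model_name, inputs=[text, tokens])
    return result.as_numpy("text_output")

print(generate("model_a", "Hello"))
print(generate("model_b", "Hello"))
```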
-
Thank you very much for the incredible project!
First of all, it would be very helpful if you added documentation on how to manage GPU memory while using Triton.
I was doing several tests but …
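For what it's worth until such documentation exists: tritonserver exposes flags that bound its own memory pools. A minimal sketch (the byte values are illustrative, not recommendations, and these pools only cover Triton's own allocations; backend allocations such as TensorRT workspaces are configured per backend):
```sh
# Cap the pinned host memory pool at 256 MiB and the CUDA memory pool on
# GPU 0 at 256 MiB; the CUDA pool flag takes the format <gpu-id>:<bytes>.
tritonserver --model-repository=/models \
    --pinned-memory-pool-byte-size=268435456 \
    --cuda-memory-pool-byte-size=0:268435456
```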
-
Neither find_package() nor FetchContent works out of the box for a standalone C++ CMake app.
### find_package
Compile tritonclient manually and set CMAKE_PREFIX_PATH to the install folder. Altern…
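A sketch of the find_package route described above, assuming the client libraries were built and installed from the triton-inference-server/client repository. The package and target names below are assumptions; check the lib/cmake directory of your install for the exact names it exports.
```cmake
cmake_minimum_required(VERSION 3.17)
project(triton-client-app LANGUAGES CXX)

# Configure with the manual install prefix, e.g.:
#   cmake -DCMAKE_PREFIX_PATH=/opt/tritonclient ..
# TritonClient and the grpcclient target are assumed names; verify them
# against the CMake config files shipped in your install.
find_package(TritonClient REQUIRED)

add_executable(app main.cc)
target_link_libraries(app PRIVATE TritonClient::grpcclient)
```
The only essential point is that CMAKE_PREFIX_PATH contains the manual install prefix, so find_package can locate the config files installed there.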
-
The IngressRoutes (https://github.com/triton-inference-server/server/blob/main/deploy/k8s-onprem/templates/ingressroute.yaml) are deployed as a load balancer to spread requests across all Triton pods. H…
-
**Description**
When I use two clients to send `/v2/repository/models/MODEL/load` requests to the same server at the same time, the model is loaded twice.
**Triton Information**
What version of Tr…
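A minimal reproduction sketch for this report, assuming a server started with `--model-control-mode=explicit` on localhost:8000 and a model named `MODEL` (the placeholder from the issue) in the repository; `load_model` issues the same `/v2/repository/models/MODEL/load` POST described above.
```python
# Two concurrent load requests against the same server; with the race
# described in the issue, the model ends up loaded twice.
import threading
import tritonclient.http as httpclient

def load() -> None:
    # One client per thread; each call POSTs /v2/repository/models/MODEL/load.
    client = httpclient.InferenceServerClient(url="localhost:8000")
    client.load_model("MODEL")

threads = [threading.Thread(target=load) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```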
-
I would like to use techniques such as the multi-instance support provided by the tensorrt-llm backend. In the documentation, I can see that multiple models are served using modes like Leader mode and …
-
I'm using the nvcr.io/nvidia/tritonserver:23.10-py3 container for my inferencing, using the C++ GRPC API. There are several models in the container: a YOLOv8-like architecture in TensorRT plus a few TorchScript model…
-
Currently, I am trying to implement a custom k2 tritonserver backend, but I get this compilation error:
```
In file included from /usr/local/cuda/include/builtin_types.h:59,
                 from /…
-
I'm trying to run inference with the Mistral 7B model on Triton, but I am running into issues when I try to launch the server from my image. I suspect it's an issue with some MPI and Triton shared libr…