-
**Description**
sagemaker_server.cc exposes model loading/unloading through an HTTP POST request to SageMaker. I'm unable to load or unload models through SageMaker for Triton. I'm currently testing l…
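For reference, a minimal sketch of what a load/unload round trip against the SageMaker-style control routes might look like; the port, model name, and repository path below are assumptions for illustration, not values from this issue:
```
# Hypothetical sketch: exercise the SageMaker-style load/unload routes.
# The port, model name, and model path are placeholders.
import requests

BASE = "http://localhost:8080"  # adjust to the port the SageMaker frontend is bound to

# Load a model: POST /models with the model name and its repository location
resp = requests.post(
    f"{BASE}/models",
    json={"model_name": "my_model", "url": "/opt/ml/models/my_model/model"},
)
print("load:", resp.status_code, resp.text)

# Unload the same model: DELETE /models/<model_name>
resp = requests.delete(f"{BASE}/models/my_model")
print("unload:", resp.status_code, resp.text)
```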
-
TRT-LLM version: **0.5.0**
Triton server version: **23.10**
GPU type: A100, 80 GB, with MIG enabled (20 GB of GPU memory per split, 3 splits per node).
I am trying to run a Falcon-7B model with TRT-LL…
-
I have the NVIDIA driver and nvidia-docker installed. I have a 1060 with 6 GB of VRAM; it has compute capability 6.0.
How can I troubleshoot this? When I run other NVIDIA containers on my PC, I have to use --privileged to get …
-
Triton 2.10 supports ONNX, but I still get an error when loading the model.
Release link: https://github.com/triton-inference-server/server/releases
Input or output layers are empty
![image](https://…
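One way to check whether the exported model itself declares its inputs and outputs is to inspect the ONNX graph directly before handing it to Triton; the model path below is an assumption:
```
# Sketch: verify that the ONNX graph declares non-empty inputs and outputs.
# The path is a placeholder for the model in your repository.
import onnx

model = onnx.load("model_repository/my_onnx_model/1/model.onnx")
onnx.checker.check_model(model)

print("inputs :", [i.name for i in model.graph.input])
print("outputs:", [o.name for o in model.graph.output])
```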
-
https://github.com/triton-inference-server/
- [x] Build a Triton Docker image with support for the FasterTransformer backend for Fusion etc.
- [x] Convert h2oGPT models to a format that Triton understands h…
-
**Description**
The Python backend does not properly load the `model.py` file in the model directory when trailing slashes (`/`) are present in the `--backend-directory` option.
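As an illustration only (not Triton's actual code), a trailing slash changes what naive path handling derives from the backend directory, e.g. the final path component becomes empty, so anything built from it no longer resolves:
```
# Illustration of the trailing-slash pitfall, not Triton's implementation.
import os

for backend_dir in ("/opt/tritonserver/backends", "/opt/tritonserver/backends/"):
    print(backend_dir,
          "-> basename:", repr(os.path.basename(backend_dir)),
          "| normalized:", os.path.normpath(backend_dir))
```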
**Triton Informa…
-
Hi, I am able to reproduce building and running the model locally via TensorRT-LLM.
I build using:
```
python3 build.py --model_dir /finetune-gpt-neox/models--meta-llama--Llama-2-7b-hf/snapsho…
-
# 🐛 Bug
```
C:\Users\ZeroCool22\Desktop\SwarmUI\dlbackend\comfy>.\python_embeded\python.exe -s ComfyUI\main.py --windows-standalone-build
[START] Security scan
[DONE] Security scan
## ComfyUI-M…
-
**Description**
When using the ORT-TRT backend on GPU, the CPU memory usage is as high as it is when we use CPU inference.
**Triton Information**
What version of Triton are you using?
2.45.0
…
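To put numbers on that comparison, something like the following could report the server's resident CPU memory before and after loading the model; this is a hypothetical sketch using psutil, and the process name is assumed to be tritonserver:
```
# Hypothetical helper: report the resident (CPU) memory of tritonserver
# so the ORT-TRT and CPU-inference configurations can be compared.
import psutil

def triton_rss_mib():
    total = 0
    for proc in psutil.process_iter(["name", "memory_info"]):
        if proc.info["name"] == "tritonserver":
            total += proc.info["memory_info"].rss
    return total / (1024 ** 2)

print(f"tritonserver resident memory: {triton_rss_mib():.1f} MiB")
```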
-
Do you support an Exllamav2 backend for inference that supports EXL quants?
The current alternative is vLLM, but that doesn't support EXL quants. Also, after running a perplexity test, EXL is the b…
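For context on the comparison being made, perplexity is the exponential of the mean negative log-likelihood over the evaluated tokens; a minimal sketch with placeholder numbers, not results from any backend:
```
# Minimal perplexity sketch: exp of the mean negative log-likelihood.
# The per-token log-probabilities below are placeholder data.
import math

token_logprobs = [-2.1, -0.4, -1.3, -0.9]
perplexity = math.exp(-sum(token_logprobs) / len(token_logprobs))
print(f"perplexity = {perplexity:.2f}")
```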