-
### System Info
- GPU: L4
- GPU memory: 24 GB
- TensorRT-LLM version: v0.10.0
- Container: tritonserver:24.06-trtllm-python-py3
### Who can help?
@byshiue @schetlur-nv
### Information
- [X] The …
-
Hello MLCommons team,
I want to run the "Automated command to run the benchmark via MLCommons CM" (from the example: https://github.com/mlcommons/inference/tree/master/language/llama2-70b) with a d…
-
**Describe the bug**
When running a Docker container that serves uvicorn + FastAPI + an ORT inference session with a single model on a single uvicorn worker, handling at most 3 requests at a time, we reg…
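For context, a minimal sketch of the setup described above, assuming a placeholder `model.onnx`, a hypothetical `/predict` route, and an `asyncio.Semaphore` to cap in-flight requests at 3:

```python
import asyncio

import numpy as np
import onnxruntime as ort
from fastapi import FastAPI

app = FastAPI()
# One InferenceSession shared by the single uvicorn worker
# (the model path is a placeholder).
session = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])
# Cap concurrent inferences at 3, matching the setup described above.
semaphore = asyncio.Semaphore(3)

@app.post("/predict")
async def predict(payload: list[float]) -> dict:
    async with semaphore:
        input_name = session.get_inputs()[0].name
        x = np.asarray(payload, dtype=np.float32)[None, :]
        # session.run() blocks, so off-load it to a thread to keep
        # the event loop responsive.
        outputs = await asyncio.to_thread(session.run, None, {input_name: x})
    return {"output": outputs[0].tolist()}
```

Run with `uvicorn app:app --workers 1` to reproduce the single-worker configuration.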
-
I am trying to quantize a [Wav2Lip](https://github.com/Rudrabha/Wav2Lip) PyTorch model. When I run the code using the fbgemm backend, I run into the following error.
`AssertionError: Per channel weight…
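For context, fbgemm's default qconfig observes weights per channel, which PyTorch's eager-mode quantization does not support for ConvTranspose layers (Wav2Lip's decoder uses them). Below is a sketch of the usual workaround with a stand-in module rather than the real model; whether quantized ConvTranspose kernels are available for your backend depends on the PyTorch version:

```python
import torch
import torch.nn as nn
from torch.ao.quantization import (
    DeQuantStub, QuantStub, convert, default_qconfig,
    get_default_qconfig, prepare,
)

# Stand-in for Wav2Lip's decoder: a Conv followed by a ConvTranspose.
class TinyDecoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.quant = QuantStub()
        self.conv = nn.Conv2d(3, 8, 3)
        self.up = nn.ConvTranspose2d(8, 3, 2, stride=2)
        self.dequant = DeQuantStub()

    def forward(self, x):
        return self.dequant(self.up(self.conv(self.quant(x))))

model = TinyDecoder().eval()
model.qconfig = get_default_qconfig("fbgemm")  # per-channel weight observer
# Assumption: the assertion comes from per-channel observers on
# ConvTranspose layers; give those modules a per-tensor qconfig instead.
for m in model.modules():
    if isinstance(m, (nn.ConvTranspose1d, nn.ConvTranspose2d, nn.ConvTranspose3d)):
        m.qconfig = default_qconfig

prepared = prepare(model)
prepared(torch.randn(1, 3, 32, 32))  # calibration pass
quantized = convert(prepared)
```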
-
Hello maintainers!
In [the release notes of 24.08](https://docs.nvidia.com/deeplearning/triton-inference-server/release-notes/rel-24-08.html#rel-24-08), there is a known issue:
> Triton met…
-
**Description**
We are encountering an issue with the Triton Inference Server's in-process Python API where the metrics port (default: 8002) does not open. This results in a 'connection refused' er…
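For anyone reproducing this: Triton serves Prometheus-format metrics at `/metrics` on the metrics port (8002 by default), so a minimal probe from the same host shows whether the port ever opened:

```python
import requests

# A refused connection here reproduces the reported behavior.
try:
    resp = requests.get("http://localhost:8002/metrics", timeout=5)
    print(resp.status_code)
    print(resp.text[:200])
except requests.ConnectionError as exc:
    print(f"metrics port not open: {exc}")
```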
-
I have the following problem:
```
model=Honkware/openchat_8192-GPTQ
text-generation-launcher --model-id $model --num-shard 1 --quantize gptq --port 8080
```
```
Traceback (most recent call las…
```
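For reference, once the launcher does start, TGI exposes generation at `POST /generate`; a minimal request against the port configured above:

```python
import requests

# Port 8080 matches the --port flag passed to text-generation-launcher.
resp = requests.post(
    "http://localhost:8080/generate",
    json={"inputs": "Hello, world", "parameters": {"max_new_tokens": 20}},
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["generated_text"])
```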
-
Dear Developers:
I'm deploying a GPT model with triton-inference-server and fastertransformer_backend, following this tutorial: https://github.com/triton-inference-server/fastertransformer_backend/…
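A minimal client-side sanity check for such a deployment, assuming Triton's default HTTP port and the model name `fastertransformer` used in that tutorial's model repository (adjust to your config.pbtxt):

```python
import tritonclient.http as httpclient

# Default Triton HTTP port on the serving host.
client = httpclient.InferenceServerClient(url="localhost:8000")
print("server ready:", client.is_server_ready())
print("model ready:", client.is_model_ready("fastertransformer"))
```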
-
Thank you for your work. I followed the tutorial you provided and tried:
```
/usr/bin/apptainer run --nv rf_se3_diffusion.sif -u run_inference.py inference.deterministic=True diffuser.T=100 inference…
```