-
In general, we don't have a clear picture of how well we support TensorRT-LLM. It would be great to assess the current state of that support and fix anything that needs fixing.
-
### System Info
- CPU: X86
- GPU: NVIDIA L20
- python:
  - tensorrt 10.3.0
  - tensorrt-cu12 10.3.0
  - tensorrt-cu12-bindings 10.3.0
  - tensorrt-cu12-libs 10…
-
Outlines currently supports the vLLM inference engine; it would be great if it could also support the TensorRT-LLM inference engine.
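For what it's worth, the core of such an integration is largely engine-agnostic: guided generation reduces to masking, at each decoding step, the logits of tokens that the guide's automaton disallows. Below is a minimal sketch of that idea; `GuideLogitsProcessor` and the `initial_state`/`allowed_tokens`/`next_state` guide interface are hypothetical names for illustration, not Outlines' or TensorRT-LLM's actual API.

```python
import numpy as np


class GuideLogitsProcessor:
    """Hypothetical sketch: mask logits so that only tokens allowed by a
    finite-state guide (as Outlines uses for regex/JSON-constrained
    generation) can be sampled at each step."""

    def __init__(self, guide):
        # `guide` is assumed to expose: initial_state,
        # allowed_tokens(state) -> list[int], next_state(state, token) -> state
        self.guide = guide
        self.state = guide.initial_state

    def __call__(self, logits: np.ndarray) -> np.ndarray:
        # Set disallowed tokens to -inf so their sampling probability is ~0.
        mask = np.full_like(logits, -np.inf)
        mask[self.guide.allowed_tokens(self.state)] = 0.0
        return logits + mask

    def update(self, sampled_token_id: int) -> None:
        # Advance the guide with the token the engine actually sampled.
        self.state = self.guide.next_state(self.state, sampled_token_id)
```

Supporting TensorRT-LLM would then mostly be a matter of hooking a processor like this into its per-step logits callback, the same way the vLLM integration does.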
-
We want to deploy https://huggingface.co/unsloth/Llama-3.2-1B-Instruct-bnb-4bit, which is a 4-bit quantized version of the Llama-3.2-1B model. It is quantized using bitsandbytes. Can we deploy this using ten…
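As far as I know, TensorRT-LLM cannot consume bitsandbytes checkpoints directly. One plausible path (a sketch, assuming a recent `transformers` release that provides `PreTrainedModel.dequantize()` for bnb models) is to dequantize back to fp16, save a plain Hugging Face checkpoint, and then re-quantize with TensorRT-LLM's own tooling:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "unsloth/Llama-3.2-1B-Instruct-bnb-4bit"

# Load the bnb-4bit checkpoint (requires bitsandbytes and a CUDA GPU).
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Dequantize back to a plain fp16 model; assumes a transformers version
# that added PreTrainedModel.dequantize() for bitsandbytes models.
model = model.dequantize()

model.save_pretrained("llama-3.2-1b-instruct-fp16")
tokenizer.save_pretrained("llama-3.2-1b-instruct-fp16")
# The saved directory can then be converted and quantized with
# TensorRT-LLM's own checkpoint conversion scripts
# (e.g. examples/llama/convert_checkpoint.py, then trtllm-build).
```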
-
## Goal
- [ ] Support Llama 3.1 in the main TensorRT-LLM engine formats
- [ ] Upload to HF
## User Requests
-
## Problem Description
When trying to use pipeline parallelism in tensorrt-llm on 2+ NVIDIA GPUs, I encounter `AssertionError: Expected but not provided tensors: {'transformer.vocab_embedding.weig…`
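In case it helps triage: since the traceback is truncated this is only an assumption about the root cause, but this assertion typically means the parallel mapping used at checkpoint-conversion/build time does not match the ranks at launch time; TensorRT-LLM requires `world_size == tp_size * pp_size` consistently across conversion, `trtllm-build`, and the `mpirun` launch. A minimal sketch with `tensorrt_llm.Mapping`:

```python
from tensorrt_llm import Mapping

# The same tp_size/pp_size must be used for checkpoint conversion,
# trtllm-build, and the MPI launch (mpirun -n <world_size> ...).
# Mapping raises if world_size != tp_size * pp_size.
mapping = Mapping(world_size=2, tp_size=1, pp_size=2)

# With pp_size=2, only the first pipeline rank owns the vocab embedding;
# if the engines and ranks disagree, loading fails with
# "Expected but not provided tensors: {'transformer.vocab_embedding.weight', ...}".
print(mapping.pp_rank, mapping.is_first_pp_rank())
```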
-
Hello, `0.15.0.dev2024101500` introduced a new issue when using the Executor API with Whisper:
```
[TensorRT-LLM][ERROR] IExecutionContext::inferShapes: Error Code 7: Internal Error (WhisperEncoder/__add_…
```
-
### System Info
- Built tensorrtllm_backend from source using dockerfile/Dockerfile.trt_llm_backend
- tensorrt_llm 0.13.0.dev2024081300
- tritonserver 2.48.0
- Triton image: 24.07
- CUDA 12.5
### Wh…
-