-
## Description
I'm trying to generate a calibration cache file for post-training quantization using Polygraphy.
For this, I created a custom input JSON file, referring to this [https://github.com/NVIDIA/…
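A minimal sketch of generating a calibration cache through Polygraphy's Python API (the model path, input name, and shape below are illustrative placeholders, and random data stands in for a real calibration set):

```python
import numpy as np
from polygraphy.backend.trt import (
    Calibrator,
    CreateConfig,
    EngineFromNetwork,
    NetworkFromOnnxPath,
)

# Placeholder input name/shape -- replace with your model's actual inputs.
def calib_data(num_batches=16):
    for _ in range(num_batches):
        yield {"input": np.random.rand(1, 3, 224, 224).astype(np.float32)}

# The cache file is written as a side effect of INT8 calibration.
calibrator = Calibrator(data_loader=calib_data(), cache="calib.cache")

build_engine = EngineFromNetwork(
    NetworkFromOnnxPath("model.onnx"),
    config=CreateConfig(int8=True, calibrator=calibrator),
)

engine = build_engine()  # triggers calibration and writes calib.cache
```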
-
Hi TensorRT-LLM team, your work is incredible.
By following the README for [multimodal models](https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/multimodal/README.md), we succeeded in running…
-
Hello, I want to deploy a quantized llama-3-8b model using tritonserver. I followed the steps below:
1. Create a container with the nvcr.io/nvidia/tritonserver:24.06-trtllm-python-py3 base image.
3.…
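As a sanity check once the server is up, something like this sketch with the tritonclient Python package can confirm the deployment (the model name "ensemble" is an assumption based on the usual tensorrtllm_backend model repository layout):

```python
import tritonclient.http as httpclient

# Default HTTP port for tritonserver is 8000.
client = httpclient.InferenceServerClient(url="localhost:8000")

print("server live: ", client.is_server_live())
print("server ready:", client.is_server_ready())
# "ensemble" is an assumed model name; adjust to your repository.
print("model ready: ", client.is_model_ready("ensemble"))
```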
-
## Description
I am trying to convert an ONNX model to INT8 with the latest TensorRT. I got the following error:
```
[05/19/2023-14:42:31] [E] Error[2]: Assertion getter(i) != 0 failed.
[05/19/2023-14…
```
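One way to narrow this down is to scan the model for zero-valued dimensions, which can trip non-zero assertions like this during parsing and building; a sketch using the onnx Python package (the model path is a placeholder):

```python
import onnx

model = onnx.load("model.onnx")  # placeholder path
onnx.checker.check_model(model)

# Run shape inference and report any tensor with a 0-valued dimension,
# a commonly reported culprit for assertions of this kind.
inferred = onnx.shape_inference.infer_shapes(model)
tensors = (
    list(inferred.graph.input)
    + list(inferred.graph.output)
    + list(inferred.graph.value_info)
)
for vi in tensors:
    dims = [
        d.dim_value if d.HasField("dim_value") else (d.dim_param or "?")
        for d in vi.type.tensor_type.shape.dim
    ]
    if 0 in dims:
        print(vi.name, dims)
```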
-
Opening a new issue as #237 was closed prematurely.
It seems that engines built using the `--paged_kv_cache` flag leak GPU memory. Below is a minimal reproducible example that can be used to …
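Not the repro itself, but a generic sketch of how the growth can be tracked across iterations with pynvml (the inference call is a stub; replace it with a real generation request against the paged-KV engine):

```python
import pynvml

def run_inference_once():
    # Placeholder: replace with one generation call against the
    # --paged_kv_cache engine under test.
    pass

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

def used_mib():
    return pynvml.nvmlDeviceGetMemoryInfo(handle).used / 2**20

baseline = used_mib()
for i in range(100):
    run_inference_once()
    print(f"iter {i}: used {used_mib() - baseline:+.1f} MiB vs. baseline")

pynvml.nvmlShutdown()
```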
-
## Description
I am using this [calibration script](https://github.com/rmccorm4/tensorrt-utils/tree/master/int8/calibration) to generate the calibration cache file for a Segformer ONNX model, but I am facing th…
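The linked script builds the cache through TensorRT's INT8 calibrator interface; a condensed sketch of that general pattern (the cache filename is illustrative, and pycuda provides the device buffer):

```python
import numpy as np
import pycuda.autoinit  # noqa: F401 -- creates a CUDA context on import
import pycuda.driver as cuda
import tensorrt as trt

class CacheCalibrator(trt.IInt8EntropyCalibrator2):
    """Feeds calibration batches to TensorRT and persists the scale cache."""

    def __init__(self, batches, cache_file="calibration.cache"):
        super().__init__()
        self.batches = iter(batches)
        self.cache_file = cache_file
        self.pending = next(self.batches)  # first batch sizes the buffer
        self.batch_size = self.pending.shape[0]
        self.device_input = cuda.mem_alloc(self.pending.nbytes)

    def get_batch_size(self):
        return self.batch_size

    def get_batch(self, names):
        batch = self.pending if self.pending is not None else next(self.batches, None)
        self.pending = None
        if batch is None:
            return None  # no more data: calibration finishes
        cuda.memcpy_htod(self.device_input, np.ascontiguousarray(batch))
        return [int(self.device_input)]

    def read_calibration_cache(self):
        try:
            with open(self.cache_file, "rb") as f:
                return f.read()
        except FileNotFoundError:
            return None

    def write_calibration_cache(self, cache):
        with open(self.cache_file, "wb") as f:
            f.write(cache)
```

Attached via `config.int8_calibrator`, TensorRT calls `get_batch` until it returns None and then writes the cache; on later runs `read_calibration_cache` lets it skip calibration entirely.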
-
### Search before asking
- [X] I have searched the Ultralytics YOLO [issues](https://github.com/ultralytics/ultralytics/issues) and found no similar bug report.
### Ultralytics YOLO Component
Expo…
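Assuming the component here is the exporter, the INT8 TensorRT export path in question sketched with placeholder files (yolov8n.pt and coco128.yaml are assumptions, not the exact files from this report):

```python
from ultralytics import YOLO

# Placeholder checkpoint; substitute the model from the report.
model = YOLO("yolov8n.pt")

# INT8 TensorRT export; `data` supplies the calibration dataset.
model.export(format="engine", int8=True, data="coco128.yaml")
```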
-
I followed the README section "Build model with both INT8 weight-only and INT8 KV cache enabled":
```
python convert_checkpoint.py --model_dir ./bloom/560m/ \
    --dtype float16 \
    …
```
-
Trying to run offline retinanet in a container with one NVIDIA GPU:
```
cm run script --tags=run-mlperf,inference,_find-performance,_full,_r4.1-dev --model=retinanet --implementation=nvidia …
```
-
I am using trtllm 0.8.0 (with MoE support added, following Llama's implementation). We serve models with trtllm_backend (Docker image triton-trtllm-24.02).
[qwen2-moe-57B-A14B](https://huggingface.co/Qwe…