-
Model under test: Llama-2-7b-chat-hf
Following the instructions [here](https://github.com/NVIDIA/TensorRT-LLM/tree/release/0.5.0/examples/llama#awq), I was able to quantize the model and build the engine…
-
python quantize.py --model_dir /qwen-14b-chat --dtype float16 --qformat int4_awq --export_path ./qwen_14b_4bit_gs128_awq.pt --calib_size 32
python build.py --hf_model_dir=/qwen-14b-chat/ --quant…
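For reference, the end-to-end INT4-AWQ flow in the linked release/0.5.0 llama example looks roughly like the sketch below. The paths are the ones from this report; the build.py plugin and weight-only flags are taken from the llama README and are an assumption for the Qwen example, so verify them against your checkout's `python build.py --help`.

```bash
# Step 1: calibrate and export the AWQ-scaled checkpoint.
python quantize.py --model_dir /qwen-14b-chat \
                   --dtype float16 \
                   --qformat int4_awq \
                   --export_path ./qwen_14b_4bit_gs128_awq.pt \
                   --calib_size 32

# Step 2: build the engine from the quantized checkpoint. The plugin and
# weight-only flags follow the release/0.5.0 llama AWQ example (assumed to
# carry over to Qwen); the output directory is a placeholder.
python build.py --hf_model_dir /qwen-14b-chat/ \
                --quant_ckpt_path ./qwen_14b_4bit_gs128_awq.pt \
                --dtype float16 \
                --use_gpt_attention_plugin float16 \
                --use_gemm_plugin float16 \
                --use_weight_only \
                --weight_only_precision int4_awq \
                --per_group \
                --output_dir ./qwen_14b_awq_engine
```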
-
### System Info
- CPU architecture: x86_64
- GPU name: NVIDIA A40, 46GB
- TensorRT-LLM: v0.9.0
- OS: Ubuntu 20.04
- NVIDIA driver: 535.54.03, CUDA: 12.2
### Who can help?
@kaiyux @byshiue…
-
I tried to convert RT-DETR-R18 from ONNX to TensorRT; it succeeded in INT8 but failed in FP16.
torch2onnx (static shapes): python tools/export_onnx.py
onnx2trt: ./trtexec --onnx=rtdetr.onnx --saveEngin…
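A minimal pair of trtexec invocations that exercises both precisions, plus the usual FP16 workaround, is sketched below; `rtdetr.onnx` and the engine/layer names are placeholders, and `--layerPrecisions` requires TensorRT 8.4 or newer.

```bash
# INT8 build (reported to succeed).
./trtexec --onnx=rtdetr.onnx --saveEngine=rtdetr_int8.engine --int8

# FP16 build with verbose logging to identify where it fails.
./trtexec --onnx=rtdetr.onnx --saveEngine=rtdetr_fp16.engine --fp16 --verbose

# If the failure is an FP16 overflow in specific layers, pin them to FP32.
# "/decoder/layers.0/MatMul" is a hypothetical layer name; take the real one
# from the verbose log above.
./trtexec --onnx=rtdetr.onnx --saveEngine=rtdetr_fp16.engine --fp16 \
          --precisionConstraints=obey \
          --layerPrecisions=/decoder/layers.0/MatMul:fp32
```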
-
I am trying to deploy a Baichuan2-7B model on a machine with 2 Tesla V100 GPUs. Unfortunately, each V100 has only 16GB of memory.
I have applied INT8 weight-only quantization, so the size of the engine I…
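Back-of-envelope: 7B parameters at 1 byte each after INT8 weight-only quantization is about 7GB of weights, which is tight on a 16GB V100 once activations and KV cache are added; splitting the engine across both GPUs with 2-way tensor parallelism halves that to roughly 3.5GB per GPU. A sketch, assuming the Baichuan example's build.py takes the same parallelism flags as the llama example of that era (check `--help`):

```bash
# Build a 2-way tensor-parallel INT8 weight-only engine. --world_size and
# --tp_size follow the llama example and are assumed for Baichuan; the
# paths are placeholders.
python build.py --model_dir ./Baichuan2-7B-Chat \
                --dtype float16 \
                --use_weight_only \
                --weight_only_precision int8 \
                --world_size 2 \
                --tp_size 2 \
                --output_dir ./baichuan2_7b_int8_tp2

# Launch one rank per GPU at inference time.
mpirun -n 2 python run.py --engine_dir ./baichuan2_7b_int8_tp2
```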
-
Thanks for this excellent project!
I can generate a bfloat16 model or an int8 weight-only model, but when I tried the following commands:
python ./examples/llama/build.py --model_dir ./Mixtral-8x7B-Inst…
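For comparison, the bfloat16 build that reportedly works would look something like the sketch below; the model path is a placeholder for the truncated one above, and the MoE flags (`--moe_num_experts`, `--moe_top_k`) are assumed from the MoE-enabled llama example and may be absent or renamed in other versions.

```bash
# Hypothetical known-good bfloat16 Mixtral build via the llama example;
# all paths and the MoE flags are assumptions, not taken from this report.
python ./examples/llama/build.py --model_dir ./Mixtral-8x7B-Instruct-v0.1 \
                                 --dtype bfloat16 \
                                 --use_gpt_attention_plugin bfloat16 \
                                 --use_gemm_plugin bfloat16 \
                                 --moe_num_experts 8 \
                                 --moe_top_k 2 \
                                 --output_dir ./mixtral_bf16_engine
```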
-
**What is your question?**
Hello, thanks for your project.
CUTLASS version: 2.10
Device: RTX 3090
I want to implement W4A4 conv quantization in tensorrt_llm with CUTLASS.
Follow the example and do…
-
## Description
## Environment
**TensorRT Version**: 8.5
**CUDA Version**: 11.4
**CUDNN Version**: 8.6
**Operating System**:
**Python Version (if applicable)**: 3.8.10
PyTorch …
-
## Description
I am trying to figure out whether TensorRT and the `pytorch_quantization` module support post-training quantization for vision transformers.
The following piece of code follows the `pyt…
-
### System Info
- GPU Name: T4 x2
- System RAM: 30GB
### Who can help?
_No response_
### Information
- [X] The official example scripts
- [ ] My own modified scripts
### Reproducti…