NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

Model not converted to dtype when quantizing, causes engine building issues #1078

Open · Broyojo opened this issue 7 months ago

Broyojo commented 7 months ago

I'm using the quantization script in examples/quantization to quantize Mistral 7B to int4_awq. Because Mistral 7B is stored in bfloat16, I have to pass bfloat16 as the dtype in the quantize.py arguments; the model is not converted to float16 even when float16 is specified. Then, when I run trtllm-build on the quantized model files, it fails, seemingly because trtllm-build does not work with bfloat16.
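
For context, a minimal sketch (using the Hugging Face transformers API directly, not the actual quantize.py code) of why the checkpoint stays in bfloat16: with torch_dtype="auto", from_pretrained takes the dtype from the checkpoint's own config, so a float16 value passed elsewhere on the command line never changes the loaded weights.

```python
# Minimal illustration (not the actual quantize.py code): with torch_dtype="auto",
# the model is loaded in the checkpoint's native dtype, which for Mistral 7B is bfloat16.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",
    torch_dtype="auto",  # resolves to torch.bfloat16 from the checkpoint config
)
print(model.dtype)  # torch.bfloat16, regardless of any --dtype float16 argument
```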

However, after modifying the quantize.py script to load the model explicitly with torch_dtype=dtype instead of torch_dtype="auto", I can quantize the model from float16 to int4_awq and trtllm-build works fine. I'm referencing this line here.
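
A sketch of the modification described above (hypothetical variable names; the real script derives the value from its own dtype argument): passing the requested dtype explicitly instead of "auto" forces the conversion to float16 at load time.

```python
# Sketch of the change described above (hypothetical names; the actual script
# parses its dtype argument and builds its own loading call).
import torch
from transformers import AutoModelForCausalLM

dtype = torch.float16  # value derived from the script's dtype argument

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",
    torch_dtype=dtype,  # explicit dtype instead of torch_dtype="auto"
)
print(model.dtype)  # torch.float16, so the quantized checkpoint builds cleanly
```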

I am using TensorRT-LLM version 0.9.0.dev2024020600 and running it in the nvidia/cuda:12.1.1-devel-ubuntu22.04 container. I've also cloned the main branch of this repo in order to run quantize.py and the other example scripts. Could there be issues with running a different TensorRT-LLM version than this repo expects, or is this the intended way to run TensorRT-LLM with the example scripts?

byshiue commented 7 months ago

Please follow the issue template to share the reproduction steps. Thank you for your cooperation.