NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

Model not converted to dtype when quantizing, causes engine building issues #1078

Open · Broyojo opened this issue 7 months ago

Broyojo commented 7 months ago

I'm using the quantization script in examples/quantization to quantize Mistral 7B to int4_awq. Because Mistral 7B is stored in bfloat16, I have to pass bfloat16 as the dtype in the quantize.py arguments; the model is not converted to float16 even when float16 is specified. Then, when I run trtllm-build on the quantized model files, it fails, seemingly because trtllm-build does not work with bfloat16.
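
For context, a minimal sketch (using the Hugging Face transformers API directly, not the actual quantize.py code) of why the checkpoint stays in bfloat16: with torch_dtype="auto", from_pretrained takes the dtype from the checkpoint's own config, so a float16 value passed elsewhere on the command line never changes the loaded weights.

```python
# Minimal illustration (not the actual quantize.py code): with torch_dtype="auto",
# the model is loaded in the checkpoint's native dtype, which for Mistral 7B is bfloat16.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",
    torch_dtype="auto",  # resolves to torch.bfloat16 from the checkpoint config
)
print(model.dtype)  # torch.bfloat16, regardless of any --dtype float16 argument
```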

However, after modifying the quantize.py script to load the model explicitly with torch_dtype=dtype instead of torch_dtype="auto", I can quantize the model from float16 to int4_awq and trtllm-build works fine. I'm referencing this line here.
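
A sketch of the modification described above (hypothetical variable names; the real script derives the value from its own dtype argument): passing the requested dtype explicitly instead of "auto" forces the conversion to float16 at load time.

```python
# Sketch of the change described above (hypothetical names; the actual script
# parses its dtype argument and builds its own loading call).
import torch
from transformers import AutoModelForCausalLM

dtype = torch.float16  # value derived from the script's dtype argument

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",
    torch_dtype=dtype,  # explicit dtype instead of torch_dtype="auto"
)
print(model.dtype)  # torch.float16, so the quantized checkpoint builds cleanly
```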

I am using TensorRT-LLM version 0.9.0.dev2024020600 and running it in the nvidia/cuda:12.1.1-devel-ubuntu22.04 container. I've also cloned the main branch of this repo in order to run quantize.py and the other example scripts. Could there be issues with running a different TensorRT-LLM version than this repo expects, or is this the intended way to run TensorRT-LLM with the example scripts?

byshiue commented 7 months ago

Please follow the issue template to share the reproduction steps. Thank you for your cooperation.