NVIDIA / TensorRT-Model-Optimizer

TensorRT Model Optimizer is a unified library of state-of-the-art model optimization techniques such as quantization, pruning, distillation, etc. It compresses deep learning models for downstream deployment frameworks like TensorRT-LLM or TensorRT to optimize inference speed on NVIDIA GPUs.
https://nvidia.github.io/TensorRT-Model-Optimizer

Cannot export model to the model_config #23

Open ashwin-js opened 5 months ago

ashwin-js commented 5 months ago

I am trying to quantize a finetuned Llama 3 model and export it to a TensorRT engine. The quantization step succeeds, but I cannot export to the TensorRT format because the model config export fails.

scripts/huggingface_example.sh --type llama --model $HF_PATH --quant int4_awq --tp 4

Output: Cannot export model to the model_config. The modelopt-optimized model state_dict (including the quantization factors) is saved to /app/TensorRT-Model-Optimizer/llm_ptq/saved_models_Gaja-v1_dense_int4_awq_tp4_pp1/modelopt_model.0.pth using torch.save for further inspection. Detailed export error: Weight shape is not divisible for block size for block quantization.
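For anyone debugging the same failure, here is a minimal sketch for inspecting the saved checkpoint mentioned in the log above. It assumes the file is a flat state_dict of tensors, that block quantization runs along the last (input) weight dimension, and that this dimension is the one sharded across TP ranks; adjust the path, TP size, and block size to your setup.

```python
# Sketch: find weights whose per-TP-rank shard is not a multiple of the
# int4_awq block size (128 by default). Assumptions noted in the lead-in.
import torch

CKPT = ("/app/TensorRT-Model-Optimizer/llm_ptq/"
        "saved_models_Gaja-v1_dense_int4_awq_tp4_pp1/modelopt_model.0.pth")
BLOCK_SIZE = 128  # default int4_awq block size
TP = 4            # tensor-parallel size used in the command above

state_dict = torch.load(CKPT, map_location="cpu")

for name, tensor in state_dict.items():
    if not torch.is_tensor(tensor) or tensor.dim() != 2:
        continue
    if not name.endswith("weight"):
        continue
    per_rank = tensor.shape[-1] // TP
    if per_rank % BLOCK_SIZE != 0:
        print(f"{name}: shape={tuple(tensor.shape)} -> "
              f"{per_rank} per rank, not divisible by {BLOCK_SIZE}")
```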

cjluo-omniml commented 5 months ago

For int4_awq, we require the weight shape (per TP rank) to be a multiple of the block size.
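Concretely (the dimensions below are just for illustration): with tp 4 and the default block size of 128, each rank's shard of the quantized weight dimension must itself be a multiple of 128.

```python
# Illustration of the constraint described above (hypothetical dimensions).
block_size = 128
tp = 4

# A dimension of 4096 works: 4096 / 4 = 1024, and 1024 % 128 == 0.
assert (4096 // tp) % block_size == 0

# A dimension of 3200 fails: 3200 / 4 = 800, and 800 % 128 == 32,
# which triggers "Weight shape is not divisible for block size".
assert (3200 // tp) % block_size != 0
```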

wangpeilin commented 3 months ago

You can try modifying the "awq_block_size" argument. I changed this parameter from 128 to 32 and it works.
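If you are calling the ModelOpt Python API directly instead of the shell script, something along these lines should have the same effect. The exact layout of INT4_AWQ_CFG is an assumption here and may differ between releases, so inspect the config before copying this.

```python
# Sketch: lower the AWQ block size so smaller per-rank weight shards stay
# divisible by it. The config structure below is an assumption; print
# mtq.INT4_AWQ_CFG in your installed version to confirm the keys.
import copy

import modelopt.torch.quantization as mtq

config = copy.deepcopy(mtq.INT4_AWQ_CFG)
# Assumed layout: block_sizes is a dict keyed by axis, with -1 -> 128.
config["quant_cfg"]["*weight_quantizer"]["block_sizes"][-1] = 32

# model = mtq.quantize(model, config, forward_loop=calibration_loop)
```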