TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
```
Traceback (most recent call last):
  File "/workspace/TensorRT-LLM/examples/bloom/convert_checkpoint.py", line 899, in <module>
    weights = convert_hf_bloom(
  File "/workspace/TensorRT-LLM/examples/bloom/convert_checkpoint.py", line 668, in convert_hf_bloom
    np.array([1.0 / int8_weights['scale_y_quant_orig']],
UnboundLocalError: local variable 'int8_weights' referenced before assignment
```
The code in convert_checkpoint.py shows that when `use_smooth_quant == False`, `int8_weights` is never assigned, so the reference to it at line 668 fails.
I followed the README:
```bash
# Build model with both INT8 weight-only and INT8 KV cache enabled
python convert_checkpoint.py --model_dir ./bloom/560m/ \
    --dtype float16 \
    --int8_kv_cache \
    --use_weight_only \
    --output_dir ./bloom/560m/trt_ckpt/int8/1-gpu/

trtllm-build --checkpoint_dir ./bloom/560m/trt_ckpt/int8/1-gpu/ \
    --use_gemm_plugin float16 \
    --use_gpt_attention_plugin float16 \
    --output_dir ./bloom/560m/trt_engines/int8/1-gpu/
```
and my script is:
```bash
python convert_checkpoint.py --model_dir ./Bloomz_QA+alpaca_gpt4_zh+lima_V3 \
    --dtype float16 \
    --int8_kv_cache \
    --use_weight_only \
    --output_dir ./Bloomz_QA+alpaca_gpt4_zh+lima_V3/trt_ckpt/int8/1-gpu/

trtllm-build --checkpoint_dir ./Bloomz_QA+alpaca_gpt4_zh+lima_V3//trt_ckpt/int8/1-gpu/ \
    --use_gemm_plugin float16 \
    --use_gpt_attention_plugin float16 \
    --output_dir ./Bloomz_QA+alpaca_gpt4_zh+lima_V3/trt_engines/int8/1-gpu/
```
and I got:
```
Traceback (most recent call last):
  File "/workspace/TensorRT-LLM/examples/bloom/convert_checkpoint.py", line 899, in <module>
    weights = convert_hf_bloom(
  File "/workspace/TensorRT-LLM/examples/bloom/convert_checkpoint.py", line 668, in convert_hf_bloom
    np.array([1.0 / int8_weights['scale_y_quant_orig']],
UnboundLocalError: local variable 'int8_weights' referenced before assignment
```
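The control-flow bug can be reproduced in isolation. The sketch below is a hypothetical simplification of `convert_hf_bloom` (the function body and the placeholder scale value are assumptions, not the actual TensorRT-LLM code): `int8_weights` is only assigned on the smooth-quant branch, but the KV-cache scale computation reads it unconditionally, so `--int8_kv_cache` without `--smoothquant` hits an `UnboundLocalError` exactly as in the traceback.

```python
import numpy as np

def convert_hf_bloom_sketch(use_smooth_quant: bool, int8_kv_cache: bool):
    """Hypothetical simplification of the buggy control flow in
    convert_hf_bloom; not the real implementation."""
    if use_smooth_quant:
        # Only the smooth-quant branch ever assigns int8_weights
        # (placeholder scale value for illustration).
        int8_weights = {'scale_y_quant_orig': 127.0}
    if int8_kv_cache:
        # With int8_kv_cache=True and use_smooth_quant=False this line
        # raises UnboundLocalError, matching the reported traceback.
        return np.array([1.0 / int8_weights['scale_y_quant_orig']],
                        dtype=np.float32)
    return None
```

A plausible fix (not necessarily the official one) would be to compute the KV-cache scaling factors independently of the smooth-quant branch, so that `--int8_kv_cache` combined with `--use_weight_only` has the scales it needs.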