[Open] shatealaboxiaowang opened this issue 6 months ago
Hi @shatealaboxiaowang, I think this bug was fixed in a previous version. Did you run this with v0.6.1?
Yes, I ran it with v0.6.1, but it still gives this error.
@shatealaboxiaowang The code change is in v0.7.0. Please try with v0.7.0.
Thank you for your reply. I still have the same problem after changing the code on top of 0.6.1 according to https://github.com/NVIDIA/TensorRT-LLM/blob/v0.7.0/examples/llama/build.py#L557-L560, but I will try 0.7.0 soon.
I'm experiencing the same issue with 0.8.0.dev2024011601 when building an engine for a quantized Starling-LM-7B-alpha.
$ python ../quantization/quantize.py --model_dir ~/.cache/huggingface/hub/models--berkeley-nest--Starling-LM-7B-alpha/snapshots/f721e85293598f2ef774e483ae95343e39811577 \
--dtype float16 \
--qformat int4_awq \
--export_path ./quantized_int4-awq \
--calib_size 32
$ python3 build.py \
--model_dir ~/.cache/huggingface/hub/models--berkeley-nest--Starling-LM-7B-alpha/snapshots/f721e85293598f2ef774e483ae95343e39811577 \
--quant_ckpt_path ./quantized_int4-awq/llama_tp1_rank0.npz \
--dtype float16 \
--remove_input_padding \
--use_gpt_attention_plugin float16 \
--enable_context_fmha \
--use_gemm_plugin float16 \
--output_dir ./tmp/starling/7B/trt_engines/int4_AWQ/1-gpu/ \
--use_weight_only \
--weight_only_precision int4_awq \
--per_group \
--use_inflight_batching
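(Not from the thread, but as a debugging step before re-running build.py: one way to rule out a malformed quantized checkpoint is to list the tensor names stored in the exported .npz file and check that the expected per-group AWQ tensors are present. A minimal sketch; `list_npz_keys` is a hypothetical helper, and in practice you would point it at `./quantized_int4-awq/llama_tp1_rank0.npz` instead of the stand-in file created here.)

```python
import os
import tempfile

import numpy as np


def list_npz_keys(path):
    """Return the sorted tensor names stored in a .npz checkpoint."""
    with np.load(path) as data:
        return sorted(data.files)


# Demonstrate on a tiny stand-in file; replace this path with the real
# checkpoint, e.g. ./quantized_int4-awq/llama_tp1_rank0.npz
tmp = os.path.join(tempfile.mkdtemp(), "demo.npz")
np.savez(tmp, **{"layers.0.weight": np.zeros((2, 2), dtype=np.float16)})
print(list_npz_keys(tmp))  # -> ['layers.0.weight']
```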
Is this issue still open?
I used AWQ to quantize CodeLlama-13b to an npz checkpoint and then tried to build it into TensorRT format, but encountered this error. My command was as follows:
python build.py --model_dir /app/models/CodeLlama-13b-hf/ \
  --quant_ckpt_path /app/models/quant/CodeLlama-13B-int4-awq-for-tensorrt/llama_tp1_rank0.npz \
  --dtype float16 \
  --remove_input_padding \
  --use_gpt_attention_plugin float16 \
  --enable_context_fmha \
  --use_gemm_plugin float16 \
  --use_weight_only \
  --use_inflight_batching \
  --weight_only_precision int4_awq \
  --per_group \
  --output_dir /home/models/tensorRT-engines/CodeLlama-13b-AWQ_1-gpu/ \
  --rotary_base 1000000 \
  --vocab_size 32016 \
  --world_size 1