NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

AssertionError: The value updated is not the same shape as the original. Updated: (32016, 5120), original: (32000, 5120) #713

Open shatealaboxiaowang opened 6 months ago

shatealaboxiaowang commented 6 months ago

I used awq to build the codellama-13b quantized npz model file to tensorrt format, but encountered this error. My command was as follows:

python build.py --model_dir /app/models/CodeLlama-13b-hf/ \
    --quant_ckpt_path /app/models/quant/CodeLlama-13B-int4-awq-for-tensorrt/llama_tp1_rank0.npz \
    --dtype float16 \
    --remove_input_padding \
    --use_gpt_attention_plugin float16 \
    --enable_context_fmha \
    --use_gemm_plugin float16 \
    --use_weight_only \
    --use_inflight_batching \
    --weight_only_precision int4_awq \
    --per_group \
    --output_dir /home/models/tensorRT-engines/CodeLlama-13b-AWQ_1-gpu/ \
    --rotary_base 1000000 \
    --vocab_size 32016 \
    --world_size 1
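
The shapes in the error point to a vocabulary-size mismatch: CodeLlama's vocabulary is padded to 32016 tokens, while one side of the weight assignment has only 32000 rows. A quick way to see which tensors in the quantized checkpoint carry the vocabulary axis (a minimal diagnostic sketch, using the checkpoint path from the command above; tensor names vary with the exporter version):

import numpy as np

# Load the AWQ checkpoint produced by the quantization step.
ckpt = np.load("/app/models/quant/CodeLlama-13B-int4-awq-for-tensorrt/llama_tp1_rank0.npz")
for name in ckpt.files:
    shape = ckpt[name].shape
    # Report only tensors with a vocab-sized axis (32000 = base LLaMA,
    # 32016 = CodeLlama's padded vocabulary).
    if 32000 in shape or 32016 in shape:
        print(name, shape)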

Tracin commented 6 months ago

Hi @shatealaboxiaowang, I think this bug was fixed in a previous version. Did you run this with v0.6.1?

shatealaboxiaowang commented 6 months ago

> Hi @shatealaboxiaowang, I think this bug was fixed in a previous version. Did you run this with v0.6.1?

Yes, I ran it with v0.6.1, but it still gives this error.
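
Since version confusion recurs in this thread, one quick way to confirm which TensorRT-LLM version is actually installed in the environment (the package exposes a version string):

python3 -c "import tensorrt_llm; print(tensorrt_llm.__version__)"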

Tracin commented 6 months ago

@shatealaboxiaowang The code change is in the v0.7.0 code. Please try with v0.7.0.

shatealaboxiaowang commented 6 months ago

> @shatealaboxiaowang The code change is in the v0.7.0 code. Please try with v0.7.0.

Thank you for your reply. I still have the same problem after changing the code on top of 0.6.1 according to https://github.com/NVIDIA/TensorRT-LLM/blob/v0.7.0/examples/llama/build.py#L557-L560, but I will try 0.7.0 soon.
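
For context, the general idea behind such a fix is to align the checkpoint's vocabulary-axis tensors with the padded vocabulary the engine is built with. A minimal sketch of that idea (not the actual v0.7.0 change; the leading-dimension heuristic below is an assumption and may need adjusting to a given checkpoint layout) zero-pads the affected rows and re-saves the checkpoint:

import numpy as np

CKPT = "/app/models/quant/CodeLlama-13B-int4-awq-for-tensorrt/llama_tp1_rank0.npz"
OLD_VOCAB, NEW_VOCAB = 32000, 32016  # base LLaMA vs. CodeLlama's padded vocab

data = dict(np.load(CKPT))
for name, arr in data.items():
    # Heuristic: treat any 2-D tensor whose leading dimension equals the
    # old vocabulary size as a vocabulary-axis tensor (embedding / lm_head).
    if arr.ndim == 2 and arr.shape[0] == OLD_VOCAB:
        pad = np.zeros((NEW_VOCAB - OLD_VOCAB, arr.shape[1]), dtype=arr.dtype)
        data[name] = np.concatenate([arr, pad], axis=0)
        print(f"padded {name}: {arr.shape} -> {data[name].shape}")

np.savez(CKPT.replace(".npz", "_padded.npz"), **data)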

Broyojo commented 5 months ago

I'm experiencing this same issue with 0.8.0.dev2024011601 when building an engine for quantized Starling 7B alpha.

$ python ../quantization/quantize.py --model_dir ~/.cache/huggingface/hub/models--berkeley-nest--Starling-LM-7B-alpha/snapshots/f721e85293598f2ef774e483ae95343e39811577 \
                                         --dtype float16 \
                                         --qformat int4_awq \
                                         --export_path ./quantized_int4-awq \
                                         --calib_size 32

$ python3 build.py \
    --model_dir ~/.cache/huggingface/hub/models--berkeley-nest--Starling-LM-7B-alpha/snapshots/f721e85293598f2ef774e483ae95343e39811577 \
    --quant_ckpt_path ./quantized_int4-awq/llama_tp1_rank0.npz \
    --dtype float16 \
    --remove_input_padding \
    --use_gpt_attention_plugin float16 \
    --enable_context_fmha \
    --use_gemm_plugin float16 \
    --output_dir ./tmp/starling/7B/trt_engines/int4_AWQ/1-gpu/ \
    --use_weight_only \
    --weight_only_precision int4_awq \
    --per_group \
    --use_inflight_batching
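
Before rebuilding, it can help to confirm that the vocab_size declared in the model's config matches what quantize.py actually exported (a minimal pre-flight sketch using the paths from the commands above; taking the largest leading dimension as the exported vocabulary size is an assumption):

import json
import os
import numpy as np

MODEL_DIR = os.path.expanduser(
    "~/.cache/huggingface/hub/models--berkeley-nest--Starling-LM-7B-alpha/"
    "snapshots/f721e85293598f2ef774e483ae95343e39811577")

# Vocabulary size declared by the Hugging Face config.
with open(os.path.join(MODEL_DIR, "config.json")) as f:
    hf_vocab = json.load(f)["vocab_size"]

# Largest leading dimension among 2-D tensors in the quantized checkpoint,
# taken as a proxy for the exported vocabulary size.
ckpt = np.load("./quantized_int4-awq/llama_tp1_rank0.npz")
ckpt_vocab = max(ckpt[n].shape[0] for n in ckpt.files if ckpt[n].ndim == 2)

print(f"config.json vocab_size: {hf_vocab}, checkpoint rows: {ckpt_vocab}")
if hf_vocab != ckpt_vocab:
    print("Mismatch: pass a matching --vocab_size to build.py or pad the checkpoint.")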

aiiAtelier commented 2 weeks ago

Is this issue still open?