NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

Medusa Weight Only Quantize crash #1845

Closed · skyCreateXian closed this issue 2 days ago

skyCreateXian commented 2 days ago

System Info

Compiling the engine with the following commands on the latest code branch results in a crash.

Build engine script:

```bash
python convert_checkpoint.py --model_dir vicuna-7b-v1.3 \
    --medusa_model_dir lm_head \
    --output_dir ./tllm_checkpoint_1gpu_medusa \
    --dtype float16 \
    --use_weight_only \
    --num_medusa_heads 4

trtllm-build --checkpoint_dir ./tllm_checkpoint_1gpu_medusa \
    --output_dir ./medusa-engine \
    --gemm_plugin float16 \
    --speculative_decoding_mode medusa \
    --max_batch_size 8
```

When running `trtllm-build`, a crash occurs with the following message:

```
[06/26/2024-12:35:00] [TRT-LLM] [W] remove_input_padding is enabled, while opt_num_tokens is not set, setting to max_batch_size*max_beam_width.

Traceback (most recent call last):
  File "/usr/local/bin/trtllm-build", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/build.py", line 492, in main
    parallel_build(source, build_config, args.output_dir, workers,
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/build.py", line 365, in parallel_build
    passed = build_and_save(rank, rank % workers, ckpt_dir,
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/build.py", line 324, in build_and_save
    engine = build_model(build_config,
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/build.py", line 296, in build_model
    model = model_cls.from_checkpoint(ckpt_dir, config=rank_config)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/modeling_utils.py", line 430, in from_checkpoint
    model.load(weights, from_pruned=is_checkpoint_pruned)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/modeling_utils.py", line 444, in load
    raise RuntimeError(
RuntimeError: Required but not provided tensors:{'medusa_heads.0.lm_head.per_channel_scale', 'medusa_heads.2.lm_head.per_channel_scale', 'medusa_heads.3.lm_head.per_channel_scale', 'medusa_heads.1.lm_head.per_channel_scale'}
```
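The missing tensors come from TensorRT-LLM's weight-only checkpoint format: each quantized linear layer must carry both its int8 weights and one dequantization scale per output channel, and `model.load` rejects a checkpoint in which any required tensor is absent. As a minimal sketch of what a `per_channel_scale` is, assuming symmetric int8 weight-only quantization (the helper below is illustrative, not the convert script's actual code):

```python
import torch

def quantize_weight_per_channel(weight: torch.Tensor):
    """Symmetric int8 weight-only quantization with one scale per output channel.

    weight: [out_features, in_features] fp16/fp32 tensor.
    Returns (int8_weight, per_channel_scale). Illustrative only.
    """
    w = weight.float()
    # The largest magnitude in each output row determines that row's scale.
    max_abs = w.abs().amax(dim=1, keepdim=True)            # [out_features, 1]
    per_channel_scale = (max_abs / 127.0).clamp(min=1e-6)  # guard all-zero rows
    int8_weight = torch.round(w / per_channel_scale).to(torch.int8)
    return int8_weight, per_channel_scale.squeeze(1).to(weight.dtype)
```

The error therefore suggests that `convert_checkpoint.py` quantized the base model's layers but emitted the Medusa heads' `lm_head` weights without the matching scales.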

Who can help?

@ncomly-nvidia How do I build the Medusa and base models with weight-only quantization?

Information

Tasks

Reproduction

  1. Install the latest version, v0.11.0.dev.20240625
  2. Prepare the base model and the Medusa heads
  3. Add the weight-only quantization options and build, as in the commands above

Expected behavior

The build should complete without crashing.

actual behavior

A crash occurs during the `trtllm-build` phase.

additional notes

None

nv-guomingz commented 2 days ago

Could you please share the content of your checkpoint's config.json file?

skyCreateXian commented 2 days ago

@nv-guomingz Here is the checkpoint config.json: config.json

nv-guomingz commented 2 days ago

Got it. Let me try to reproduce it on my side.

nv-guomingz commented 2 days ago

[Screenshot of a suggested code change, not reproduced here.]

Please modify your code base as shown in the screenshot above. We'll merge the fix in the next weekly update.
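Since the screenshot is not reproduced here, the following is a hedged sketch of the general shape of such a fix, not the actual patch: route each Medusa head's `lm_head` weight through the same weight-only path used for the base model's linear layers, so that the `medusa_heads.N.lm_head.per_channel_scale` tensors are emitted. The helper name is hypothetical, and it assumes the `torch.ops.trtllm.symmetric_quantize_last_axis_of_batched_matrix` op that the example convert scripts use for other weight-only layers:

```python
import torch
import tensorrt_llm  # noqa: F401  (importing registers the torch.ops.trtllm custom ops)

def export_medusa_lm_head(weights: dict, head_idx: int,
                          head_weight: torch.Tensor, use_weight_only: bool):
    """Hypothetical helper: emit one Medusa head's lm_head weights,
    adding per-channel scales when weight-only quantization is enabled."""
    prefix = f'medusa_heads.{head_idx}.lm_head'
    if use_weight_only:
        # Same preprocessing the converter applies to other quantized layers:
        # packed int8 weights plus one dequantization scale per output channel.
        w, scales = torch.ops.trtllm.symmetric_quantize_last_axis_of_batched_matrix(
            head_weight.t().contiguous(), torch.int8)
        weights[f'{prefix}.weight'] = w
        weights[f'{prefix}.per_channel_scale'] = scales
    else:
        weights[f'{prefix}.weight'] = head_weight
```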

skyCreateXian commented 2 days ago

@nv-guomingz Weight-only quantization is working normally now, thank you for your support.