NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

Medusa Weight Only Quantize crash #1845

Closed · skyCreateXian closed this issue 2 days ago

skyCreateXian commented 2 days ago

System Info

Compiling the engine with the following commands on the latest code branch results in a crash.

Build engine script:

```bash
python convert_checkpoint.py --model_dir vicuna-7b-v1.3 \
    --medusa_model_dir lm_head \
    --output_dir ./tllm_checkpoint_1gpu_medusa \
    --dtype float16 \
    --use_weight_only \
    --num_medusa_heads 4

trtllm-build --checkpoint_dir ./tllm_checkpoint_1gpu_medusa \
    --output_dir ./medusa-engine \
    --gemm_plugin float16 \
    --speculative_decoding_mode medusa \
    --max_batch_size 8
```

When running `trtllm-build`, a crash occurs with the following message:

```
[06/26/2024-12:35:00] [TRT-LLM] [W] remove_input_padding is enabled, while opt_num_tokens is not set, setting to max_batch_size*max_beam_width.

Traceback (most recent call last):
  File "/usr/local/bin/trtllm-build", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/build.py", line 492, in main
    parallel_build(source, build_config, args.output_dir, workers,
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/build.py", line 365, in parallel_build
    passed = build_and_save(rank, rank % workers, ckpt_dir,
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/build.py", line 324, in build_and_save
    engine = build_model(build_config,
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/build.py", line 296, in build_model
    model = model_cls.from_checkpoint(ckpt_dir, config=rank_config)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/modeling_utils.py", line 430, in from_checkpoint
    model.load(weights, from_pruned=is_checkpoint_pruned)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/modeling_utils.py", line 444, in load
    raise RuntimeError(
RuntimeError: Required but not provided tensors:{'medusa_heads.0.lm_head.per_channel_scale', 'medusa_heads.2.lm_head.per_channel_scale', 'medusa_heads.3.lm_head.per_channel_scale', 'medusa_heads.1.lm_head.per_channel_scale'}
```
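The missing tensors come from TensorRT-LLM's weight-only checkpoint format: each quantized linear layer must carry both its int8 weights and one dequantization scale per output channel, and `model.load` rejects a checkpoint in which any required tensor is absent. As a minimal sketch of what a `per_channel_scale` is, assuming symmetric int8 weight-only quantization (the helper below is illustrative, not the convert script's actual code):

```python
import torch

def quantize_weight_per_channel(weight: torch.Tensor):
    """Symmetric int8 weight-only quantization with one scale per output channel.

    weight: [out_features, in_features] fp16/fp32 tensor.
    Returns (int8_weight, per_channel_scale). Illustrative only.
    """
    w = weight.float()
    # The largest magnitude in each output row determines that row's scale.
    max_abs = w.abs().amax(dim=1, keepdim=True)            # [out_features, 1]
    per_channel_scale = (max_abs / 127.0).clamp(min=1e-6)  # guard all-zero rows
    int8_weight = torch.round(w / per_channel_scale).to(torch.int8)
    return int8_weight, per_channel_scale.squeeze(1).to(weight.dtype)
```

The error therefore suggests that `convert_checkpoint.py` quantized the base model's layers but emitted the Medusa heads' `lm_head` weights without the matching scales.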

Who can help?

@ncomly-nvidia How do I build the Medusa and base models with weight-only quantization?

Information

Tasks

Reproduction

  1. Install the latest version, v0.11.0.dev.20240625
  2. Prepare the base model and the Medusa heads
  3. Add the weight-only quantization options and build, as in the commands above

Expected behavior

The build should complete without crashing.

actual behavior

A crash occurs during the `trtllm-build` phase.

additional notes

None

nv-guomingz commented 2 days ago

Could you please share the content of your checkpoint's config.json file?

skyCreateXian commented 2 days ago

@nv-guomingz Here is the checkpoint config.json: config.json

nv-guomingz commented 2 days ago

Got it. Let me try to reproduce it on my side.

nv-guomingz commented 2 days ago

[Screenshot of a suggested code change, not reproduced here.]

Please modify your code base as shown in the screenshot above. We'll merge the fix in the next weekly update.
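Since the screenshot is not reproduced here, the following is a hedged sketch of the general shape of such a fix, not the actual patch: route each Medusa head's `lm_head` weight through the same weight-only path used for the base model's linear layers, so that the `medusa_heads.N.lm_head.per_channel_scale` tensors are emitted. The helper name is hypothetical, and it assumes the `torch.ops.trtllm.symmetric_quantize_last_axis_of_batched_matrix` op that the example convert scripts use for other weight-only layers:

```python
import torch
import tensorrt_llm  # noqa: F401  (importing registers the torch.ops.trtllm custom ops)

def export_medusa_lm_head(weights: dict, head_idx: int,
                          head_weight: torch.Tensor, use_weight_only: bool):
    """Hypothetical helper: emit one Medusa head's lm_head weights,
    adding per-channel scales when weight-only quantization is enabled."""
    prefix = f'medusa_heads.{head_idx}.lm_head'
    if use_weight_only:
        # Same preprocessing the converter applies to other quantized layers:
        # packed int8 weights plus one dequantization scale per output channel.
        w, scales = torch.ops.trtllm.symmetric_quantize_last_axis_of_batched_matrix(
            head_weight.t().contiguous(), torch.int8)
        weights[f'{prefix}.weight'] = w
        weights[f'{prefix}.per_channel_scale'] = scales
    else:
        weights[f'{prefix}.weight'] = head_weight
```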

skyCreateXian commented 2 days ago

@nv-guomingz Weight-only quantization is working normally now, thank you for your support.