Closed skyCreateXian closed 2 days ago
Could you please share the content of your checkpoint's config.json file?
@nv-guomingz Here is the checkpoint config.json: config.json
Got it. Let me try to reproduce it on my side.
Please modify your code base as above. We'll merge the fix in the next weekly update.
@nv-guomingz Quantization is working normally now, thank you for your support.
System Info
Building the engine with the following commands on the latest code branch results in a crash. Build engine script:

```
python convert_checkpoint.py --model_dir vicuna-7b-v1.3 \
    --medusa_model_dir lm_head \
    --output_dir ./tllm_checkpoint_1gpu_medusa \
    --dtype float16 \
    --use_weight_only \
    --num_medusa_heads 4

trtllm-build --checkpoint_dir ./tllm_checkpoint_1gpu_medusa \
    --output_dir ./medusa-engine \
    --gemm_plugin float16 \
    --speculative_decoding_mode medusa \
    --max_batch_size 8
```
Running `trtllm-build` crashes with the following message:
```
[06/26/2024-12:35:00] [TRT-LLM] [W] remove_input_padding is enabled, while opt_num_tokens is not set, setting to max_batch_size*max_beam_width.
Traceback (most recent call last):
  File "/usr/local/bin/trtllm-build", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/build.py", line 492, in main
    parallel_build(source, build_config, args.output_dir, workers,
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/build.py", line 365, in parallel_build
    passed = build_and_save(rank, rank % workers, ckpt_dir,
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/build.py", line 324, in build_and_save
    engine = build_model(build_config,
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/build.py", line 296, in build_model
    model = model_cls.from_checkpoint(ckpt_dir, config=rank_config)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/modeling_utils.py", line 430, in from_checkpoint
    model.load(weights, from_pruned=is_checkpoint_pruned)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/modeling_utils.py", line 444, in load
    raise RuntimeError(
RuntimeError: Required but not provided tensors: {'medusa_heads.0.lm_head.per_channel_scale', 'medusa_heads.2.lm_head.per_channel_scale', 'medusa_heads.3.lm_head.per_channel_scale', 'medusa_heads.1.lm_head.per_channel_scale'}
```
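For context on what the missing tensors represent: with `--use_weight_only`, each linear layer's weights are stored quantized together with one scale per output channel, and the error says those scale tensors are absent for the Medusa heads' `lm_head` layers. The sketch below is an illustrative, self-contained NumPy example of per-channel INT8 weight-only quantization; it is not TensorRT-LLM's actual implementation, and the function name is hypothetical.

```python
import numpy as np

def quantize_weight_only_per_channel(weight: np.ndarray):
    """Illustrative INT8 weight-only quantization with per-channel scales.

    `weight` has shape (out_features, in_features); one scale is computed
    per output channel. This is the kind of `per_channel_scale` tensor the
    error above reports missing for each `medusa_heads.N.lm_head`.
    (Hypothetical helper, not TensorRT-LLM's real API.)
    """
    # Max absolute value per output channel, guarding against division by zero.
    amax = np.maximum(np.abs(weight).max(axis=1, keepdims=True), 1e-8)
    per_channel_scale = amax / 127.0
    # Round to the nearest INT8 code in [-127, 127].
    qweight = np.clip(np.round(weight / per_channel_scale), -127, 127).astype(np.int8)
    return qweight, per_channel_scale.squeeze(1)

# Dequantization approximately recovers the original weights.
w = np.random.randn(8, 16).astype(np.float32)
qw, scale = quantize_weight_only_per_channel(w)
w_hat = qw.astype(np.float32) * scale[:, None]
```

Both the quantized weights and the per-channel scales must be written into the checkpoint; a converter that quantizes only the base model but skips the Medusa heads would produce exactly the missing-tensor error shown above.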
Who can help?
@ncomly-nvidia How do I compile the Medusa heads and the base model with weight-only quantization?
Expected behavior
The engine build should complete without a crash.
Actual behavior
A crash occurs during the `trtllm-build` phase.
Additional notes
None