NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

Qwen2 1.5B checkpoint conversion broken #1785

Closed: yaysummeriscoming closed this issue 2 months ago

yaysummeriscoming commented 3 months ago

System Info

Running TRTLLM 0.10, container built from nvidia/cuda:12.4.0-devel-ubuntu22.04

Who can help?

@byshiue

Reproduction

git clone https://huggingface.co/Qwen/Qwen2-1.5B-Instruct ./tmp/Qwen2/1.5B

CUDA_VISIBLE_DEVICES=$GPU_ID python3 convert_checkpoint.py \
    --qwen_type qwen2 \
    --model_dir ./tmp/Qwen2/1.5B \
    --dtype float16 \
    --output_dir ./tmp/Qwen2/1.5B/converted

Expected behavior

Conversion completes successfully.

Actual behavior

  File "convert_checkpoint_qwen.py", line 373, in <module>
    main()
  File "convert_checkpoint_qwen.py", line 365, in main
    convert_and_save_hf(args)
  File "convert_checkpoint_qwen.py", line 327, in convert_and_save_hf
    execute(args.workers, [convert_and_save_rank] * world_size, args)
  File "convert_checkpoint_qwen.py", line 333, in execute
    f(args, rank)
  File "convert_checkpoint_qwen.py", line 313, in convert_and_save_rank
    qwen = from_hugging_face(
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/qwen/convert.py", line 1134, in from_hugging_face
    weights = load_weights_from_hf(config=config,
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/qwen/convert.py", line 1242, in load_weights_from_hf
    weights = convert_hf_qwen(
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/qwen/convert.py", line 980, in convert_hf_qwen
    lm_head_weights = get_weight(model_params, 'lm_head', dtype)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/qwen/convert.py", line 455, in get_weight
    if config[prefix + '.weight'].dtype != dtype:
KeyError: 'lm_head.weight'

Additional notes

Qwen2 7B conversion works for me, but 1.5B conversion is broken.
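
My guess at the cause (unconfirmed): the 1.5B checkpoint ties its input embedding and LM head (tie_word_embeddings: true in its config.json), so the safetensors file contains no standalone lm_head.weight entry, which is exactly the key the converter looks up. The 7B model keeps the two weights separate, which would explain why it converts fine. A quick way to check (paths from the repro above; assumes the 1.5B checkpoint is a single shard):

from safetensors import safe_open
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("./tmp/Qwen2/1.5B")
print(cfg.tie_word_embeddings)  # True -> embedding and lm_head share one tensor

with safe_open("./tmp/Qwen2/1.5B/model.safetensors", framework="pt") as f:
    print("lm_head.weight" in f.keys())  # False -> only the tied embedding is stored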

Side question: there's no pre-built TensorRT-LLM development container, right? We need to build it ourselves?

nv-guomingz commented 3 months ago

This is a known issue in 0.10.0; please try the latest main branch or https://pypi.org/project/tensorrt-llm/0.11.0.dev2024061100/
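
(For anyone else landing here: installing that dev wheel looks something like the command below; depending on your setup you may also need NVIDIA's extra index for the TensorRT dependencies.)

pip3 install tensorrt-llm==0.11.0.dev2024061100 --extra-index-url https://pypi.nvidia.com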

yaysummeriscoming commented 3 months ago

That did the trick, thank you!

yaysummeriscoming commented 3 months ago

I'm afraid it wasn't fixed for weight-only quantization:

git clone https://huggingface.co/Qwen/Qwen2-1.5B-Instruct ./tmp/Qwen2/1.5B

CUDA_VISIBLE_DEVICES=$GPU_ID python3 convert_checkpoint.py \
    --qwen_type qwen2 \
    --model_dir ./tmp/Qwen2/1.5B \
    --dtype float16 \
    --output_dir ./tmp/Qwen2/1.5B/converted \
    --use_weight_only \
    --weight_only_precision int8

Gives:

  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/qwen/convert.py", line 1128, in from_hugging_face
    qwen.load(weights)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/modeling_utils.py", line 438, in load
    raise RuntimeError(
RuntimeError: Required but not provided tensors:{'transformer.vocab_embedding.per_token_scale'}
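
For context on what the loader is asking for: weight-only int8 stores each weight matrix as int8 values plus floating-point scales, and those scale tensors ship alongside the weights in the converted checkpoint; here the conversion apparently never emits the scale expected for the (tied) vocab embedding. A rough illustrative sketch of the storage scheme (generic absmax quantization; names and layout are not TRT-LLM internals):

import torch

def weight_only_int8(w: torch.Tensor):
    # per-output-channel absmax scale, the usual weight-only recipe
    scale = w.abs().amax(dim=1, keepdim=True) / 127.0
    q = torch.clamp(torch.round(w / scale), -128, 127).to(torch.int8)
    return q, scale.squeeze(1)  # both tensors get saved in the checkpoint

w = torch.randn(16, 8)
q, scale = weight_only_int8(w)
w_hat = q.float() * scale[:, None]  # dequantized on the fly at inference
print((w - w_hat).abs().max())      # small quantization error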

yaysummeriscoming commented 3 months ago

@nv-guomingz following up on this?

nv-guomingz commented 3 months ago

Hi @yaysummeriscoming thanks for your patience.

Based on my local testing, the latest code base has resolved this issue. We'll have a weekly update out soon; please try it then.

yaysummeriscoming commented 2 months ago

That did the trick, thank you!

white-wolf-tech commented 2 months ago

When I use W8A8 with the 1.5B model on the release branch v0.11.0, an error occurs:

" quantize(args.dtype, File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/qwen/convert.py", line 1379, in quantize safetensors.torch.save_file( File "/usr/local/lib/python3.10/dist-packages/safetensors/torch.py", line 284, in save_file serialize_file(_flatten(tensors), filename, metadata=metadata) File "/usr/local/lib/python3.10/dist-packages/safetensors/torch.py", line 480, in _flatten raise RuntimeError( RuntimeError: Some tensors share memory, this will lead to duplicate memory on disk and potential differences when loading them again: [{'transformer.vocab_embedding.weight', 'lm_head.weight'}]. A potential way to correctly save your model is to use save_model. More information at https://huggingface.co/docs/safetensors/torch_shared_tensors"

The model weights cannot be saved. Has anyone encountered this situation?
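
The error message itself points at the root cause: safetensors refuses to serialize two dict entries that alias the same storage, which is exactly what a tied vocab_embedding / lm_head pair is. A minimal standalone reproduction with dummy shapes (nothing TRT-LLM-specific), plus the usual workaround of breaking the aliasing before saving:

import torch
from safetensors.torch import save_file

emb = torch.randn(8, 4)  # dummy stand-in for the shared embedding matrix
tensors = {
    "transformer.vocab_embedding.weight": emb,
    "lm_head.weight": emb,  # tied: aliases the same storage -> save_file raises
}
try:
    save_file(tensors, "ckpt.safetensors")
except RuntimeError as e:
    print(e)  # "Some tensors share memory ..."

# Workaround: clone one of the tensors so the entries no longer share storage
# (save_model, suggested in the error message, does this bookkeeping for nn.Modules)
tensors["lm_head.weight"] = emb.clone()
save_file(tensors, "ckpt.safetensors")  # succeeds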