NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

Tensor parallel fails but pipeline parallel succeeds for Qwen2 during checkpoint conversion #1662

Open 2019zhou opened 4 months ago

2019zhou commented 4 months ago

System Info

Who can help?

@T

Information

Tasks

Reproduction

My command to convert the checkpoint failed:

python3 convert_checkpoint.py \
    --model_dir         ./Qwen1.5-72B-Chat-GPTQ-Int4/ \
    --output_dir        ./tllm_checkpoint_2gpu_gptq/ \
    --dtype float16 \
    --use_weight_only \
    --weight_only_precision int4_gptq \
    --per_group \
    --load_model_on_cpu \
    --tp_size 2 \
    --n_head 64 \
    --n_kv_head 64 \
    --qwen_type qwen2

with the error log:

ValueError: You are trying to save a non contiguous tensor: `transformer.layers.0.mlp.fc.weights_scaling_factor` which is not allowed. It either means you are trying to save tensors which are reference of each other in which case it's recommended to save only the full tensors, and reslice at load time, or simply call `.contiguous()` on your tensor to pack it before saving.
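Following the hint in the message, one possible workaround (a minimal sketch of the idea, not TensorRT-LLM's actual save path; the `weights` dict and `save_contiguous` helper are hypothetical names) is to repack every tensor with .contiguous() before safetensors serializes it:

import torch
from safetensors.torch import save_file

def save_contiguous(weights: dict[str, torch.Tensor], path: str) -> None:
    # Tensor-parallel sharding can leave tensors as views into a larger
    # buffer; safetensors refuses to serialize such non-contiguous tensors,
    # so repack each one with .contiguous() before saving.
    weights = {name: t.contiguous() for name, t in weights.items()}
    save_file(weights, path)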

Since some issues say this error is related to the transformers version, I tested transformers 4.38.2 and 4.41, but both failed.
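For anyone reproducing this, each version was pinned with pip before retrying the conversion, e.g.:

pip install transformers==4.38.2
pip install transformers==4.41.0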

My runs:

tp_size = 2: failed
tp_size = 1, pp_size = 1: succeeded
pp_size = 2: succeeded (the variant is sketched below)
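For reference, the pipeline-parallel run that succeeded used the same flags, with the tensor-parallel option replaced by --pp_size (the output directory name here is hypothetical):

python3 convert_checkpoint.py \
    --model_dir         ./Qwen1.5-72B-Chat-GPTQ-Int4/ \
    --output_dir        ./tllm_checkpoint_2gpu_gptq_pp/ \
    --dtype float16 \
    --use_weight_only \
    --weight_only_precision int4_gptq \
    --per_group \
    --load_model_on_cpu \
    --pp_size 2 \
    --n_head 64 \
    --n_kv_head 64 \
    --qwen_type qwen2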

Expected behavior

The conversion succeeds.

Actual behavior

ValueError: You are trying to save a non contiguous tensor: `encoder_decoder.encoder.block.0.layer.0.SelfAttention.q.weight` which is not allowed. It either means you are trying to save tensors which are reference of each other in which case it's recommended to save only the full tensors, and reslice at load time, or simply call `.contiguous()` on your tensor to pack it before saving.

Additional notes

byshiue commented 4 months ago

Sorry for the delayed response. Could you try the latest main branch?

commit b777bd64750abf30ca7eda48e8b6ba3c5174aafd (HEAD, origin/main)
Author: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>
Date:   Tue Jun 4 20:26:32 2024 +0800

    Update TensorRT-LLM (#1725)

    * Update TensorRT-LLM

We have updated the Qwen support recently, and I can run the conversion on the main branch successfully.
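For reference, updating an existing source clone to the latest main looks like this (a sketch, assuming you installed from source):

cd TensorRT-LLM
git fetch origin
git checkout main
git pull origin main
# then re-run the convert_checkpoint.py command above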