NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

Tensor parallel fails but pipeline parallel succeeds for Qwen2 during checkpoint conversion #1662

Open 2019zhou opened 4 months ago

2019zhou commented 4 months ago

System Info

Who can help?

@T

Information

Tasks

Reproduction

My command to convert the checkpoint failed:

python3 convert_checkpoint.py \
    --model_dir         ./Qwen1.5-72B-Chat-GPTQ-Int4/ \
    --output_dir        ./tllm_checkpoint_2gpu_gptq/ \
    --dtype float16 \
    --use_weight_only \
    --weight_only_precision int4_gptq \
    --per_group \
    --load_model_on_cpu \
    --tp_size 2 \
    --n_head 64 \
    --n_kv_head 64 \
    --qwen_type qwen2

with the error log:

ValueError: You are trying to save a non contiguous tensor: `transformer.layers.0.mlp.fc.weights_scaling_factor` which is not allowed. It either means you are trying to save tensors which are reference of each other in which case it's recommended to save only the full tensors, and reslice at load time, or simply call `.contiguous()` on your tensor to pack it before saving.
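Following the hint in the message, one possible workaround (a minimal sketch of the idea, not TensorRT-LLM's actual save path; the `weights` dict and `save_contiguous` helper are hypothetical names) is to repack every tensor with .contiguous() before safetensors serializes it:

import torch
from safetensors.torch import save_file

def save_contiguous(weights: dict[str, torch.Tensor], path: str) -> None:
    # Tensor-parallel sharding can leave tensors as views into a larger
    # buffer; safetensors refuses to serialize such non-contiguous tensors,
    # so repack each one with .contiguous() before saving.
    weights = {name: t.contiguous() for name, t in weights.items()}
    save_file(weights, path)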

Since some issues say this error is related to the transformers version, I tested transformers 4.38.2 and 4.41, but both failed.
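For anyone reproducing this, each version was pinned with pip before retrying the conversion, e.g.:

pip install transformers==4.38.2
pip install transformers==4.41.0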

My runs:

tp_size = 2: failed
tp_size = 1, pp_size = 1: succeeded
pp_size = 2: succeeded (the variant is sketched below)
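For reference, the pipeline-parallel run that succeeded used the same flags, with the tensor-parallel option replaced by --pp_size (the output directory name here is hypothetical):

python3 convert_checkpoint.py \
    --model_dir         ./Qwen1.5-72B-Chat-GPTQ-Int4/ \
    --output_dir        ./tllm_checkpoint_2gpu_gptq_pp/ \
    --dtype float16 \
    --use_weight_only \
    --weight_only_precision int4_gptq \
    --per_group \
    --load_model_on_cpu \
    --pp_size 2 \
    --n_head 64 \
    --n_kv_head 64 \
    --qwen_type qwen2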

Expected behavior

The conversion succeeds.

Actual behavior

ValueError: You are trying to save a non contiguous tensor: `encoder_decoder.encoder.block.0.layer.0.SelfAttention.q.weight` which is not allowed. It either means you are trying to save tensors which are reference of each other in which case it's recommended to save only the full tensors, and reslice at load time, or simply call `.contiguous()` on your tensor to pack it before saving.

Additional notes

byshiue commented 4 months ago

Sorry for the delayed response. Could you try the latest main branch?

commit b777bd64750abf30ca7eda48e8b6ba3c5174aafd (HEAD, origin/main)
Author: Kaiyu Xie <26294424+kaiyux@users.noreply.github.com>
Date:   Tue Jun 4 20:26:32 2024 +0800

    Update TensorRT-LLM (#1725)

    * Update TensorRT-LLM

We have updated the Qwen support recently, and I can run the conversion on the main branch successfully.
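For reference, updating an existing source clone to the latest main looks like this (a sketch, assuming you installed from source):

cd TensorRT-LLM
git fetch origin
git checkout main
git pull origin main
# then re-run the convert_checkpoint.py command above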