NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

Qwen2-72B-Instruct-GPTQ-Int4 Conversion Success, Run Failure #1920

Open · linchpinlin opened this issue 1 month ago

linchpinlin commented 1 month ago

System Info

NVIDIA-SMI 535.154.05
Driver Version: 535.154.05
CUDA Version: 12.4

Who can help?

No response

Information

Tasks

Reproduction

All my operations are inside nvcr.io/nvidia/tensorrt:24.05-py3.

cd /workspace/TensorRT-LLM/examples/qwen
python3 convert_checkpoint.py --model_dir /container_dir/Qwen/models--Qwen--Qwen2-72B-Instruct-GPTQ-Int4/snapshots/6b82a333287651211b1cae443ff2d2a6802597b9/ \
    --output_dir /container_dir/Qwen/Qwen2-72B-Instruct-GPTQ-Int4-trtllm_checkpoint_2gpu_Int4 \
    --dtype float16 \
    --use_weight_only \
    --weight_only_precision int4_gptq \
    --per_group \
    --tp_size 2

[TensorRT-LLM] TensorRT-LLM version: 0.12.0.dev2024070200
0.12.0.dev2024070200
[07/09/2024-06:47:10] CUDA extension not installed.
[07/09/2024-06:47:10] CUDA extension not installed.
/usr/local/lib/python3.10/dist-packages/transformers/modeling_utils.py:4371: FutureWarning: `_is_quantized_training_enabled` is going to be deprecated in transformers 4.39.0. Please use `model.hf_quantizer.is_trainable` instead
  warnings.warn(
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 11/11 [00:08<00:00,  1.26it/s]
Some weights of the model checkpoint at /container_dir/Qwen/models--Qwen--Qwen2-72B-Instruct-GPTQ-Int4/snapshots/6b82a333287651211b1cae443ff2d2a6802597b9/ were not used when initializing Qwen2ForCausalLM: ['model.layers.0.mlp.down_proj.bias', 'model.layers.0.mlp.gate_proj.bias', 'model.layers.0.mlp.up_proj.bias', 'model.layers.0.self_attn.o_proj.bias', 'model.layers.1.mlp.down_proj.bias', 'model.layers.1.mlp.gate_proj.bias', 'model.layers.1.mlp.up_proj.bias', 'model.layers.1.self_attn.o_proj.bias', 'model.layers.10.mlp.down_proj.bias', 'model.layers.10.mlp.gate_proj.bias', 'model.layers.10.mlp.up_proj.bias', 'model.layers.10.self_attn.o_proj.bias', 'model.layers.11.mlp.down_proj.bias', 'model.layers.11.mlp.gate_proj.bias', 'model.layers.11.mlp.up_proj.bias', 'model.layers.11.self_attn.o_proj.bias', 'model.layers.12.mlp.down_proj.bias', 'model.layers.12.mlp.gate_proj.bias', 'model.layers.12.mlp.up_proj.bias', 'model.layers.12.self_attn.o_proj.bias', 'model.layers.13.mlp.down_proj.bias', 'model.layers.13.mlp.gate_proj.bias', 'model.layers.13.mlp.up_proj.bias', 'model.layers.13.self_attn.o_proj.bias', 'model.layers.14.mlp.down_proj.bias', 'model.layers.14.mlp.gate_proj.bias', 'model.layers.14.mlp.up_proj.bias', 'model.layers.14.self_attn.o_proj.bias', 'model.layers.15.mlp.down_proj.bias', 'model.layers.15.mlp.gate_proj.bias', 'model.layers.15.mlp.up_proj.bias', 'model.layers.15.self_attn.o_proj.bias', 'model.layers.16.mlp.down_proj.bias', 'model.layers.16.mlp.gate_proj.bias', 'model.layers.16.mlp.up_proj.bias', 'model.layers.16.self_attn.o_proj.bias', 'model.layers.17.mlp.down_proj.bias', 'model.layers.17.mlp.gate_proj.bias', 'model.layers.17.mlp.up_proj.bias', 'model.layers.17.self_attn.o_proj.bias', 'model.layers.18.mlp.down_proj.bias', 'model.layers.18.mlp.gate_proj.bias', 'model.layers.18.mlp.up_proj.bias', 'model.layers.18.self_attn.o_proj.bias', 'model.layers.19.mlp.down_proj.bias', 'model.layers.19.mlp.gate_proj.bias', 'model.layers.19.mlp.up_proj.bias', 'model.layers.19.self_attn.o_proj.bias', 'model.layers.2.mlp.down_proj.bias', 'model.layers.2.mlp.gate_proj.bias', 'model.layers.2.mlp.up_proj.bias', 'model.layers.2.self_attn.o_proj.bias', 'model.layers.20.mlp.down_proj.bias', 'model.layers.20.mlp.gate_proj.bias', 'model.layers.20.mlp.up_proj.bias', 'model.layers.20.self_attn.o_proj.bias', 'model.layers.21.mlp.down_proj.bias', 'model.layers.21.mlp.gate_proj.bias', 'model.layers.21.mlp.up_proj.bias', 'model.layers.21.self_attn.o_proj.bias', 'model.layers.22.mlp.down_proj.bias', 'model.layers.22.mlp.gate_proj.bias', 'model.layers.22.mlp.up_proj.bias', 'model.layers.22.self_attn.o_proj.bias', 'model.layers.23.mlp.down_proj.bias', 'model.layers.23.mlp.gate_proj.bias', 'model.layers.23.mlp.up_proj.bias', 'model.layers.23.self_attn.o_proj.bias', 'model.layers.24.mlp.down_proj.bias', 'model.layers.24.mlp.gate_proj.bias', 'model.layers.24.mlp.up_proj.bias', 'model.layers.24.self_attn.o_proj.bias', 'model.layers.25.mlp.down_proj.bias', 'model.layers.25.mlp.gate_proj.bias', 'model.layers.25.mlp.up_proj.bias', 'model.layers.25.self_attn.o_proj.bias', 'model.layers.26.mlp.down_proj.bias', 'model.layers.26.mlp.gate_proj.bias', 'model.layers.26.mlp.up_proj.bias', 'model.layers.26.self_attn.o_proj.bias', 'model.layers.27.mlp.down_proj.bias', 'model.layers.27.mlp.gate_proj.bias', 'model.layers.27.mlp.up_proj.bias', 'model.layers.27.self_attn.o_proj.bias', 'model.layers.28.mlp.down_proj.bias', 'model.layers.28.mlp.gate_proj.bias', 'model.layers.28.mlp.up_proj.bias', 
'model.layers.28.self_attn.o_proj.bias', 'model.layers.29.mlp.down_proj.bias', 'model.layers.29.mlp.gate_proj.bias', 'model.layers.29.mlp.up_proj.bias', 'model.layers.29.self_attn.o_proj.bias', 'model.layers.3.mlp.down_proj.bias', 'model.layers.3.mlp.gate_proj.bias', 'model.layers.3.mlp.up_proj.bias', 'model.layers.3.self_attn.o_proj.bias', 'model.layers.30.mlp.down_proj.bias', 'model.layers.30.mlp.gate_proj.bias', 'model.layers.30.mlp.up_proj.bias', 'model.layers.30.self_attn.o_proj.bias', 'model.layers.31.mlp.down_proj.bias', 'model.layers.31.mlp.gate_proj.bias', 'model.layers.31.mlp.up_proj.bias', 'model.layers.31.self_attn.o_proj.bias', 'model.layers.32.mlp.down_proj.bias', 'model.layers.32.mlp.gate_proj.bias', 'model.layers.32.mlp.up_proj.bias', 'model.layers.32.self_attn.o_proj.bias', 'model.layers.33.mlp.down_proj.bias', 'model.layers.33.mlp.gate_proj.bias', 'model.layers.33.mlp.up_proj.bias', 'model.layers.33.self_attn.o_proj.bias', 'model.layers.34.mlp.down_proj.bias', 'model.layers.34.mlp.gate_proj.bias', 'model.layers.34.mlp.up_proj.bias', 'model.layers.34.self_attn.o_proj.bias', 'model.layers.35.mlp.down_proj.bias', 'model.layers.35.mlp.gate_proj.bias', 'model.layers.35.mlp.up_proj.bias', 'model.layers.35.self_attn.o_proj.bias', 'model.layers.36.mlp.down_proj.bias', 'model.layers.36.mlp.gate_proj.bias', 'model.layers.36.mlp.up_proj.bias', 'model.layers.36.self_attn.o_proj.bias', 'model.layers.37.mlp.down_proj.bias', 'model.layers.37.mlp.gate_proj.bias', 'model.layers.37.mlp.up_proj.bias', 'model.layers.37.self_attn.o_proj.bias', 'model.layers.38.mlp.down_proj.bias', 'model.layers.38.mlp.gate_proj.bias', 'model.layers.38.mlp.up_proj.bias', 'model.layers.38.self_attn.o_proj.bias', 'model.layers.39.mlp.down_proj.bias', 'model.layers.39.mlp.gate_proj.bias', 'model.layers.39.mlp.up_proj.bias', 'model.layers.39.self_attn.o_proj.bias', 'model.layers.4.mlp.down_proj.bias', 'model.layers.4.mlp.gate_proj.bias', 'model.layers.4.mlp.up_proj.bias', 'model.layers.4.self_attn.o_proj.bias', 'model.layers.40.mlp.down_proj.bias', 'model.layers.40.mlp.gate_proj.bias', 'model.layers.40.mlp.up_proj.bias', 'model.layers.40.self_attn.o_proj.bias', 'model.layers.41.mlp.down_proj.bias', 'model.layers.41.mlp.gate_proj.bias', 'model.layers.41.mlp.up_proj.bias', 'model.layers.41.self_attn.o_proj.bias', 'model.layers.42.mlp.down_proj.bias', 'model.layers.42.mlp.gate_proj.bias', 'model.layers.42.mlp.up_proj.bias', 'model.layers.42.self_attn.o_proj.bias', 'model.layers.43.mlp.down_proj.bias', 'model.layers.43.mlp.gate_proj.bias', 'model.layers.43.mlp.up_proj.bias', 'model.layers.43.self_attn.o_proj.bias', 'model.layers.44.mlp.down_proj.bias', 'model.layers.44.mlp.gate_proj.bias', 'model.layers.44.mlp.up_proj.bias', 'model.layers.44.self_attn.o_proj.bias', 'model.layers.45.mlp.down_proj.bias', 'model.layers.45.mlp.gate_proj.bias', 'model.layers.45.mlp.up_proj.bias', 'model.layers.45.self_attn.o_proj.bias', 'model.layers.46.mlp.down_proj.bias', 'model.layers.46.mlp.gate_proj.bias', 'model.layers.46.mlp.up_proj.bias', 'model.layers.46.self_attn.o_proj.bias', 'model.layers.47.mlp.down_proj.bias', 'model.layers.47.mlp.gate_proj.bias', 'model.layers.47.mlp.up_proj.bias', 'model.layers.47.self_attn.o_proj.bias', 'model.layers.48.mlp.down_proj.bias', 'model.layers.48.mlp.gate_proj.bias', 'model.layers.48.mlp.up_proj.bias', 'model.layers.48.self_attn.o_proj.bias', 'model.layers.49.mlp.down_proj.bias', 'model.layers.49.mlp.gate_proj.bias', 'model.layers.49.mlp.up_proj.bias', 'model.layers.49.self_attn.o_proj.bias', 
'model.layers.5.mlp.down_proj.bias', 'model.layers.5.mlp.gate_proj.bias', 'model.layers.5.mlp.up_proj.bias', 'model.layers.5.self_attn.o_proj.bias', 'model.layers.50.mlp.down_proj.bias', 'model.layers.50.mlp.gate_proj.bias', 'model.layers.50.mlp.up_proj.bias', 'model.layers.50.self_attn.o_proj.bias', 'model.layers.51.mlp.down_proj.bias', 'model.layers.51.mlp.gate_proj.bias', 'model.layers.51.mlp.up_proj.bias', 'model.layers.51.self_attn.o_proj.bias', 'model.layers.52.mlp.down_proj.bias', 'model.layers.52.mlp.gate_proj.bias', 'model.layers.52.mlp.up_proj.bias', 'model.layers.52.self_attn.o_proj.bias', 'model.layers.53.mlp.down_proj.bias', 'model.layers.53.mlp.gate_proj.bias', 'model.layers.53.mlp.up_proj.bias', 'model.layers.53.self_attn.o_proj.bias', 'model.layers.54.mlp.down_proj.bias', 'model.layers.54.mlp.gate_proj.bias', 'model.layers.54.mlp.up_proj.bias', 'model.layers.54.self_attn.o_proj.bias', 'model.layers.55.mlp.down_proj.bias', 'model.layers.55.mlp.gate_proj.bias', 'model.layers.55.mlp.up_proj.bias', 'model.layers.55.self_attn.o_proj.bias', 'model.layers.56.mlp.down_proj.bias', 'model.layers.56.mlp.gate_proj.bias', 'model.layers.56.mlp.up_proj.bias', 'model.layers.56.self_attn.o_proj.bias', 'model.layers.57.mlp.down_proj.bias', 'model.layers.57.mlp.gate_proj.bias', 'model.layers.57.mlp.up_proj.bias', 'model.layers.57.self_attn.o_proj.bias', 'model.layers.58.mlp.down_proj.bias', 'model.layers.58.mlp.gate_proj.bias', 'model.layers.58.mlp.up_proj.bias', 'model.layers.58.self_attn.o_proj.bias', 'model.layers.59.mlp.down_proj.bias', 'model.layers.59.mlp.gate_proj.bias', 'model.layers.59.mlp.up_proj.bias', 'model.layers.59.self_attn.o_proj.bias', 'model.layers.6.mlp.down_proj.bias', 'model.layers.6.mlp.gate_proj.bias', 'model.layers.6.mlp.up_proj.bias', 'model.layers.6.self_attn.o_proj.bias', 'model.layers.60.mlp.down_proj.bias', 'model.layers.60.mlp.gate_proj.bias', 'model.layers.60.mlp.up_proj.bias', 'model.layers.60.self_attn.o_proj.bias', 'model.layers.61.mlp.down_proj.bias', 'model.layers.61.mlp.gate_proj.bias', 'model.layers.61.mlp.up_proj.bias', 'model.layers.61.self_attn.o_proj.bias', 'model.layers.62.mlp.down_proj.bias', 'model.layers.62.mlp.gate_proj.bias', 'model.layers.62.mlp.up_proj.bias', 'model.layers.62.self_attn.o_proj.bias', 'model.layers.63.mlp.down_proj.bias', 'model.layers.63.mlp.gate_proj.bias', 'model.layers.63.mlp.up_proj.bias', 'model.layers.63.self_attn.o_proj.bias', 'model.layers.64.mlp.down_proj.bias', 'model.layers.64.mlp.gate_proj.bias', 'model.layers.64.mlp.up_proj.bias', 'model.layers.64.self_attn.o_proj.bias', 'model.layers.65.mlp.down_proj.bias', 'model.layers.65.mlp.gate_proj.bias', 'model.layers.65.mlp.up_proj.bias', 'model.layers.65.self_attn.o_proj.bias', 'model.layers.66.mlp.down_proj.bias', 'model.layers.66.mlp.gate_proj.bias', 'model.layers.66.mlp.up_proj.bias', 'model.layers.66.self_attn.o_proj.bias', 'model.layers.67.mlp.down_proj.bias', 'model.layers.67.mlp.gate_proj.bias', 'model.layers.67.mlp.up_proj.bias', 'model.layers.67.self_attn.o_proj.bias', 'model.layers.68.mlp.down_proj.bias', 'model.layers.68.mlp.gate_proj.bias', 'model.layers.68.mlp.up_proj.bias', 'model.layers.68.self_attn.o_proj.bias', 'model.layers.69.mlp.down_proj.bias', 'model.layers.69.mlp.gate_proj.bias', 'model.layers.69.mlp.up_proj.bias', 'model.layers.69.self_attn.o_proj.bias', 'model.layers.7.mlp.down_proj.bias', 'model.layers.7.mlp.gate_proj.bias', 'model.layers.7.mlp.up_proj.bias', 'model.layers.7.self_attn.o_proj.bias', 'model.layers.70.mlp.down_proj.bias', 
'model.layers.70.mlp.gate_proj.bias', 'model.layers.70.mlp.up_proj.bias', 'model.layers.70.self_attn.o_proj.bias', 'model.layers.71.mlp.down_proj.bias', 'model.layers.71.mlp.gate_proj.bias', 'model.layers.71.mlp.up_proj.bias', 'model.layers.71.self_attn.o_proj.bias', 'model.layers.72.mlp.down_proj.bias', 'model.layers.72.mlp.gate_proj.bias', 'model.layers.72.mlp.up_proj.bias', 'model.layers.72.self_attn.o_proj.bias', 'model.layers.73.mlp.down_proj.bias', 'model.layers.73.mlp.gate_proj.bias', 'model.layers.73.mlp.up_proj.bias', 'model.layers.73.self_attn.o_proj.bias', 'model.layers.74.mlp.down_proj.bias', 'model.layers.74.mlp.gate_proj.bias', 'model.layers.74.mlp.up_proj.bias', 'model.layers.74.self_attn.o_proj.bias', 'model.layers.75.mlp.down_proj.bias', 'model.layers.75.mlp.gate_proj.bias', 'model.layers.75.mlp.up_proj.bias', 'model.layers.75.self_attn.o_proj.bias', 'model.layers.76.mlp.down_proj.bias', 'model.layers.76.mlp.gate_proj.bias', 'model.layers.76.mlp.up_proj.bias', 'model.layers.76.self_attn.o_proj.bias', 'model.layers.77.mlp.down_proj.bias', 'model.layers.77.mlp.gate_proj.bias', 'model.layers.77.mlp.up_proj.bias', 'model.layers.77.self_attn.o_proj.bias', 'model.layers.78.mlp.down_proj.bias', 'model.layers.78.mlp.gate_proj.bias', 'model.layers.78.mlp.up_proj.bias', 'model.layers.78.self_attn.o_proj.bias', 'model.layers.79.mlp.down_proj.bias', 'model.layers.79.mlp.gate_proj.bias', 'model.layers.79.mlp.up_proj.bias', 'model.layers.79.self_attn.o_proj.bias', 'model.layers.8.mlp.down_proj.bias', 'model.layers.8.mlp.gate_proj.bias', 'model.layers.8.mlp.up_proj.bias', 'model.layers.8.self_attn.o_proj.bias', 'model.layers.9.mlp.down_proj.bias', 'model.layers.9.mlp.gate_proj.bias', 'model.layers.9.mlp.up_proj.bias', 'model.layers.9.self_attn.o_proj.bias']
- This IS expected if you are initializing Qwen2ForCausalLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing Qwen2ForCausalLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
loading weight in each layer...: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 80/80 [02:39<00:00,  2.00s/it]
loading weight in each layer...: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 80/80 [02:38<00:00,  1.98s/it]
Total time of converting checkpoints: 00:06:16

trtllm-build --checkpoint_dir /container_dir/Qwen/Qwen2-72B-Instruct-GPTQ-Int4-trtllm_checkpoint_2gpu_Int4 \
    --output_dir /container_dir/Qwen/Qwen2-72B-Instruct-GPTQ-Int4-trtllm_engine_2gpu_Int4 \
    --gemm_plugin float16

[TensorRT-LLM] TensorRT-LLM version: 0.12.0.dev2024070200
[07/09/2024-06:57:06] [TRT-LLM] [I] Set bert_attention_plugin to auto.
[07/09/2024-06:57:06] [TRT-LLM] [I] Set gpt_attention_plugin to auto.
[07/09/2024-06:57:06] [TRT-LLM] [I] Set gemm_plugin to float16.
[07/09/2024-06:57:06] [TRT-LLM] [I] Set gemm_swiglu_plugin to None.
[07/09/2024-06:57:06] [TRT-LLM] [I] Set nccl_plugin to auto.
[07/09/2024-06:57:06] [TRT-LLM] [I] Set lookup_plugin to None.
[07/09/2024-06:57:06] [TRT-LLM] [I] Set lora_plugin to None.
[07/09/2024-06:57:06] [TRT-LLM] [I] Set moe_plugin to auto.
[07/09/2024-06:57:06] [TRT-LLM] [I] Set mamba_conv1d_plugin to auto.
[07/09/2024-06:57:06] [TRT-LLM] [I] Set context_fmha to True.
[07/09/2024-06:57:06] [TRT-LLM] [I] Set context_fmha_fp32_acc to False.
[07/09/2024-06:57:06] [TRT-LLM] [I] Set paged_kv_cache to True.
[07/09/2024-06:57:06] [TRT-LLM] [I] Set remove_input_padding to True.
[07/09/2024-06:57:06] [TRT-LLM] [I] Set use_custom_all_reduce to True.
[07/09/2024-06:57:06] [TRT-LLM] [I] Set reduce_fusion to False.
[07/09/2024-06:57:06] [TRT-LLM] [I] Set multi_block_mode to False.
[07/09/2024-06:57:06] [TRT-LLM] [I] Set enable_xqa to True.
[07/09/2024-06:57:06] [TRT-LLM] [I] Set tokens_per_block to 64.
[07/09/2024-06:57:06] [TRT-LLM] [I] Set use_paged_context_fmha to False.
[07/09/2024-06:57:06] [TRT-LLM] [I] Set use_fp8_context_fmha to False.
[07/09/2024-06:57:06] [TRT-LLM] [I] Set multiple_profiles to False.
[07/09/2024-06:57:06] [TRT-LLM] [I] Set paged_state to True.
[07/09/2024-06:57:06] [TRT-LLM] [I] Set streamingllm to False.
[07/09/2024-06:57:06] [TRT-LLM] [W] Implicitly setting PretrainedConfig.rotary_scaling = None
[07/09/2024-06:57:06] [TRT-LLM] [W] Implicitly setting PretrainedConfig.moe_normalization_mode = 0
[07/09/2024-06:57:06] [TRT-LLM] [W] Implicitly setting PretrainedConfig.rotary_base = 1000000.0
[07/09/2024-06:57:06] [TRT-LLM] [W] Implicitly setting PretrainedConfig.qwen_type = qwen2
[07/09/2024-06:57:06] [TRT-LLM] [W] Implicitly setting PretrainedConfig.moe_num_experts = 0
[07/09/2024-06:57:06] [TRT-LLM] [W] Implicitly setting PretrainedConfig.moe_top_k = 0
[07/09/2024-06:57:06] [TRT-LLM] [W] Implicitly setting PretrainedConfig.moe_intermediate_size = 0
[07/09/2024-06:57:06] [TRT-LLM] [W] Implicitly setting PretrainedConfig.moe_shared_expert_intermediate_size = 0
[07/09/2024-06:57:06] [TRT-LLM] [W] Implicitly setting PretrainedConfig.disable_weight_only_quant_plugin = False
[07/09/2024-06:57:06] [TRT-LLM] [I] Compute capability: (8, 9)
[07/09/2024-06:57:06] [TRT-LLM] [I] SM count: 92
[07/09/2024-06:57:06] [TRT-LLM] [I] SM clock: 2520 MHz
[07/09/2024-06:57:06] [TRT-LLM] [I] int4 TFLOPS: 474
[07/09/2024-06:57:06] [TRT-LLM] [I] int8 TFLOPS: 237
[07/09/2024-06:57:06] [TRT-LLM] [I] fp8 TFLOPS: 237
[07/09/2024-06:57:06] [TRT-LLM] [I] float16 TFLOPS: 118
[07/09/2024-06:57:06] [TRT-LLM] [I] bfloat16 TFLOPS: 118
[07/09/2024-06:57:06] [TRT-LLM] [I] float32 TFLOPS: 59
[07/09/2024-06:57:06] [TRT-LLM] [I] Total Memory: 44 GiB
[07/09/2024-06:57:06] [TRT-LLM] [I] Memory clock: 9001 MHz
[07/09/2024-06:57:06] [TRT-LLM] [I] Memory bus width: 384
[07/09/2024-06:57:06] [TRT-LLM] [I] Memory bandwidth: 864 GB/s
[07/09/2024-06:57:06] [TRT-LLM] [I] PCIe speed: 2500 Mbps
[07/09/2024-06:57:06] [TRT-LLM] [I] PCIe link width: 16
[07/09/2024-06:57:06] [TRT-LLM] [I] PCIe bandwidth: 5 GB/s
[07/09/2024-06:57:06] [TRT-LLM] [I] max_seq_len is not specified, using value 32768
[07/09/2024-06:57:06] [TRT-LLM] [W] remove_input_padding is enabled, while opt_num_tokens is not set, setting to max_batch_size*max_beam_width. 

[07/09/2024-06:57:06] [TRT-LLM] [W] padding removal and fMHA are both enabled, max_input_len is not required and will be ignored
[07/09/2024-06:57:08] [TRT-LLM] [I] Set dtype to float16.
[07/09/2024-06:57:08] [TRT] [I] [MemUsageChange] Init CUDA: CPU +14, GPU +0, now: CPU 143, GPU 290 (MiB)
[07/09/2024-06:57:10] [TRT] [I] [MemUsageChange] Init builder kernel library: CPU +1643, GPU +292, now: CPU 1934, GPU 582 (MiB)
[07/09/2024-06:57:10] [TRT] [W] profileSharing0806 is on by default in TensorRT 10.0. This flag is deprecated and has no effect.
[07/09/2024-06:57:10] [TRT-LLM] [I] Set weight_only_groupwise_quant_matmul_plugin to float16.
[07/09/2024-06:57:10] [TRT-LLM] [I] Set nccl_plugin to float16.
[07/09/2024-06:57:10] [TRT-LLM] [I] Set use_custom_all_reduce to True.
[07/09/2024-06:57:10] [TRT-LLM] [I] Build TensorRT engine Unnamed Network 0
[07/09/2024-06:57:10] [TRT] [W] Unused Input: position_ids
[07/09/2024-06:57:11] [TRT] [W] Detected layernorm nodes in FP16.
[07/09/2024-06:57:11] [TRT] [W] Running layernorm after self-attention in FP16 may cause overflow. Exporting the model to the latest available ONNX opset (later than opset 17) to use the INormalizationLayer, or forcing layernorm layers to run in FP32 precision can help with preserving accuracy.
[07/09/2024-06:57:11] [TRT] [W] [RemoveDeadLayers] Input Tensor position_ids is unused or used only at compile-time, but is not being removed.
[07/09/2024-06:57:11] [TRT] [I] Global timing cache in use. Profiling results in this builder pass will be stored.
[07/09/2024-06:57:17] [TRT] [I] [GraphReduction] The approximate region cut reduction algorithm is called.
[07/09/2024-06:57:17] [TRT] [I] Detected 15 inputs and 1 output network tensors.
[07/09/2024-06:57:29] [TRT] [I] Total Host Persistent Memory: 361472
[07/09/2024-06:57:29] [TRT] [I] Total Device Persistent Memory: 0
[07/09/2024-06:57:29] [TRT] [I] Total Scratch Memory: 268468224
[07/09/2024-06:57:29] [TRT] [I] [BlockAssignment] Started assigning block shifts. This will take 1538 steps to complete.
[07/09/2024-06:57:29] [TRT] [I] [BlockAssignment] Algorithm ShiftNTopDown took 127.69ms to assign 17 blocks to 1538 nodes requiring 1023449088 bytes.
[07/09/2024-06:57:29] [TRT] [I] Total Activation Memory: 1023448064
[07/09/2024-06:57:29] [TRT] [I] Total Weights Memory: 21927018496
[07/09/2024-06:58:16] [TRT] [I] Engine generation completed in 65.7239 seconds.
[07/09/2024-06:58:16] [TRT] [I] [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 16 MiB, GPU 20911 MiB
[07/09/2024-06:58:24] [TRT] [I] [MemUsageStats] Peak memory usage during Engine building and serialization: CPU: 47433 MiB
[07/09/2024-06:58:25] [TRT-LLM] [I] Total time of building Unnamed Network 0: 00:01:14
[07/09/2024-06:58:25] [TRT] [I] Serialized 26 bytes of code generator cache.
[07/09/2024-06:58:25] [TRT] [I] Serialized 178026 bytes of compilation cache.
[07/09/2024-06:58:25] [TRT] [I] Serialized 11 timing cache entries
[07/09/2024-06:58:25] [TRT-LLM] [I] Timing cache serialized to model.cache
[07/09/2024-06:58:25] [TRT-LLM] [I] Serializing engine to /container_dir/Qwen/Qwen2-72B-Instruct-GPTQ-Int4-trtllm_engine_2gpu_Int4/rank0.engine...
[07/09/2024-06:58:33] [TRT-LLM] [I] Engine serialized. Total time: 00:00:08
[07/09/2024-06:58:35] [TRT-LLM] [I] Set dtype to float16.
[07/09/2024-06:58:35] [TRT] [I] [MemUsageChange] Init CUDA: CPU +0, GPU +0, now: CPU 2068, GPU 640 (MiB)
[07/09/2024-06:58:35] [TRT] [W] profileSharing0806 is on by default in TensorRT 10.0. This flag is deprecated and has no effect.
[07/09/2024-06:58:35] [TRT-LLM] [I] Set weight_only_groupwise_quant_matmul_plugin to float16.
[07/09/2024-06:58:35] [TRT-LLM] [I] Set nccl_plugin to float16.
[07/09/2024-06:58:35] [TRT-LLM] [I] Set use_custom_all_reduce to True.
[07/09/2024-06:58:36] [TRT-LLM] [I] Build TensorRT engine Unnamed Network 0
[07/09/2024-06:58:36] [TRT] [W] Unused Input: position_ids
[07/09/2024-06:58:36] [TRT] [W] Detected layernorm nodes in FP16.
[07/09/2024-06:58:36] [TRT] [W] Running layernorm after self-attention in FP16 may cause overflow. Exporting the model to the latest available ONNX opset (later than opset 17) to use the INormalizationLayer, or forcing layernorm layers to run in FP32 precision can help with preserving accuracy.
[07/09/2024-06:58:36] [TRT] [W] [RemoveDeadLayers] Input Tensor position_ids is unused or used only at compile-time, but is not being removed.
[07/09/2024-06:58:36] [TRT] [I] Global timing cache in use. Profiling results in this builder pass will be stored.
[07/09/2024-06:58:42] [TRT] [I] [GraphReduction] The approximate region cut reduction algorithm is called.
[07/09/2024-06:58:42] [TRT] [I] Detected 15 inputs and 1 output network tensors.
[07/09/2024-06:58:53] [TRT] [I] Total Host Persistent Memory: 361472
[07/09/2024-06:58:53] [TRT] [I] Total Device Persistent Memory: 0
[07/09/2024-06:58:53] [TRT] [I] Total Scratch Memory: 268468224
[07/09/2024-06:58:53] [TRT] [I] [BlockAssignment] Started assigning block shifts. This will take 1538 steps to complete.
[07/09/2024-06:58:53] [TRT] [I] [BlockAssignment] Algorithm ShiftNTopDown took 128.963ms to assign 17 blocks to 1538 nodes requiring 1023449088 bytes.
[07/09/2024-06:58:53] [TRT] [I] Total Activation Memory: 1023448064
[07/09/2024-06:58:53] [TRT] [I] Total Weights Memory: 21927018496
[07/09/2024-06:58:53] [TRT] [I] Engine generation completed in 17.2935 seconds.
[07/09/2024-06:58:53] [TRT] [I] [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 16 MiB, GPU 20911 MiB
[07/09/2024-06:59:01] [TRT] [I] [MemUsageStats] Peak memory usage during Engine building and serialization: CPU: 70190 MiB
[07/09/2024-06:59:01] [TRT-LLM] [I] Total time of building Unnamed Network 0: 00:00:25
[07/09/2024-06:59:02] [TRT-LLM] [I] Serializing engine to /container_dir/Qwen/Qwen2-72B-Instruct-GPTQ-Int4-trtllm_engine_2gpu_Int4/rank1.engine...
[07/09/2024-06:59:10] [TRT-LLM] [I] Engine serialized. Total time: 00:00:08
[07/09/2024-06:59:10] [TRT-LLM] [I] Total time of building all engines: 00:02:03

mpirun -n 2 --allow-run-as-root python3 ../run.py --input_text "Hi. What's Your Name?" \
    --max_output_len=500 \
    --tokenizer_dir /container_dir/Qwen/models--Qwen--Qwen2-72B-Instruct-GPTQ-Int4/snapshots/6b82a333287651211b1cae443ff2d2a6802597b9/ \
    --engine_dir=/container_dir/Qwen/Qwen2-72B-Instruct-GPTQ-Int4-trtllm_engine_2gpu_Int4/

Expected behavior

Input [Text 0]: "<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
What is your name?<|im_end|>
<|im_start|>assistant
"
Output [Text 0 Beam 0]: "I am QianWen, a large language model created by Alibaba Cloud."

Actual behavior

[TensorRT-LLM] TensorRT-LLM version: 0.12.0.dev2024070200
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[TensorRT-LLM][WARNING] Device 1 peer access Device 0 is not available.
[TensorRT-LLM][WARNING] Device 1 peer access Device 2 is not available.
[TensorRT-LLM][WARNING] Device 1 peer access Device 3 is not available.
[TensorRT-LLM][WARNING] Device 0 peer access Device 1 is not available.
[TensorRT-LLM][WARNING] Device 0 peer access Device 2 is not available.
[TensorRT-LLM][WARNING] Device 0 peer access Device 3 is not available.
Failed, NCCL error /home/jenkins/agent/workspace/LLM/main/L0_PostMerge/tensorrt_llm/cpp/tensorrt_llm/plugins/common/plugin.cpp:93 'unhandled system error (run with NCCL_DEBUG=INFO for details)'
Failed, NCCL error /home/jenkins/agent/workspace/LLM/main/L0_PostMerge/tensorrt_llm/cpp/tensorrt_llm/plugins/common/plugin.cpp:93 'unhandled system error (run with NCCL_DEBUG=INFO for details)'
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[20437,1],0]
  Exit code:    1
--------------------------------------------------------------------------

Additional notes

I was looking for a solution and found this: https://github.com/QwenLM/Qwen2/issues/712. The version there was 0.11, so I switched to tag v0.10.0 and hit the same problem. Then I tried the Triton Inference Server and got the same error:

NVIDIA Release 24.05 (build 95110614)
Triton Server Version 2.46.0


NOTE: CUDA Forward Compatibility mode ENABLED.
  Using CUDA 12.3 driver version 545.23.08 with kernel driver version 535.154.05.
  See https://docs.nvidia.com/deploy/cuda-compatibility/ for details.

root@iZuf62vrhc5baq8umqc6nsZ:/opt/tritonserver# python3 /opt/scripts/launch_triton_server.py --model_repo /all_models/inflight_batcher_llm --world_size 2
root@iZuf62vrhc5baq8umqc6nsZ:/opt/tritonserver# I0709 07:22:30.885703 377 pinned_memory_manager.cc:275] "Pinned memory pool is created at '0x7f16b6000000' with size 268435456"
I0709 07:22:30.886060 378 pinned_memory_manager.cc:275] "Pinned memory pool is created at '0x7f4386000000' with size 268435456"
I0709 07:22:30.887483 377 cuda_memory_manager.cc:107] "CUDA memory pool is created on device 0 with size 67108864"
I0709 07:22:30.887489 377 cuda_memory_manager.cc:107] "CUDA memory pool is created on device 1 with size 67108864"
I0709 07:22:30.887492 377 cuda_memory_manager.cc:107] "CUDA memory pool is created on device 2 with size 67108864"
I0709 07:22:30.887494 377 cuda_memory_manager.cc:107] "CUDA memory pool is created on device 3 with size 67108864"
I0709 07:22:30.887736 378 cuda_memory_manager.cc:107] "CUDA memory pool is created on device 0 with size 67108864"
I0709 07:22:30.887743 378 cuda_memory_manager.cc:107] "CUDA memory pool is created on device 1 with size 67108864"
I0709 07:22:30.887745 378 cuda_memory_manager.cc:107] "CUDA memory pool is created on device 2 with size 67108864"
I0709 07:22:30.887747 378 cuda_memory_manager.cc:107] "CUDA memory pool is created on device 3 with size 67108864"
W0709 07:22:31.115211 378 server.cc:251] "failed to enable peer access for some device pairs"
W0709 07:22:31.115479 377 server.cc:251] "failed to enable peer access for some device pairs"
[libprotobuf ERROR /tmp/tritonbuild/tritonserver/build/_deps/repo-third-party-build/grpc-repo/src/grpc/third_party/protobuf/src/google/protobuf/text_format.cc:335] Error parsing text-format inference.ModelConfig: 29:17: Expected integer, got: $
E0709 07:22:31.115832 378 model_repository_manager.cc:1371] "Poll failed for model directory 'tensorrt_llm': failed to read text proto from /all_models/inflight_batcher_llm/tensorrt_llm/config.pbtxt"
I0709 07:22:31.115859 378 server.cc:606] 
+------------------+------+
| Repository Agent | Path |
+------------------+------+
+------------------+------+

I0709 07:22:31.115870 378 server.cc:633] 
+---------+------+--------+
| Backend | Path | Config |
+---------+------+--------+
+---------+------+--------+

I0709 07:22:31.115878 378 server.cc:676] 
+-------+---------+--------+
| Model | Version | Status |
+-------+---------+--------+
+-------+---------+--------+

[libprotobuf ERROR /tmp/tritonbuild/tritonserver/build/_deps/repo-third-party-build/grpc-repo/src/grpc/third_party/protobuf/src/google/protobuf/text_format.cc:335] Error parsing text-format inference.ModelConfig: 29:17: Expected integer, got: $
E0709 07:22:31.116058 377 model_repository_manager.cc:1371] "Poll failed for model directory 'ensemble': failed to read text proto from /all_models/inflight_batcher_llm/ensemble/config.pbtxt"
[libprotobuf ERROR /tmp/tritonbuild/tritonserver/build/_deps/repo-third-party-build/grpc-repo/src/grpc/third_party/protobuf/src/google/protobuf/text_format.cc:335] Error parsing text-format inference.ModelConfig: 29:17: Expected integer, got: $
E0709 07:22:31.116113 377 model_repository_manager.cc:1371] "Poll failed for model directory 'postprocessing': failed to read text proto from /all_models/inflight_batcher_llm/postprocessing/config.pbtxt"
[libprotobuf ERROR /tmp/tritonbuild/tritonserver/build/_deps/repo-third-party-build/grpc-repo/src/grpc/third_party/protobuf/src/google/protobuf/text_format.cc:335] Error parsing text-format inference.ModelConfig: 29:17: Expected integer, got: $
E0709 07:22:31.116142 377 model_repository_manager.cc:1371] "Poll failed for model directory 'preprocessing': failed to read text proto from /all_models/inflight_batcher_llm/preprocessing/config.pbtxt"
[libprotobuf ERROR /tmp/tritonbuild/tritonserver/build/_deps/repo-third-party-build/grpc-repo/src/grpc/third_party/protobuf/src/google/protobuf/text_format.cc:335] Error parsing text-format inference.ModelConfig: 29:17: Expected integer, got: $
E0709 07:22:31.116181 377 model_repository_manager.cc:1371] "Poll failed for model directory 'tensorrt_llm': failed to read text proto from /all_models/inflight_batcher_llm/tensorrt_llm/config.pbtxt"
[libprotobuf ERROR /tmp/tritonbuild/tritonserver/build/_deps/repo-third-party-build/grpc-repo/src/grpc/third_party/protobuf/src/google/protobuf/text_format.cc:335] Error parsing text-format inference.ModelConfig: 29:17: Expected integer, got: $
E0709 07:22:31.116223 377 model_repository_manager.cc:1371] "Poll failed for model directory 'tensorrt_llm_bls': failed to read text proto from /all_models/inflight_batcher_llm/tensorrt_llm_bls/config.pbtxt"
I0709 07:22:31.116234 377 server.cc:606] 
+------------------+------+
| Repository Agent | Path |
+------------------+------+
+------------------+------+

I0709 07:22:31.116243 377 server.cc:633] 
+---------+------+--------+
| Backend | Path | Config |
+---------+------+--------+
+---------+------+--------+

I0709 07:22:31.116251 377 server.cc:676] 
+-------+---------+--------+
| Model | Version | Status |
+-------+---------+--------+
+-------+---------+--------+

I0709 07:22:31.173415 378 metrics.cc:877] "Collecting metrics for GPU 0: NVIDIA L20"
I0709 07:22:31.173440 378 metrics.cc:877] "Collecting metrics for GPU 1: NVIDIA L20"
I0709 07:22:31.173447 378 metrics.cc:877] "Collecting metrics for GPU 2: NVIDIA L20"
I0709 07:22:31.173453 378 metrics.cc:877] "Collecting metrics for GPU 3: NVIDIA L20"
I0709 07:22:31.173456 377 metrics.cc:877] "Collecting metrics for GPU 0: NVIDIA L20"
I0709 07:22:31.173478 377 metrics.cc:877] "Collecting metrics for GPU 1: NVIDIA L20"
I0709 07:22:31.173483 377 metrics.cc:877] "Collecting metrics for GPU 2: NVIDIA L20"
I0709 07:22:31.173487 377 metrics.cc:877] "Collecting metrics for GPU 3: NVIDIA L20"
I0709 07:22:31.202836 378 metrics.cc:770] "Collecting CPU metrics"
I0709 07:22:31.202923 378 tritonserver.cc:2557] 
+----------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Option                           | Value                                                                                                                                                                    |
+----------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| server_id                        | triton                                                                                                                                                                   |
| server_version                   | 2.46.0                                                                                                                                                                   |
| server_extensions                | classification sequence model_repository model_repository(unload_dependents) schedule_policy model_configuration system_shared_memory cuda_shared_memory binary_tensor_d |
|                                  | ata parameters statistics trace logging                                                                                                                                  |
| model_repository_path[0]         | /all_models/inflight_batcher_llm                                                                                                                                         |
| model_control_mode               | MODE_EXPLICIT                                                                                                                                                            |
| startup_models_0                 | tensorrt_llm                                                                                                                                                             |
| strict_model_config              | 1                                                                                                                                                                        |
| model_config_name                |                                                                                                                                                                          |
| rate_limit                       | OFF                                                                                                                                                                      |
| pinned_memory_pool_byte_size     | 268435456                                                                                                                                                                |
| cuda_memory_pool_byte_size{0}    | 67108864                                                                                                                                                                 |
| cuda_memory_pool_byte_size{1}    | 67108864                                                                                                                                                                 |
| cuda_memory_pool_byte_size{2}    | 67108864                                                                                                                                                                 |
| cuda_memory_pool_byte_size{3}    | 67108864                                                                                                                                                                 |
| min_supported_compute_capability | 6.0                                                                                                                                                                      |
| strict_readiness                 | 1                                                                                                                                                                        |
| exit_timeout                     | 30                                                                                                                                                                       |
| cache_enabled                    | 0                                                                                                                                                                        |
+----------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

I0709 07:22:31.202960 378 server.cc:307] "Waiting for in-flight requests to complete."
I0709 07:22:31.202964 378 server.cc:323] "Timeout 30: Found 0 model versions that have in-flight inferences"
I0709 07:22:31.202974 378 server.cc:338] "All models are stopped, unloading models"
I0709 07:22:31.202977 378 server.cc:347] "Timeout 30: Found 0 live models and 0 in-flight non-inference requests"
I0709 07:22:31.210250 377 metrics.cc:770] "Collecting CPU metrics"
I0709 07:22:31.210325 377 tritonserver.cc:2557] 
+----------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Option                           | Value                                                                                                                                                                    |
+----------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| server_id                        | triton                                                                                                                                                                   |
| server_version                   | 2.46.0                                                                                                                                                                   |
| server_extensions                | classification sequence model_repository model_repository(unload_dependents) schedule_policy model_configuration system_shared_memory cuda_shared_memory binary_tensor_d |
|                                  | ata parameters statistics trace logging                                                                                                                                  |
| model_repository_path[0]         | /all_models/inflight_batcher_llm                                                                                                                                         |
| model_control_mode               | MODE_NONE                                                                                                                                                                |
| strict_model_config              | 1                                                                                                                                                                        |
| model_config_name                |                                                                                                                                                                          |
| rate_limit                       | OFF                                                                                                                                                                      |
| pinned_memory_pool_byte_size     | 268435456                                                                                                                                                                |
| cuda_memory_pool_byte_size{0}    | 67108864                                                                                                                                                                 |
| cuda_memory_pool_byte_size{1}    | 67108864                                                                                                                                                                 |
| cuda_memory_pool_byte_size{2}    | 67108864                                                                                                                                                                 |
| cuda_memory_pool_byte_size{3}    | 67108864                                                                                                                                                                 |
| min_supported_compute_capability | 6.0                                                                                                                                                                      |
| strict_readiness                 | 1                                                                                                                                                                        |
| exit_timeout                     | 30                                                                                                                                                                       |
| cache_enabled                    | 0                                                                                                                                                                        |
+----------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

I0709 07:22:31.210353 377 server.cc:307] "Waiting for in-flight requests to complete."
I0709 07:22:31.210356 377 server.cc:323] "Timeout 30: Found 0 model versions that have in-flight inferences"
I0709 07:22:31.210363 377 server.cc:338] "All models are stopped, unloading models"
I0709 07:22:31.210365 377 server.cc:347] "Timeout 30: Found 0 live models and 0 in-flight non-inference requests"
error: creating server: Internal - failed to load all models
error: creating server: Internal - failed to load all models
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[14765,1],0]
  Exit code:    1
--------------------------------------------------------------------------
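
(Side note on the Triton attempt above: the repeated "Expected integer, got: $" parse errors suggest the ${...} placeholders in the config.pbtxt files under /all_models/inflight_batcher_llm were never substituted, which looks like a separate problem from the NCCL failure. If I read the tensorrtllm_backend setup correctly, those placeholders are filled in with tools/fill_template.py before launching the server; the following is only a sketch, and the exact parameter names vary between backend versions:

python3 tools/fill_template.py -i /all_models/inflight_batcher_llm/tensorrt_llm/config.pbtxt \
    triton_max_batch_size:64,decoupled_mode:False,max_beam_width:1,batching_strategy:inflight_fused_batching,engine_dir:/container_dir/Qwen/Qwen2-72B-Instruct-GPTQ-Int4-trtllm_engine_2gpu_Int4

and similarly for the preprocessing, postprocessing, ensemble, and tensorrt_llm_bls config.pbtxt files.)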
linchpinlin commented 1 month ago

I upgraded TensorRT-LLM to version 0.12.0.dev2024070900 and re-ran with NCCL debugging enabled:

export NCCL_DEBUG=INFO
mpirun -n 2 --allow-run-as-root python3 ../run.py --input_text "Hi. What's Your Name?" \
    --max_output_len=500 \
    --tokenizer_dir /container_dir/Qwen/models--Qwen--Qwen2-72B-Instruct-GPTQ-Int4/snapshots/6b82a333287651211b1cae443ff2d2a6802597b9/ \
    --engine_dir=/container_dir/Qwen/Qwen2-72B-Instruct-GPTQ-Int4-trtllm_engine_2gpu_Int4/

[TensorRT-LLM] TensorRT-LLM version: 0.12.0.dev2024070900
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[TensorRT-LLM][INFO] Engine version 0.12.0.dev2024070900 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] Engine version 0.12.0.dev2024070900 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] MPI size: 2, MPI local size: 2, rank: 0
[TensorRT-LLM][INFO] Engine version 0.12.0.dev2024070900 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] Engine version 0.12.0.dev2024070900 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] Engine version 0.12.0.dev2024070900 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] MPI size: 2, MPI local size: 2, rank: 1
[TensorRT-LLM][INFO] Engine version 0.12.0.dev2024070900 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] MPI size: 2, MPI local size: 2, rank: 0
[TensorRT-LLM][INFO] MPI size: 2, MPI local size: 2, rank: 1
[TensorRT-LLM][INFO] Rank 0 is using GPU 0
[TensorRT-LLM][INFO] Rank 1 is using GPU 1
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 256
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 256
[TensorRT-LLM][INFO] TRTGptModel maxBeamWidth: 1
[TensorRT-LLM][INFO] TRTGptModel maxSequenceLen: 32768
[TensorRT-LLM][INFO] TRTGptModel maxDraftLen: 0
[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: 32768
[TensorRT-LLM][INFO] TRTGptModel computeContextLogits: 0
[TensorRT-LLM][INFO] TRTGptModel computeGenerationLogits: 0
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0
[TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 1
[TensorRT-LLM][INFO] TRTGptModel maxNumTokens: 8192
[TensorRT-LLM][INFO] TRTGptModel maxInputLen: 8192 = min(maxSequenceLen - 1, maxNumTokens) since context FMHA and usePackedInput are enabled
[TensorRT-LLM][INFO] Capacity Scheduler Policy: GUARANTEED_NO_EVICT
[TensorRT-LLM][INFO] Context Chunking Scheduler Policy: None
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 256
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 256
[TensorRT-LLM][INFO] TRTGptModel maxBeamWidth: 1
[TensorRT-LLM][INFO] TRTGptModel maxSequenceLen: 32768
[TensorRT-LLM][INFO] TRTGptModel maxDraftLen: 0
[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: 32768
[TensorRT-LLM][INFO] TRTGptModel computeContextLogits: 0
[TensorRT-LLM][INFO] TRTGptModel computeGenerationLogits: 0
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0
[TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 1
[TensorRT-LLM][INFO] TRTGptModel maxNumTokens: 8192
[TensorRT-LLM][INFO] TRTGptModel maxInputLen: 8192 = min(maxSequenceLen - 1, maxNumTokens) since context FMHA and usePackedInput are enabled
[TensorRT-LLM][INFO] Capacity Scheduler Policy: GUARANTEED_NO_EVICT
[TensorRT-LLM][INFO] Context Chunking Scheduler Policy: None
[TensorRT-LLM][WARNING] Device 0 peer access Device 1 is not available.
[TensorRT-LLM][WARNING] Device 0 peer access Device 2 is not available.
[TensorRT-LLM][WARNING] Device 0 peer access Device 3 is not available.
[TensorRT-LLM][WARNING] Device 1 peer access Device 0 is not available.
[TensorRT-LLM][WARNING] Device 1 peer access Device 2 is not available.
[TensorRT-LLM][WARNING] Device 1 peer access Device 3 is not available.
[TensorRT-LLM][INFO] Loaded engine size: 20929 MiB
[TensorRT-LLM][INFO] Loaded engine size: 20929 MiB
eed4900070d2:17067:17067 [0] NCCL INFO Bootstrap : Using eth0:172.17.0.2<0>
eed4900070d2:17067:17067 [0] NCCL INFO cudaDriverVersion 12040
NCCL version 2.20.5+cuda12.4
eed4900070d2:17067:17067 [0] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
eed4900070d2:17067:17067 [0] NCCL INFO P2P plugin IBext_v8
eed4900070d2:17067:17067 [0] NCCL INFO NET/IB : No device found.
eed4900070d2:17067:17067 [0] NCCL INFO NET/IB : No device found.
eed4900070d2:17067:17067 [0] NCCL INFO NET/Socket : Using [0]eth0:172.17.0.2<0>
eed4900070d2:17067:17067 [0] NCCL INFO Using non-device net plugin version 0
eed4900070d2:17067:17067 [0] NCCL INFO Using network Socket
eed4900070d2:17068:17068 [1] NCCL INFO cudaDriverVersion 12040
eed4900070d2:17068:17068 [1] NCCL INFO Bootstrap : Using eth0:172.17.0.2<0>
eed4900070d2:17068:17068 [1] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
eed4900070d2:17068:17068 [1] NCCL INFO P2P plugin IBext_v8
eed4900070d2:17068:17068 [1] NCCL INFO NET/IB : No device found.
eed4900070d2:17068:17068 [1] NCCL INFO NET/IB : No device found.
eed4900070d2:17068:17068 [1] NCCL INFO NET/Socket : Using [0]eth0:172.17.0.2<0>
eed4900070d2:17068:17068 [1] NCCL INFO Using non-device net plugin version 0
eed4900070d2:17068:17068 [1] NCCL INFO Using network Socket
eed4900070d2:17068:17068 [1] NCCL INFO comm 0x561e5e75aa70 rank 1 nranks 2 cudaDev 1 nvmlDev 1 busId 40 commId 0xfae597cbcb6168e3 - Init START
eed4900070d2:17067:17067 [0] NCCL INFO comm 0x563e6b173910 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 30 commId 0xfae597cbcb6168e3 - Init START
eed4900070d2:17067:17067 [0] NCCL INFO comm 0x563e6b173910 rank 0 nRanks 2 nNodes 1 localRanks 2 localRank 0 MNNVL 0
eed4900070d2:17068:17068 [1] NCCL INFO comm 0x561e5e75aa70 rank 1 nRanks 2 nNodes 1 localRanks 2 localRank 1 MNNVL 0
eed4900070d2:17068:17068 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0
eed4900070d2:17068:17068 [1] NCCL INFO P2P Chunksize set to 131072
eed4900070d2:17067:17067 [0] NCCL INFO Channel 00/02 :    0   1
eed4900070d2:17067:17067 [0] NCCL INFO Channel 01/02 :    0   1
eed4900070d2:17067:17067 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1
eed4900070d2:17067:17067 [0] NCCL INFO P2P Chunksize set to 131072
eed4900070d2:17068:17068 [1] NCCL INFO Channel 00 : 1[1] -> 0[0] via SHM/direct/direct
eed4900070d2:17067:17067 [0] NCCL INFO Channel 00 : 0[0] -> 1[1] via SHM/direct/direct
eed4900070d2:17068:17068 [1] NCCL INFO Channel 01 : 1[1] -> 0[0] via SHM/direct/direct
eed4900070d2:17067:17067 [0] NCCL INFO Channel 01 : 0[0] -> 1[1] via SHM/direct/direct
eed4900070d2:17068:17068 [1] NCCL INFO Connected all rings
eed4900070d2:17068:17068 [1] NCCL INFO Connected all trees
eed4900070d2:17068:17068 [1] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
eed4900070d2:17068:17068 [1] NCCL INFO 2 coll channels, 0 collnet channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
eed4900070d2:17067:17067 [0] NCCL INFO Connected all rings
eed4900070d2:17067:17067 [0] NCCL INFO Connected all trees
eed4900070d2:17067:17067 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
eed4900070d2:17067:17067 [0] NCCL INFO 2 coll channels, 0 collnet channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer

eed4900070d2:17068:17339 [1] misc/shmutils.cc:72 NCCL WARN Error: failed to extend /dev/shm/nccl-sTZ9fa to 5767524 bytes

eed4900070d2:17068:17339 [1] misc/shmutils.cc:113 NCCL WARN Error while creating shared memory segment /dev/shm/nccl-sTZ9fa (size 5767520)

eed4900070d2:17067:17338 [0] misc/shmutils.cc:72 NCCL WARN Error: failed to extend /dev/shm/nccl-dP5qkD to 5767524 bytes

eed4900070d2:17067:17338 [0] misc/shmutils.cc:113 NCCL WARN Error while creating shared memory segment /dev/shm/nccl-dP5qkD (size 5767520)
eed4900070d2:17067:17338 [0] NCCL INFO proxy.cc:1252 -> 2
eed4900070d2:17067:17338 [0] NCCL INFO proxy.cc:1315 -> 2
eed4900070d2:17068:17339 [1] NCCL INFO proxy.cc:1252 -> 2
eed4900070d2:17068:17339 [1] NCCL INFO proxy.cc:1315 -> 2
eed4900070d2:17068:17068 [1] NCCL INFO proxy.cc:1064 -> 2
eed4900070d2:17068:17068 [1] NCCL INFO init.cc:1328 -> 2
eed4900070d2:17067:17067 [0] NCCL INFO proxy.cc:1064 -> 2
eed4900070d2:17067:17067 [0] NCCL INFO init.cc:1328 -> 2
eed4900070d2:17067:17067 [0] NCCL INFO init.cc:1501 -> 2
eed4900070d2:17067:17067 [0] NCCL INFO init.cc:1746 -> 2
eed4900070d2:17068:17068 [1] NCCL INFO init.cc:1501 -> 2
eed4900070d2:17068:17068 [1] NCCL INFO init.cc:1746 -> 2
eed4900070d2:17067:17067 [0] NCCL INFO init.cc:1784 -> 2
Failed, NCCL error /home/jenkins/agent/workspace/LLM/main/L0_MergeRequest/tensorrt_llm/cpp/tensorrt_llm/plugins/common/plugin.cpp:93 'unhandled system error (run with NCCL_DEBUG=INFO for details)'
eed4900070d2:17068:17068 [1] NCCL INFO init.cc:1784 -> 2
Failed, NCCL error /home/jenkins/agent/workspace/LLM/main/L0_MergeRequest/tensorrt_llm/cpp/tensorrt_llm/plugins/common/plugin.cpp:93 'unhandled system error (run with NCCL_DEBUG=INFO for details)'
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[63581,1],1]
  Exit code:    1
--------------------------------------------------------------------------
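
The NCCL warnings above ("failed to extend /dev/shm/nccl-..." / "Error while creating shared memory segment") make me suspect the container's /dev/shm is too small for NCCL's shared-memory transport between the two ranks. If that is the cause, relaunching the container with a larger shared-memory segment, or with the host IPC namespace, might avoid it. A sketch only, assuming the container was started with Docker's default 64 MB /dev/shm; <host_dir> is a placeholder for the actual host mount:

docker run --gpus all --shm-size=2g -v <host_dir>:/container_dir -it nvcr.io/nvidia/tensorrt:24.05-py3
# or share the host's /dev/shm instead of enlarging the container's:
docker run --gpus all --ipc=host -v <host_dir>:/container_dir -it nvcr.io/nvidia/tensorrt:24.05-py3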
NitinAggarwal1 commented 1 month ago

Hi @QiJune Is there a resolution for this one?

hjunjie0324 commented 3 weeks ago

Try adding "--use_custom_all_reduce disable" when doing trtllm-build. It works for me. I don't know exactly why it works; I guess it changes the way the GPUs communicate with each other.
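
For reference, the build command from the original report with that flag added would look roughly like this (same paths as above; the accepted flag value may differ between versions, so treat this as a sketch):

trtllm-build --checkpoint_dir /container_dir/Qwen/Qwen2-72B-Instruct-GPTQ-Int4-trtllm_checkpoint_2gpu_Int4 \
    --output_dir /container_dir/Qwen/Qwen2-72B-Instruct-GPTQ-Int4-trtllm_engine_2gpu_Int4 \
    --gemm_plugin float16 \
    --use_custom_all_reduce disable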

hjunjie0324 commented 3 weeks ago

Additionally, the "--use_custom_all_reduce" option has been removed in the latest TensorRT-LLM. I don't know why that was done.

byshiue commented 2 weeks ago

In the latest TensorRT-LLM, the all-reduce plugin automatically checks whether custom_all_reduce can be used, so users no longer need to set it up manually.

NVGaryJi commented 1 week ago

@linchpinlin @hjunjie0324 Can you verify with the latest TRT-LLM (without setting the --use_custom_all_reduce option) and see if it works? As byshiue explained, whether to use custom_all_reduce is now determined automatically.

dhruvmullick commented 16 hours ago

@byshiue is it possible to disable it, though? I'm facing similar problems with tp>1: https://github.com/triton-inference-server/tensorrtllm_backend/issues/577