QLinfeng closed this issue 3 months ago
Please change python3 run.py --engine_dir /media/llama-13b-engines-fp16-4gpu --max_output_len 1024 --tokenizer_dir /media/Llama-2-13b-hf/ --input_text "hello"
to
mpirun -n 4 --allow-run-as-root python3 run.py --engine_dir /media/llama-13b-engines-fp16-4gpu --max_output_len 1024 --tokenizer_dir /media/Llama-2-13b-hf/ --input_text "hello"
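The engine was built with tp_size 4, so it expects one MPI rank per GPU, which is why the plain `python3 run.py` invocation fails. As a minimal illustration (assuming OpenMPI's `mpirun`, which injects `OMPI_COMM_WORLD_*` variables into each launched process), a script can detect whether it was started under `mpirun` and with how many ranks:

```python
import os

def mpi_world_info():
    """Read the rank/world size that OpenMPI's mpirun injects into each process.

    Without mpirun these variables are absent, so the defaults (rank 0 of a
    world of size 1) explain why a tp_size=4 engine fails under a plain
    `python3 run.py` invocation: only one rank ever shows up.
    """
    size = int(os.environ.get("OMPI_COMM_WORLD_SIZE", "1"))
    rank = int(os.environ.get("OMPI_COMM_WORLD_RANK", "0"))
    return rank, size
```

Under `mpirun -n 4` each of the four processes would see a world size of 4 and ranks 0 through 3.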
I hit this problem first; after modifying the parameters as suggested, I then got an MPI error.
Traceback (most recent call last):
File "/media/TensorRT-LLM-main/examples/run.py", line 503, in
@nv-guomingz I tried your command.
When the config contained:
"mapping": {
  "world_size": 1,
  "gpus_per_node": 8,
  "tp_size": 1,
  "pp_size": 1,
  "moe_tp_size": 4,
  "moe_ep_size": 1
}
I got this error:
Traceback (most recent call last):
File "/media/TensorRT-LLM-main/examples/run.py", line 503, in
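A mapping like the one quoted above is internally inconsistent (world_size 1 but moe_tp_size 4). A small sketch of the consistency rules as I understand them; the exact constraints are assumptions based on how tensor/pipeline/MoE parallelism usually compose, so treat the TensorRT-LLM documentation as authoritative:

```python
def check_mapping(mapping: dict) -> list[str]:
    """Flag inconsistencies in a TensorRT-LLM-style 'mapping' block.

    Assumed constraints (not an official validator):
      - world_size == tp_size * pp_size
      - moe_tp_size * moe_ep_size == tp_size
    """
    problems = []
    if mapping["world_size"] != mapping["tp_size"] * mapping["pp_size"]:
        problems.append("world_size must equal tp_size * pp_size")
    if mapping["moe_tp_size"] * mapping["moe_ep_size"] != mapping["tp_size"]:
        problems.append("moe_tp_size * moe_ep_size must equal tp_size")
    return problems
```

On the mapping quoted above this flags the moe_tp_size mismatch, while the unmodified 4-GPU mapping (world_size 4, tp_size 4) passes cleanly.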
@nv-guomingz
mpirun detected that one or more processes exited with non-zero status, thus causing the job to be terminated. The first process to do so was:
Process name: [[9389,1],3] Exit code: 1
Hi @QLinfeng, may I know the exact output of the commands below, without any modification to the configuration file?
python3 convert_checkpoint.py --model_dir /media/Llama-2-13b-hf/ --output_dir /media/tllm_checkpoint_4gpu_tp4 --dtype float16 --tp_size 4
trtllm-build --checkpoint_dir /media/tllm_checkpoint_4gpu_tp4 --output_dir /media/llama-13b-engines-fp16-4gpu --gemm_plugin float16 --multi_block_mode enable
mpirun -n 4 --allow-run-as-root python3 run.py --engine_dir /media/llama-13b-engines-fp16-4gpu --max_output_len 1024 --tokenizer_dir /media/Llama-2-13b-hf/ --input_text "hello"
@nv-guomingz
python3 convert_checkpoint.py --model_dir /media/Llama-2-13b-hf/ --output_dir /media/tllm_checkpoint_4gpu_tp4 --dtype float16 --tp_size 4
Total time of converting checkpoints: 00:00:42
trtllm-build --checkpoint_dir /media/tllm_checkpoint_4gpu_tp4 --output_dir /media/llama-13b-engines-fp16-4gpu --gemm_plugin float16 --multi_block_mode enable
[TensorRT-LLM] TensorRT-LLM version: 0.11.0.dev2024061800
[06/24/2024-05:45:38] [TRT-LLM] [I] Set bert_attention_plugin to auto.
[06/24/2024-05:45:38] [TRT-LLM] [I] Set gpt_attention_plugin to auto.
[06/24/2024-05:45:38] [TRT-LLM] [I] Set gemm_plugin to float16.
[06/24/2024-05:45:38] [TRT-LLM] [I] Set gemm_swiglu_plugin to None.
[06/24/2024-05:45:38] [TRT-LLM] [I] Set nccl_plugin to auto.
[06/24/2024-05:45:38] [TRT-LLM] [I] Set lookup_plugin to None.
[06/24/2024-05:45:38] [TRT-LLM] [I] Set lora_plugin to None.
[06/24/2024-05:45:38] [TRT-LLM] [I] Set moe_plugin to auto.
[06/24/2024-05:45:38] [TRT-LLM] [I] Set mamba_conv1d_plugin to auto.
[06/24/2024-05:45:38] [TRT-LLM] [I] Set context_fmha to True.
[06/24/2024-05:45:38] [TRT-LLM] [I] Set context_fmha_fp32_acc to False.
[06/24/2024-05:45:38] [TRT-LLM] [I] Set paged_kv_cache to True.
[06/24/2024-05:45:38] [TRT-LLM] [I] Set remove_input_padding to True.
[06/24/2024-05:45:38] [TRT-LLM] [I] Set use_custom_all_reduce to True.
[06/24/2024-05:45:38] [TRT-LLM] [I] Set reduce_fusion to False.
[06/24/2024-05:45:38] [TRT-LLM] [I] Set multi_block_mode to True.
[06/24/2024-05:45:38] [TRT-LLM] [I] Set enable_xqa to True.
[06/24/2024-05:45:38] [TRT-LLM] [I] Set attention_qk_half_accumulation to False.
[06/24/2024-05:45:38] [TRT-LLM] [I] Set tokens_per_block to 64.
[06/24/2024-05:45:38] [TRT-LLM] [I] Set use_paged_context_fmha to False.
[06/24/2024-05:45:38] [TRT-LLM] [I] Set use_fp8_context_fmha to False.
[06/24/2024-05:45:38] [TRT-LLM] [I] Set multiple_profiles to False.
[06/24/2024-05:45:38] [TRT-LLM] [I] Set paged_state to True.
[06/24/2024-05:45:38] [TRT-LLM] [I] Set streamingllm to False.
[06/24/2024-05:45:38] [TRT-LLM] [W] remove_input_padding is enabled, while opt_num_tokens is not set, setting to max_batch_size*max_beam_width.
[06/24/2024-05:45:38] [TRT-LLM] [I] Set dtype to float16.
[06/24/2024-05:45:38] [TRT] [I] [MemUsageChange] Init CUDA: CPU +13, GPU +0, now: CPU 146, GPU 4011 (MiB)
[06/24/2024-05:45:41] [TRT] [I] [MemUsageChange] Init builder kernel library: CPU +1622, GPU +290, now: CPU 1915, GPU 4301 (MiB)
[06/24/2024-05:45:41] [TRT] [W] profileSharing0806 is on by default in TensorRT 10.0. This flag is deprecated and has no effect.
[06/24/2024-05:45:41] [TRT-LLM] [I] Set nccl_plugin to float16.
[06/24/2024-05:45:41] [TRT-LLM] [I] Set use_custom_all_reduce to True.
[06/24/2024-05:45:41] [TRT-LLM] [I] Build TensorRT engine Unnamed Network 0
[06/24/2024-05:45:41] [TRT] [W] Unused Input: position_ids
[06/24/2024-05:45:41] [TRT] [W] Detected layernorm nodes in FP16.
[06/24/2024-05:45:41] [TRT] [W] Running layernorm after self-attention in FP16 may cause overflow. Exporting the model to the latest available ONNX opset (later than opset 17) to use the INormalizationLayer, or forcing layernorm layers to run in FP32 precision can help with preserving accuracy.
[06/24/2024-05:45:41] [TRT] [W] [RemoveDeadLayers] Input Tensor position_ids is unused or used only at compile-time, but is not being removed.
[06/24/2024-05:45:41] [TRT] [I] Global timing cache in use. Profiling results in this builder pass will be stored.
[06/24/2024-05:45:45] [TRT] [I] [GraphReduction] The approximate region cut reduction algorithm is called.
[06/24/2024-05:45:45] [TRT] [I] Detected 15 inputs and 1 output network tensors.
[06/24/2024-05:45:52] [TRT] [I] Total Host Persistent Memory: 138784
[06/24/2024-05:45:52] [TRT] [I] Total Device Persistent Memory: 0
[06/24/2024-05:45:52] [TRT] [I] Total Scratch Memory: 167804928
[06/24/2024-05:45:52] [TRT] [I] [BlockAssignment] Started assigning block shifts. This will take 778 steps to complete.
[06/24/2024-05:45:52] [TRT] [I] [BlockAssignment] Algorithm ShiftNTopDown took 54.2169ms to assign 17 blocks to 778 nodes requiring 453023744 bytes.
[06/24/2024-05:45:52] [TRT] [I] Total Activation Memory: 453022720
[06/24/2024-05:45:52] [TRT] [I] Total Weights Memory: 6756411392
[06/24/2024-05:45:52] [TRT] [I] Engine generation completed in 10.2923 seconds.
[06/24/2024-05:45:52] [TRT] [I] [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 2 MiB, GPU 6444 MiB
[06/24/2024-05:45:54] [TRT] [I] [MemUsageStats] Peak memory usage during Engine building and serialization: CPU: 16366 MiB
[06/24/2024-05:45:55] [TRT-LLM] [I] Total time of building Unnamed Network 0: 00:00:13
[06/24/2024-05:45:55] [TRT] [I] Serialized 26 bytes of code generator cache.
[06/24/2024-05:45:55] [TRT] [I] Serialized 165354 bytes of compilation cache.
[06/24/2024-05:45:55] [TRT] [I] Serialized 13 timing cache entries
[06/24/2024-05:45:55] [TRT-LLM] [I] Timing cache serialized to model.cache
[06/24/2024-05:45:55] [TRT-LLM] [I] Serializing engine to /media/llama-13b-engines-fp16-4gpu/rank0.engine...
[06/24/2024-05:45:57] [TRT-LLM] [I] Engine serialized. Total time: 00:00:02
[06/24/2024-05:45:58] [TRT-LLM] [I] Set dtype to float16.
[06/24/2024-05:45:58] [TRT] [I] [MemUsageChange] Init CUDA: CPU +0, GPU +0, now: CPU 2019, GPU 4323 (MiB)
[06/24/2024-05:45:58] [TRT] [W] profileSharing0806 is on by default in TensorRT 10.0. This flag is deprecated and has no effect.
[06/24/2024-05:45:58] [TRT-LLM] [I] Set nccl_plugin to float16.
[06/24/2024-05:45:58] [TRT-LLM] [I] Set use_custom_all_reduce to True.
[06/24/2024-05:45:58] [TRT-LLM] [I] Build TensorRT engine Unnamed Network 0
[06/24/2024-05:45:58] [TRT] [W] Unused Input: position_ids
[06/24/2024-05:45:58] [TRT] [W] Detected layernorm nodes in FP16.
[06/24/2024-05:45:58] [TRT] [W] Running layernorm after self-attention in FP16 may cause overflow. Exporting the model to the latest available ONNX opset (later than opset 17) to use the INormalizationLayer, or forcing layernorm layers to run in FP32 precision can help with preserving accuracy.
[06/24/2024-05:45:58] [TRT] [W] [RemoveDeadLayers] Input Tensor position_ids is unused or used only at compile-time, but is not being removed.
[06/24/2024-05:45:58] [TRT] [I] Global timing cache in use. Profiling results in this builder pass will be stored.
[06/24/2024-05:46:02] [TRT] [I] [GraphReduction] The approximate region cut reduction algorithm is called.
[06/24/2024-05:46:02] [TRT] [I] Detected 15 inputs and 1 output network tensors.
[06/24/2024-05:46:06] [TRT] [I] Total Host Persistent Memory: 138784
[06/24/2024-05:46:06] [TRT] [I] Total Device Persistent Memory: 0
[06/24/2024-05:46:06] [TRT] [I] Total Scratch Memory: 167804928
[06/24/2024-05:46:06] [TRT] [I] [BlockAssignment] Started assigning block shifts. This will take 778 steps to complete.
[06/24/2024-05:46:06] [TRT] [I] [BlockAssignment] Algorithm ShiftNTopDown took 54.1244ms to assign 17 blocks to 778 nodes requiring 453023744 bytes.
[06/24/2024-05:46:06] [TRT] [I] Total Activation Memory: 453022720
[06/24/2024-05:46:06] [TRT] [I] Total Weights Memory: 6756411392
[06/24/2024-05:46:06] [TRT] [I] Engine generation completed in 7.34736 seconds.
[06/24/2024-05:46:06] [TRT] [I] [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 2 MiB, GPU 6444 MiB
[06/24/2024-05:46:08] [TRT] [I] [MemUsageStats] Peak memory usage during Engine building and serialization: CPU: 22904 MiB
[06/24/2024-05:46:09] [TRT-LLM] [I] Total time of building Unnamed Network 0: 00:00:10
[06/24/2024-05:46:09] [TRT-LLM] [I] Serializing engine to /media/llama-13b-engines-fp16-4gpu/rank1.engine...
[06/24/2024-05:46:11] [TRT-LLM] [I] Engine serialized. Total time: 00:00:02
[06/24/2024-05:46:12] [TRT-LLM] [I] Set dtype to float16.
[06/24/2024-05:46:12] [TRT] [I] [MemUsageChange] Init CUDA: CPU +0, GPU +0, now: CPU 2020, GPU 4323 (MiB)
[06/24/2024-05:46:12] [TRT] [W] profileSharing0806 is on by default in TensorRT 10.0. This flag is deprecated and has no effect.
[06/24/2024-05:46:12] [TRT-LLM] [I] Set nccl_plugin to float16.
[06/24/2024-05:46:12] [TRT-LLM] [I] Set use_custom_all_reduce to True.
[06/24/2024-05:46:12] [TRT-LLM] [I] Build TensorRT engine Unnamed Network 0
[06/24/2024-05:46:12] [TRT] [W] Unused Input: position_ids
[06/24/2024-05:46:12] [TRT] [W] Detected layernorm nodes in FP16.
[06/24/2024-05:46:12] [TRT] [W] Running layernorm after self-attention in FP16 may cause overflow. Exporting the model to the latest available ONNX opset (later than opset 17) to use the INormalizationLayer, or forcing layernorm layers to run in FP32 precision can help with preserving accuracy.
[06/24/2024-05:46:12] [TRT] [W] [RemoveDeadLayers] Input Tensor position_ids is unused or used only at compile-time, but is not being removed.
[06/24/2024-05:46:12] [TRT] [I] Global timing cache in use. Profiling results in this builder pass will be stored.
[06/24/2024-05:46:16] [TRT] [I] [GraphReduction] The approximate region cut reduction algorithm is called.
[06/24/2024-05:46:16] [TRT] [I] Detected 15 inputs and 1 output network tensors.
[06/24/2024-05:46:20] [TRT] [I] Total Host Persistent Memory: 138784
[06/24/2024-05:46:20] [TRT] [I] Total Device Persistent Memory: 0
[06/24/2024-05:46:20] [TRT] [I] Total Scratch Memory: 167804928
[06/24/2024-05:46:20] [TRT] [I] [BlockAssignment] Started assigning block shifts. This will take 778 steps to complete.
[06/24/2024-05:46:20] [TRT] [I] [BlockAssignment] Algorithm ShiftNTopDown took 54.7622ms to assign 17 blocks to 778 nodes requiring 453023744 bytes.
[06/24/2024-05:46:20] [TRT] [I] Total Activation Memory: 453022720
[06/24/2024-05:46:20] [TRT] [I] Total Weights Memory: 6756411392
[06/24/2024-05:46:20] [TRT] [I] Engine generation completed in 7.60425 seconds.
[06/24/2024-05:46:20] [TRT] [I] [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 2 MiB, GPU 6444 MiB
[06/24/2024-05:46:23] [TRT] [I] [MemUsageStats] Peak memory usage during Engine building and serialization: CPU: 22921 MiB
[06/24/2024-05:46:23] [TRT-LLM] [I] Total time of building Unnamed Network 0: 00:00:10
[06/24/2024-05:46:23] [TRT-LLM] [I] Serializing engine to /media/llama-13b-engines-fp16-4gpu/rank2.engine...
[06/24/2024-05:46:26] [TRT-LLM] [I] Engine serialized. Total time: 00:00:02
[06/24/2024-05:46:27] [TRT-LLM] [I] Set dtype to float16.
[06/24/2024-05:46:27] [TRT] [I] [MemUsageChange] Init CUDA: CPU +0, GPU +0, now: CPU 2021, GPU 4323 (MiB)
[06/24/2024-05:46:27] [TRT] [W] profileSharing0806 is on by default in TensorRT 10.0. This flag is deprecated and has no effect.
[06/24/2024-05:46:27] [TRT-LLM] [I] Set nccl_plugin to float16.
[06/24/2024-05:46:27] [TRT-LLM] [I] Set use_custom_all_reduce to True.
[06/24/2024-05:46:27] [TRT-LLM] [I] Build TensorRT engine Unnamed Network 0
[06/24/2024-05:46:27] [TRT] [W] Unused Input: position_ids
[06/24/2024-05:46:27] [TRT] [W] Detected layernorm nodes in FP16.
[06/24/2024-05:46:27] [TRT] [W] Running layernorm after self-attention in FP16 may cause overflow. Exporting the model to the latest available ONNX opset (later than opset 17) to use the INormalizationLayer, or forcing layernorm layers to run in FP32 precision can help with preserving accuracy.
[06/24/2024-05:46:27] [TRT] [W] [RemoveDeadLayers] Input Tensor position_ids is unused or used only at compile-time, but is not being removed.
[06/24/2024-05:46:27] [TRT] [I] Global timing cache in use. Profiling results in this builder pass will be stored.
[06/24/2024-05:46:31] [TRT] [I] [GraphReduction] The approximate region cut reduction algorithm is called.
[06/24/2024-05:46:31] [TRT] [I] Detected 15 inputs and 1 output network tensors.
[06/24/2024-05:46:35] [TRT] [I] Total Host Persistent Memory: 138784
[06/24/2024-05:46:35] [TRT] [I] Total Device Persistent Memory: 0
[06/24/2024-05:46:35] [TRT] [I] Total Scratch Memory: 167804928
[06/24/2024-05:46:35] [TRT] [I] [BlockAssignment] Started assigning block shifts. This will take 778 steps to complete.
[06/24/2024-05:46:35] [TRT] [I] [BlockAssignment] Algorithm ShiftNTopDown took 55.4726ms to assign 17 blocks to 778 nodes requiring 453023744 bytes.
[06/24/2024-05:46:35] [TRT] [I] Total Activation Memory: 453022720
[06/24/2024-05:46:35] [TRT] [I] Total Weights Memory: 6756411392
[06/24/2024-05:46:35] [TRT] [I] Engine generation completed in 7.62432 seconds.
[06/24/2024-05:46:35] [TRT] [I] [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 2 MiB, GPU 6444 MiB
[06/24/2024-05:46:38] [TRT] [I] [MemUsageStats] Peak memory usage during Engine building and serialization: CPU: 22925 MiB
[06/24/2024-05:46:38] [TRT-LLM] [I] Total time of building Unnamed Network 0: 00:00:10
[06/24/2024-05:46:38] [TRT-LLM] [I] Serializing engine to /media/llama-13b-engines-fp16-4gpu/rank3.engine...
[06/24/2024-05:46:41] [TRT-LLM] [I] Engine serialized. Total time: 00:00:02
[06/24/2024-05:46:41] [TRT-LLM] [I] Total time of building all engines: 00:01:03
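Before launching, it can help to confirm that the build actually produced one serialized engine per rank. A small hypothetical helper (the rankN.engine naming follows the serialization log above):

```python
import os

def missing_rank_engines(engine_dir: str, world_size: int) -> list[str]:
    """Return the rank engine files a multi-GPU run would need but which
    are absent from engine_dir.

    trtllm-build with tp_size 4 serializes rank0.engine .. rank3.engine,
    as seen in the build log; this helper just checks for those files.
    """
    expected = [f"rank{i}.engine" for i in range(world_size)]
    return [f for f in expected
            if not os.path.isfile(os.path.join(engine_dir, f))]
```

An empty return value for world_size 4 means all four engines are in place and the `mpirun -n 4` launch below should find them.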
mpirun -n 4 --allow-run-as-root python3 run.py --engine_dir /media/llama-13b-engines-fp16-4gpu --max_output_len 1024 --tokenizer_dir /media/Llama-2-13b-hf/ --input_text "hello"
[TensorRT-LLM] TensorRT-LLM version: 0.11.0.dev2024061800
[TensorRT-LLM] TensorRT-LLM version: 0.11.0.dev2024061800
[TensorRT-LLM] TensorRT-LLM version: 0.11.0.dev2024061800
[TensorRT-LLM] TensorRT-LLM version: 0.11.0.dev2024061800
[TensorRT-LLM][WARNING] Using an engine plan file across different models of devices is not recommended and is likely to affect performance or even cause errors.
[TensorRT-LLM][WARNING] Using an engine plan file across different models of devices is not recommended and is likely to affect performance or even cause errors.
dc69d25252pa:21856:21856 [0] NCCL INFO Bootstrap : Using eth0:172.17.0.4<0>
dc69d25252pa:21856:21856 [0] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
dc69d25252pa:21856:21856 [0] NCCL INFO cudaDriverVersion 12030
NCCL version 2.20.5+cuda12.4
dc69d25252pa:21857:21857 [1] NCCL INFO cudaDriverVersion 12030
dc69d25252pa:21857:21857 [1] NCCL INFO Bootstrap : Using eth0:172.17.0.4<0>
dc69d25252pa:21859:21859 [3] NCCL INFO cudaDriverVersion 12030
dc69d25252pa:21859:21859 [3] NCCL INFO Bootstrap : Using eth0:172.17.0.4<0>
dc69d25252pa:21857:21857 [1] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
dc69d25252pa:21859:21859 [3] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
dc69d25252pa:21857:21857 [1] NCCL INFO NET/IB : No device found.
dc69d25252pa:21857:21857 [1] NCCL INFO NET/Socket : Using [0]eth0:172.17.0.4<0>
dc69d25252pa:21857:21857 [1] NCCL INFO Using non-device net plugin version 0
dc69d25252pa:21857:21857 [1] NCCL INFO Using network Socket
dc69d25252pa:21859:21859 [3] NCCL INFO NET/IB : No device found.
dc69d25252pa:21859:21859 [3] NCCL INFO NET/Socket : Using [0]eth0:172.17.0.4<0>
dc69d25252pa:21859:21859 [3] NCCL INFO Using non-device net plugin version 0
dc69d25252pa:21859:21859 [3] NCCL INFO Using network Socket
dc69d25252pa:21856:21856 [0] NCCL INFO NET/IB : No device found.
dc69d25252pa:21856:21856 [0] NCCL INFO NET/Socket : Using [0]eth0:172.17.0.4<0>
dc69d25252pa:21856:21856 [0] NCCL INFO Using non-device net plugin version 0
dc69d25252pa:21856:21856 [0] NCCL INFO Using network Socket
dc69d25252pa:21858:21858 [2] NCCL INFO cudaDriverVersion 12030
dc69d25252pa:21858:21858 [2] NCCL INFO Bootstrap : Using eth0:172.17.0.4<0>
dc69d25252pa:21858:21858 [2] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
dc69d25252pa:21858:21858 [2] NCCL INFO NET/IB : No device found.
dc69d25252pa:21858:21858 [2] NCCL INFO NET/Socket : Using [0]eth0:172.17.0.4<0>
dc69d25252pa:21858:21858 [2] NCCL INFO Using non-device net plugin version 0
dc69d25252pa:21858:21858 [2] NCCL INFO Using network Socket
dc69d25252pa:21857:21857 [1] NCCL INFO comm 0x5615da582d70 rank 1 nranks 4 cudaDev 1 nvmlDev 1 busId 65000 commId 0x4a851099a3e07480 - Init START
dc69d25252pa:21859:21859 [3] NCCL INFO comm 0x55994c807510 rank 3 nranks 4 cudaDev 3 nvmlDev 3 busId e3000 commId 0x4a851099a3e07480 - Init START
dc69d25252pa:21858:21858 [2] NCCL INFO comm 0x55e752e1ae10 rank 2 nranks 4 cudaDev 2 nvmlDev 2 busId b1000 commId 0x4a851099a3e07480 - Init START
dc69d25252pa:21856:21856 [0] NCCL INFO comm 0x5576ed7e1f10 rank 0 nranks 4 cudaDev 0 nvmlDev 0 busId 4b000 commId 0x4a851099a3e07480 - Init START
dc69d25252pa:21858:21858 [2] NCCL INFO NVLS multicast support is not available on dev 2
dc69d25252pa:21857:21857 [1] NCCL INFO NVLS multicast support is not available on dev 1
dc69d25252pa:21856:21856 [0] NCCL INFO Setting affinity for GPU 0 to ffff,0000ffff
dc69d25252pa:21856:21856 [0] NCCL INFO NVLS multicast support is not available on dev 0
dc69d25252pa:21859:21859 [3] NCCL INFO Setting affinity for GPU 3 to ffff0000,ffff0000
dc69d25252pa:21859:21859 [3] NCCL INFO NVLS multicast support is not available on dev 3
dc69d25252pa:21857:21857 [1] NCCL INFO comm 0x5615da582d70 rank 1 nRanks 4 nNodes 1 localRanks 4 localRank 1 MNNVL 0
dc69d25252pa:21857:21857 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0
dc69d25252pa:21857:21857 [1] NCCL INFO P2P Chunksize set to 131072
dc69d25252pa:21856:21856 [0] NCCL INFO comm 0x5576ed7e1f10 rank 0 nRanks 4 nNodes 1 localRanks 4 localRank 0 MNNVL 0
dc69d25252pa:21856:21856 [0] NCCL INFO Channel 00/02 : 0 1 2 3
dc69d25252pa:21856:21856 [0] NCCL INFO Channel 01/02 : 0 1 2 3
dc69d25252pa:21858:21858 [2] NCCL INFO comm 0x55e752e1ae10 rank 2 nRanks 4 nNodes 1 localRanks 4 localRank 2 MNNVL 0
dc69d25252pa:21858:21858 [2] NCCL INFO Trees [0] 3/-1/-1->2->1 [1] 3/-1/-1->2->1
dc69d25252pa:21858:21858 [2] NCCL INFO P2P Chunksize set to 131072
dc69d25252pa:21859:21859 [3] NCCL INFO comm 0x55994c807510 rank 3 nRanks 4 nNodes 1 localRanks 4 localRank 3 MNNVL 0
dc69d25252pa:21859:21859 [3] NCCL INFO Trees [0] -1/-1/-1->3->2 [1] -1/-1/-1->3->2
dc69d25252pa:21859:21859 [3] NCCL INFO P2P Chunksize set to 131072
dc69d25252pa:21856:21856 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1
dc69d25252pa:21856:21856 [0] NCCL INFO P2P Chunksize set to 131072
dc69d25252pa:21858:21858 [2] misc/shmutils.cc:72 NCCL WARN Error: failed to extend /dev/shm/nccl-Y4njUk to 9637892 bytes
dc69d25252pa:21858:21858 [2] misc/shmutils.cc:113 NCCL WARN Error while creating shared memory segment /dev/shm/nccl-Y4njUk (size 9637888)
dc69d25252pa:21858:21858 [2] NCCL INFO transport/shm.cc:114 -> 2
dc69d25252pa:21856:21856 [0] misc/shmutils.cc:72 NCCL WARN Error: failed to extend /dev/shm/nccl-YNr3AB to 9637892 bytes
dc69d25252pa:21856:21856 [0] misc/shmutils.cc:113 NCCL WARN Error while creating shared memory segment /dev/shm/nccl-YNr3AB (size 9637888)
dc69d25252pa:21856:21856 [0] NCCL INFO transport/shm.cc:114 -> 2
dc69d25252pa:21856:21856 [0] NCCL INFO transport.cc:33 -> 2
dc69d25252pa:21857:21857 [1] misc/shmutils.cc:72 NCCL WARN Error: failed to extend /dev/shm/nccl-8EoRmY to 9637892 bytes
dc69d25252pa:21857:21857 [1] misc/shmutils.cc:113 NCCL WARN Error while creating shared memory segment /dev/shm/nccl-8EoRmY (size 9637888)
dc69d25252pa:21857:21857 [1] NCCL INFO transport/shm.cc:114 -> 2
dc69d25252pa:21857:21857 [1] NCCL INFO transport.cc:33 -> 2
dc69d25252pa:21857:21857 [1] NCCL INFO transport.cc:113 -> 2
dc69d25252pa:21859:21859 [3] misc/shmutils.cc:72 NCCL WARN Error: failed to extend /dev/shm/nccl-u71YZA to 9637892 bytes
mpirun detected that one or more processes exited with non-zero status, thus causing the job to be terminated. The first process to do so was:
Process name: [[16192,1],1] Exit code: 1
@nv-guomingz config.json:
{
  "version": "0.11.0.dev2024061800",
  "pretrained_config": {
    "mlp_bias": false,
    "attn_bias": false,
    "rotary_base": 10000.0,
    "rotary_scaling": null,
    "residual_mlp": false,
    "disable_weight_only_quant_plugin": false,
    "moe": { "num_experts": 0, "top_k": 0, "normalization_mode": null },
    "architecture": "LlamaForCausalLM",
    "dtype": "float16",
    "vocab_size": 32000,
    "hidden_size": 5120,
    "num_hidden_layers": 40,
    "num_attention_heads": 40,
    "hidden_act": "silu",
    "logits_dtype": "float32",
    "norm_epsilon": 1e-05,
    "position_embedding_type": "rope_gpt_neox",
    "max_position_embeddings": 4096,
    "num_key_value_heads": 40,
    "intermediate_size": 13824,
    "mapping": { "world_size": 4, "gpus_per_node": 8, "tp_size": 4, "pp_size": 1, "moe_tp_size": 4, "moe_ep_size": 1 },
    "quantization": { "quant_algo": null, "kv_cache_quant_algo": null, "group_size": 128, "smoothquant_val": null, "has_zero_point": false, "pre_quant_scale": false, "exclude_modules": null },
    "use_parallel_embedding": false,
    "embedding_sharding_dim": 0,
    "share_embedding_table": false,
    "head_size": 128,
    "qk_layernorm": false
  },
  "build_config": {
    "max_input_len": 1024,
    "max_seq_len": 2048,
    "opt_batch_size": null,
    "max_batch_size": 256,
    "max_beam_width": 1,
    "max_num_tokens": 8192,
    "opt_num_tokens": 256,
    "max_prompt_embedding_table_size": 0,
    "gather_context_logits": false,
    "gather_generation_logits": false,
    "strongly_typed": true,
    "builder_opt": null,
    "profiling_verbosity": "layer_names_only",
    "enable_debug_output": false,
    "max_draft_len": 0,
    "speculative_decoding_mode": 1,
    "use_refit": false,
    "input_timing_cache": null,
    "output_timing_cache": "model.cache",
    "lora_config": { "lora_dir": [], "lora_ckpt_source": "hf", "max_lora_rank": 64, "lora_target_modules": [], "trtllm_modules_to_hf_modules": {} },
    "auto_parallel_config": {
      "world_size": 1,
      "gpus_per_node": 8,
      "cluster_key": "A40",
      "cluster_info": null,
      "sharding_cost_model": "alpha_beta",
      "comm_cost_model": "alpha_beta",
      "enable_pipeline_parallelism": false,
      "enable_shard_unbalanced_shape": false,
      "enable_shard_dynamic_shape": false,
      "enable_reduce_scatter": true,
      "builder_flags": null,
      "debug_mode": false,
      "infer_shape": true,
      "validation_mode": false,
      "same_buffer_io": { "past_keyvalue(\d+)": "present_keyvalue\1" },
      "same_spec_io": {},
      "sharded_io_allowlist": [ "past_keyvalue\d+", "present_keyvalue\d*" ],
      "fill_weights": false,
      "parallel_config_cache": null,
      "profile_cache": null,
      "dump_path": null,
      "debug_outputs": []
    },
    "weight_sparsity": false,
    "weight_streaming": false,
    "plugin_config": {
      "dtype": "float16",
      "bert_attention_plugin": "auto",
      "gpt_attention_plugin": "auto",
      "gemm_plugin": "float16",
      "gemm_swiglu_plugin": null,
      "smooth_quant_gemm_plugin": null,
      "identity_plugin": null,
      "layernorm_quantization_plugin": null,
      "rmsnorm_quantization_plugin": null,
      "nccl_plugin": "float16",
      "lookup_plugin": null,
      "lora_plugin": null,
      "weight_only_groupwise_quant_matmul_plugin": null,
      "weight_only_quant_matmul_plugin": null,
      "quantize_per_token_plugin": false,
      "quantize_tensor_plugin": false,
      "moe_plugin": "auto",
      "mamba_conv1d_plugin": "auto",
      "context_fmha": true,
      "context_fmha_fp32_acc": false,
      "paged_kv_cache": true,
      "remove_input_padding": true,
      "use_custom_all_reduce": true,
      "reduce_fusion": false,
      "multi_block_mode": true,
      "enable_xqa": true,
      "attention_qk_half_accumulation": false,
      "tokens_per_block": 64,
      "use_paged_context_fmha": false,
      "use_fp8_context_fmha": false,
      "multiple_profiles": false,
      "paged_state": true,
      "streamingllm": false
    },
    "use_strip_plan": false,
    "max_encoder_input_len": 1024,
    "use_fused_mlp": false
  }
}
Please increase your shared memory size and try again.
Here is a similar issue, https://github.com/NVIDIA/TensorRT-LLM/issues/1702#issue-2325029617
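For context: inside Docker, /dev/shm defaults to 64 MiB, and each failing NCCL allocation in the log above was roughly 9.6 MB, so four ranks exhaust the default quickly. Relaunching the container with a larger segment, e.g. `docker run --shm-size=16g ...` (or remounting /dev/shm with a bigger tmpfs), avoids the error; the 16g figure is an illustrative choice, not an official requirement. A quick sketch for checking the current capacity from Python:

```python
import os

def shm_total_bytes(path: str = "/dev/shm") -> int:
    """Return the total size in bytes of the filesystem backing `path`,
    or 0 if the path does not exist.

    If this reports the Docker default of 64 MiB, NCCL's shared-memory
    transport for several ranks is likely to fail as in the log above.
    """
    if not os.path.isdir(path):
        return 0
    st = os.statvfs(path)
    return st.f_frsize * st.f_blocks
```

Comparing the returned value against the per-rank allocations NCCL reports is a rough way to decide how large --shm-size needs to be.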
The problem is solved; no more errors. Great, thank you!
Hey, I tried increasing --shm-size to 24g and it still won't run. Here is my issue, #1950; I've updated it at the end of the issue.
System Info
CentOS Linux release 7.9.2009
NVIDIA A40 × 4
llama-2-13b-hf
TensorRT-LLM version: 0.11.0.dev2024061800
Who can help?
No response
Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
Reproduction
python3 convert_checkpoint.py --model_dir /media/Llama-2-13b-hf/ --output_dir /media/tllm_checkpoint_4gpu_tp4 --dtype float16 --tp_size 4
trtllm-build --checkpoint_dir /media/tllm_checkpoint_4gpu_tp4 --output_dir /media/llama-13b-engines-fp16-4gpu --gemm_plugin float16 --multi_block_mode enable
python3 run.py --engine_dir /media/llama-13b-engines-fp16-4gpu --max_output_len 1024 --tokenizer_dir /media/Llama-2-13b-hf/ --input_text "hello"
Expected behavior
Reduced inference time and correct answers to the prompt.
Actual behavior
terminate called after throwing an instance of 'tensorrt_llm::common::TllmException'
  what():  [TensorRT-LLM][ERROR] Assertion failed: Failed: MPI error /home/jenkins/agent/workspace/LLM/main/L0_PostMerge/tensorrt_llm/cpp/tensorrt_llm/common/mpiUtils.cpp:211 '6' (/home/jenkins/agent/workspace/LLM/main/L0_PostMerge/tensorrt_llm/cpp/tensorrt_llm/common/mpiUtils.cpp:211)
1 0x7fb4aebbeb71 tensorrt_llm::common::throwRuntimeError(char const*, int, std::string const&) + 82
2 0x7fb4aebbf999 /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libtensorrt_llm.so(+0x728999) [0x7fb4aebbf999]
3 0x7fb461502a4e /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libnvinfer_plugin_tensorrt_llm.so(+0x15da4e) [0x7fb461502a4e]
4 0x7fb4614edcec tensorrt_llm::plugins::AllreducePlugin::initialize() + 204
5 0x7fb580fb46e5 /usr/local/lib/python3.10/dist-packages/tensorrt_libs/libnvinfer.so.10(+0x108c6e5) [0x7fb580fb46e5]
6 0x7fb580f41de2 /usr/local/lib/python3.10/dist-packages/tensorrt_libs/libnvinfer.so.10(+0x1019de2) [0x7fb580f41de2]
7 0x7fb580f499dd /usr/local/lib/python3.10/dist-packages/tensorrt_libs/libnvinfer.so.10(+0x10219dd) [0x7fb580f499dd]
8 0x7fb580f4a93d /usr/local/lib/python3.10/dist-packages/tensorrt_libs/libnvinfer.so.10(+0x102293d) [0x7fb580f4a93d]
9 0x7fb580f4ada4 /usr/local/lib/python3.10/dist-packages/tensorrt_libs/libnvinfer.so.10(+0x1022da4) [0x7fb580f4ada4]
10 0x7fb580f7f7d0 /usr/local/lib/python3.10/dist-packages/tensorrt_libs/libnvinfer.so.10(+0x10577d0) [0x7fb580f7f7d0]
11 0x7fb580f80808 /usr/local/lib/python3.10/dist-packages/tensorrt_libs/libnvinfer.so.10(+0x1058808) [0x7fb580f80808]
12 0x7fb580f8090b /usr/local/lib/python3.10/dist-packages/tensorrt_libs/libnvinfer.so.10(+0x105890b) [0x7fb580f8090b]
13 0x7fb4b06ab004 tensorrt_llm::runtime::TllmRuntime::TllmRuntime(tensorrt_llm::runtime::RawEngine const&, nvinfer1::ILogger*, float, bool) + 1348
14 0x7fb4b08c7c72 tensorrt_llm::batch_manager::TrtGptModelInflightBatching::TrtGptModelInflightBatching(std::shared_ptr<nvinfer1::ILogger>, tensorrt_llm::runtime::ModelConfig const&, tensorrt_llm::runtime::WorldConfig const&, tensorrt_llm::runtime::RawEngine const&, bool, tensorrt_llm::batch_manager::TrtGptModelOptionalParams const&) + 962
15 0x7fb4b08eaaf4 tensorrt_llm::executor::Executor::Impl::createModel(tensorrt_llm::runtime::RawEngine const&, tensorrt_llm::runtime::ModelConfig const&, tensorrt_llm::runtime::WorldConfig const&, tensorrt_llm::executor::ExecutorConfig const&) + 420
16 0x7fb4b08eb388 tensorrt_llm::executor::Executor::Impl::loadModel(std::optional const&, std::optional<std::vector<unsigned char, std::allocator > > const&, tensorrt_llm::runtime::GptJsonConfig const&, tensorrt_llm::executor::ExecutorConfig const&, bool) + 1304
17 0x7fb4b08f0b04 tensorrt_llm::executor::Executor::Impl::Impl(std::filesystem::path const&, std::optional const&, tensorrt_llm::executor::ModelType, tensorrt_llm::executor::ExecutorConfig const&) + 1764
18 0x7fb4b08e5ad0 tensorrt_llm::executor::Executor::Executor(std::filesystem::path const&, tensorrt_llm::executor::ModelType, tensorrt_llm::executor::ExecutorConfig const&) + 64
19 0x7fb52bae8df2 /usr/local/lib/python3.10/dist-packages/tensorrt_llm/bindings.cpython-310-x86_64-linux-gnu.so(+0xb5df2) [0x7fb52bae8df2]
20 0x7fb52ba8beac /usr/local/lib/python3.10/dist-packages/tensorrt_llm/bindings.cpython-310-x86_64-linux-gnu.so(+0x58eac) [0x7fb52ba8beac]
21 0x55d6c2c6010e python3(+0x15a10e) [0x55d6c2c6010e]
22 0x55d6c2c56a7b _PyObject_MakeTpCall + 603
23 0x55d6c2c6ec20 python3(+0x168c20) [0x55d6c2c6ec20]
24 0x55d6c2c6b087 python3(+0x165087) [0x55d6c2c6b087]
25 0x55d6c2c56e2b python3(+0x150e2b) [0x55d6c2c56e2b]
26 0x7fb52ba8b4cb /usr/local/lib/python3.10/dist-packages/tensorrt_llm/bindings.cpython-310-x86_64-linux-gnu.so(+0x584cb) [0x7fb52ba8b4cb]
27 0x55d6c2c56a7b _PyObject_MakeTpCall + 603
28 0x55d6c2c4f629 _PyEval_EvalFrameDefault + 27257
29 0x55d6c2c6e7f1 python3(+0x1687f1) [0x55d6c2c6e7f1]
30 0x55d6c2c6f492 PyObject_Call + 290
31 0x55d6c2c4b5d7 _PyEval_EvalFrameDefault + 10791
32 0x55d6c2c609fc _PyFunction_Vectorcall + 124
33 0x55d6c2c4926d _PyEval_EvalFrameDefault + 1725
34 0x55d6c2c459c6 python3(+0x13f9c6) [0x55d6c2c459c6]
35 0x55d6c2d3b256 PyEval_EvalCode + 134
36 0x55d6c2d66108 python3(+0x260108) [0x55d6c2d66108]
37 0x55d6c2d5f9cb python3(+0x2599cb) [0x55d6c2d5f9cb]
38 0x55d6c2d65e55 python3(+0x25fe55) [0x55d6c2d65e55]
39 0x55d6c2d65338 _PyRun_SimpleFileObject + 424
40 0x55d6c2d64f83 _PyRun_AnyFileObject + 67
41 0x55d6c2d57a5e Py_RunMain + 702
42 0x55d6c2d2e02d Py_BytesMain + 45
43 0x7fb6e5dabd90 /lib/x86_64-linux-gnu/libc.so.6(+0x29d90) [0x7fb6e5dabd90]
44 0x7fb6e5dabe40 __libc_start_main + 128
45 0x55d6c2d2df25 _start + 37
*** Process received signal ***
Signal: Aborted (6)
Signal code: (-6)
[ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x7fb6e5dc4520]
[ 1] /lib/x86_64-linux-gnu/libc.so.6(pthread_kill+0x12c)[0x7fb6e5e189fc]
[ 2] /lib/x86_64-linux-gnu/libc.so.6(raise+0x16)[0x7fb6e5dc4476]
[ 3] /lib/x86_64-linux-gnu/libc.so.6(abort+0xd3)[0x7fb6e5daa7f3]
[ 4] /lib/x86_64-linux-gnu/libstdc++.so.6(+0xa2b9e)[0x7fb643676b9e]
[ 5] /lib/x86_64-linux-gnu/libstdc++.so.6(+0xae20c)[0x7fb64368220c]
[ 6] /lib/x86_64-linux-gnu/libstdc++.so.6(+0xad1e9)[0x7fb6436811e9]
[ 7] /lib/x86_64-linux-gnu/libstdc++.so.6(__gxx_personality_v0+0x99)[0x7fb643681959]
[ 8] /lib/x86_64-linux-gnu/libgcc_s.so.1(+0x16884)[0x7fb6e5ab4884]
[ 9] /lib/x86_64-linux-gnu/libgcc_s.so.1(_Unwind_Resume+0x12d)[0x7fb6e5ab52dd]
[10] /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libtensorrt_llm.so(+0x7289c0)[0x7fb4aebbf9c0]
[11] /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libnvinfer_plugin_tensorrt_llm.so(+0x15da4e)[0x7fb461502a4e]
[12] /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libnvinfer_plugin_tensorrt_llm.so(_ZN12tensorrt_llm7plugins15AllreducePlugin10initializeEv+0xcc)[0x7fb4614edcec]
[13] /usr/local/lib/python3.10/dist-packages/tensorrt_libs/libnvinfer.so.10(+0x108c6e5)[0x7fb580fb46e5]
[14] /usr/local/lib/python3.10/dist-packages/tensorrt_libs/libnvinfer.so.10(+0x1019de2)[0x7fb580f41de2]
[15] /usr/local/lib/python3.10/dist-packages/tensorrt_libs/libnvinfer.so.10(+0x10219dd)[0x7fb580f499dd]
[16] /usr/local/lib/python3.10/dist-packages/tensorrt_libs/libnvinfer.so.10(+0x102293d)[0x7fb580f4a93d]
[17] /usr/local/lib/python3.10/dist-packages/tensorrt_libs/libnvinfer.so.10(+0x1022da4)[0x7fb580f4ada4]
[18] /usr/local/lib/python3.10/dist-packages/tensorrt_libs/libnvinfer.so.10(+0x10577d0)[0x7fb580f7f7d0]
[19] /usr/local/lib/python3.10/dist-packages/tensorrt_libs/libnvinfer.so.10(+0x1058808)[0x7fb580f80808]
[20] /usr/local/lib/python3.10/dist-packages/tensorrt_libs/libnvinfer.so.10(+0x105890b)[0x7fb580f8090b]
[21] /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libtensorrt_llm.so(_ZN12tensorrt_llm7runtime11TllmRuntimeC1ERKNS0_9RawEngineEPN8nvinfer17ILoggerEfb+0x544)[0x7fb4b06ab004]
[22] /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libtensorrt_llm.so(_ZN12tensorrt_llm13batch_manager27TrtGptModelInflightBatchingC1ESt10shared_ptrIN8nvinfer17ILoggerEERKNS_7runtime11ModelConfigERKNS6_11WorldConfigERKNS6_9RawEngineEbRKNS0_25TrtGptModelOptionalParamsE+0x3c2)[0x7fb4b08c7c72]
[23] /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libtensorrt_llm.so(_ZN12tensorrt_llm8executor8Executor4Impl11createModelERKNS_7runtime9RawEngineERKNS3_11ModelConfigERKNS3_11WorldConfigERKNS0_14ExecutorConfigE+0x1a4)[0x7fb4b08eaaf4]
[24] /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libtensorrt_llm.so(_ZN12tensorrt_llm8executor8Executor4Impl9loadModelERKSt8optionalINSt10filesystem4pathEERKS3_ISt6vectorIhSaIhEEERKNS_7runtime13GptJsonConfigERKNS0_14ExecutorConfigEb+0x518)[0x7fb4b08eb388]
[25] /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libtensorrt_llm.so(_ZN12tensorrt_llm8executor8Executor4ImplC2ERKNSt10filesystem4pathERKSt8optionalIS4_ENS0_9ModelTypeERKNS0_14ExecutorConfigE+0x6e4)[0x7fb4b08f0b04]
[26] /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libtensorrt_llm.so(_ZN12tensorrt_llm8executor8ExecutorC2ERKNSt10filesystem4pathENS0_9ModelTypeERKNS0_14ExecutorConfigE+0x40)[0x7fb4b08e5ad0]
[27] /usr/local/lib/python3.10/dist-packages/tensorrt_llm/bindings.cpython-310-x86_64-linux-gnu.so(+0xb5df2)[0x7fb52bae8df2]
[28] /usr/local/lib/python3.10/dist-packages/tensorrt_llm/bindings.cpython-310-x86_64-linux-gnu.so(+0x58eac)[0x7fb52ba8beac]
[29] python3(+0x15a10e)[0x55d6c2c6010e]
*** End of error message ***
Aborted (core dumped)
Additional notes
N/A