NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

terminate called after throwing an instance of 'tensorrt_llm::common::TllmException' Assertion failed: Failed: MPI error #1826

Closed QLinfeng closed 3 months ago

QLinfeng commented 3 months ago

System Info

CentOS Linux release 7.9.2009

NVIDIA A40 × 4

llama-2-13b-hf

TensorRT-LLM version: 0.11.0.dev2024061800

Who can help?

No response

Information

Tasks

Reproduction

python3 convert_checkpoint.py --model_dir /media/Llama-2-13b-hf/ --output_dir /media/tllm_checkpoint_4gpu_tp4 --dtype float16 --tp_size 4

trtllm-build --checkpoint_dir /media/tllm_checkpoint_4gpu_tp4 --output_dir /media/llama-13b-engines-fp16-4gpu --gemm_plugin float16 --multi_block_mode enable

python3 run.py --engine_dir /media/llama-13b-engines-fp16-4gpu --max_output_len 1024 --tokenizer_dir /media/Llama-2-13b-hf/ --input_text "hello"

Expected behavior

Inference completes quickly and the model answers the input question correctly.

Actual behavior

terminate called after throwing an instance of 'tensorrt_llm::common::TllmException'
  what():  [TensorRT-LLM][ERROR] Assertion failed: Failed: MPI error /home/jenkins/agent/workspace/LLM/main/L0_PostMerge/tensorrt_llm/cpp/tensorrt_llm/common/mpiUtils.cpp:211 '6' (/home/jenkins/agent/workspace/LLM/main/L0_PostMerge/tensorrt_llm/cpp/tensorrt_llm/common/mpiUtils.cpp:211)
1 0x7fb4aebbeb71 tensorrt_llm::common::throwRuntimeError(char const*, int, std::string const&) + 82
2 0x7fb4aebbf999 /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libtensorrt_llm.so(+0x728999) [0x7fb4aebbf999]
3 0x7fb461502a4e /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libnvinfer_plugin_tensorrt_llm.so(+0x15da4e) [0x7fb461502a4e]
4 0x7fb4614edcec tensorrt_llm::plugins::AllreducePlugin::initialize() + 204
5 0x7fb580fb46e5 /usr/local/lib/python3.10/dist-packages/tensorrt_libs/libnvinfer.so.10(+0x108c6e5) [0x7fb580fb46e5]
6 0x7fb580f41de2 /usr/local/lib/python3.10/dist-packages/tensorrt_libs/libnvinfer.so.10(+0x1019de2) [0x7fb580f41de2]
7 0x7fb580f499dd /usr/local/lib/python3.10/dist-packages/tensorrt_libs/libnvinfer.so.10(+0x10219dd) [0x7fb580f499dd]
8 0x7fb580f4a93d /usr/local/lib/python3.10/dist-packages/tensorrt_libs/libnvinfer.so.10(+0x102293d) [0x7fb580f4a93d]
9 0x7fb580f4ada4 /usr/local/lib/python3.10/dist-packages/tensorrt_libs/libnvinfer.so.10(+0x1022da4) [0x7fb580f4ada4]
10 0x7fb580f7f7d0 /usr/local/lib/python3.10/dist-packages/tensorrt_libs/libnvinfer.so.10(+0x10577d0) [0x7fb580f7f7d0]
11 0x7fb580f80808 /usr/local/lib/python3.10/dist-packages/tensorrt_libs/libnvinfer.so.10(+0x1058808) [0x7fb580f80808]
12 0x7fb580f8090b /usr/local/lib/python3.10/dist-packages/tensorrt_libs/libnvinfer.so.10(+0x105890b) [0x7fb580f8090b]
13 0x7fb4b06ab004 tensorrt_llm::runtime::TllmRuntime::TllmRuntime(tensorrt_llm::runtime::RawEngine const&, nvinfer1::ILogger, float, bool) + 1348
14 0x7fb4b08c7c72 tensorrt_llm::batch_manager::TrtGptModelInflightBatching::TrtGptModelInflightBatching(std::shared_ptr, tensorrt_llm::runtime::ModelConfig const&, tensorrt_llm::runtime::WorldConfig const&, tensorrt_llm::runtime::RawEngine const&, bool, tensorrt_llm::batch_manager::TrtGptModelOptionalParams const&) + 962
15 0x7fb4b08eaaf4 tensorrt_llm::executor::Executor::Impl::createModel(tensorrt_llm::runtime::RawEngine const&, tensorrt_llm::runtime::ModelConfig const&, tensorrt_llm::runtime::WorldConfig const&, tensorrt_llm::executor::ExecutorConfig const&) + 420
16 0x7fb4b08eb388 tensorrt_llm::executor::Executor::Impl::loadModel(std::optional const&, std::optional<std::vector<unsigned char, std::allocator > > const&, tensorrt_llm::runtime::GptJsonConfig const&, tensorrt_llm::executor::ExecutorConfig const&, bool) + 1304
17 0x7fb4b08f0b04 tensorrt_llm::executor::Executor::Impl::Impl(std::filesystem::path const&, std::optional const&, tensorrt_llm::executor::ModelType, tensorrt_llm::executor::ExecutorConfig const&) + 1764
18 0x7fb4b08e5ad0 tensorrt_llm::executor::Executor::Executor(std::filesystem::path const&, tensorrt_llm::executor::ModelType, tensorrt_llm::executor::ExecutorConfig const&) + 64
19 0x7fb52bae8df2 /usr/local/lib/python3.10/dist-packages/tensorrt_llm/bindings.cpython-310-x86_64-linux-gnu.so(+0xb5df2) [0x7fb52bae8df2]
20 0x7fb52ba8beac /usr/local/lib/python3.10/dist-packages/tensorrt_llm/bindings.cpython-310-x86_64-linux-gnu.so(+0x58eac) [0x7fb52ba8beac]
21 0x55d6c2c6010e python3(+0x15a10e) [0x55d6c2c6010e]
22 0x55d6c2c56a7b _PyObject_MakeTpCall + 603
23 0x55d6c2c6ec20 python3(+0x168c20) [0x55d6c2c6ec20]
24 0x55d6c2c6b087 python3(+0x165087) [0x55d6c2c6b087]
25 0x55d6c2c56e2b python3(+0x150e2b) [0x55d6c2c56e2b]
26 0x7fb52ba8b4cb /usr/local/lib/python3.10/dist-packages/tensorrt_llm/bindings.cpython-310-x86_64-linux-gnu.so(+0x584cb) [0x7fb52ba8b4cb]
27 0x55d6c2c56a7b _PyObject_MakeTpCall + 603
28 0x55d6c2c4f629 _PyEval_EvalFrameDefault + 27257
29 0x55d6c2c6e7f1 python3(+0x1687f1) [0x55d6c2c6e7f1]
30 0x55d6c2c6f492 PyObject_Call + 290
31 0x55d6c2c4b5d7 _PyEval_EvalFrameDefault + 10791
32 0x55d6c2c609fc _PyFunction_Vectorcall + 124
33 0x55d6c2c4926d _PyEval_EvalFrameDefault + 1725
34 0x55d6c2c459c6 python3(+0x13f9c6) [0x55d6c2c459c6]
35 0x55d6c2d3b256 PyEval_EvalCode + 134
36 0x55d6c2d66108 python3(+0x260108) [0x55d6c2d66108]
37 0x55d6c2d5f9cb python3(+0x2599cb) [0x55d6c2d5f9cb]
38 0x55d6c2d65e55 python3(+0x25fe55) [0x55d6c2d65e55]
39 0x55d6c2d65338 _PyRun_SimpleFileObject + 424
40 0x55d6c2d64f83 _PyRun_AnyFileObject + 67
41 0x55d6c2d57a5e Py_RunMain + 702
42 0x55d6c2d2e02d Py_BytesMain + 45
43 0x7fb6e5dabd90 /lib/x86_64-linux-gnu/libc.so.6(+0x29d90) [0x7fb6e5dabd90]
44 0x7fb6e5dabe40 __libc_start_main + 128
45 0x55d6c2d2df25 _start + 37
*** Process received signal ***
Signal: Aborted (6)
Signal code: (-6)
[ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x7fb6e5dc4520]
[ 1] /lib/x86_64-linux-gnu/libc.so.6(pthread_kill+0x12c)[0x7fb6e5e189fc]
[ 2] /lib/x86_64-linux-gnu/libc.so.6(raise+0x16)[0x7fb6e5dc4476]
[ 3] /lib/x86_64-linux-gnu/libc.so.6(abort+0xd3)[0x7fb6e5daa7f3]
[ 4] /lib/x86_64-linux-gnu/libstdc++.so.6(+0xa2b9e)[0x7fb643676b9e]
[ 5] /lib/x86_64-linux-gnu/libstdc++.so.6(+0xae20c)[0x7fb64368220c]
[ 6] /lib/x86_64-linux-gnu/libstdc++.so.6(+0xad1e9)[0x7fb6436811e9]
[ 7] /lib/x86_64-linux-gnu/libstdc++.so.6(__gxx_personality_v0+0x99)[0x7fb643681959]
[ 8] /lib/x86_64-linux-gnu/libgcc_s.so.1(+0x16884)[0x7fb6e5ab4884]
[ 9] /lib/x86_64-linux-gnu/libgcc_s.so.1(_Unwind_Resume+0x12d)[0x7fb6e5ab52dd]
[10] /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libtensorrt_llm.so(+0x7289c0)[0x7fb4aebbf9c0]
[11] /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libnvinfer_plugin_tensorrt_llm.so(+0x15da4e)[0x7fb461502a4e]
[12] /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libnvinfer_plugin_tensorrt_llm.so(_ZN12tensorrt_llm7plugins15AllreducePlugin10initializeEv+0xcc)[0x7fb4614edcec]
[13] /usr/local/lib/python3.10/dist-packages/tensorrt_libs/libnvinfer.so.10(+0x108c6e5)[0x7fb580fb46e5]
[14] /usr/local/lib/python3.10/dist-packages/tensorrt_libs/libnvinfer.so.10(+0x1019de2)[0x7fb580f41de2]
[15] /usr/local/lib/python3.10/dist-packages/tensorrt_libs/libnvinfer.so.10(+0x10219dd)[0x7fb580f499dd]
[16] /usr/local/lib/python3.10/dist-packages/tensorrt_libs/libnvinfer.so.10(+0x102293d)[0x7fb580f4a93d]
[17] /usr/local/lib/python3.10/dist-packages/tensorrt_libs/libnvinfer.so.10(+0x1022da4)[0x7fb580f4ada4]
[18] /usr/local/lib/python3.10/dist-packages/tensorrt_libs/libnvinfer.so.10(+0x10577d0)[0x7fb580f7f7d0]
[19] /usr/local/lib/python3.10/dist-packages/tensorrt_libs/libnvinfer.so.10(+0x1058808)[0x7fb580f80808]
[20] /usr/local/lib/python3.10/dist-packages/tensorrt_libs/libnvinfer.so.10(+0x105890b)[0x7fb580f8090b]
[21] /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libtensorrt_llm.so(_ZN12tensorrt_llm7runtime11TllmRuntimeC1ERKNS0_9RawEngineEPN8nvinfer17ILoggerEfb+0x544)[0x7fb4b06ab004]
[22] /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libtensorrt_llm.so(_ZN12tensorrt_llm13batch_manager27TrtGptModelInflightBatchingC1ESt10shared_ptrIN8nvinfer17ILoggerEERKNS_7runtime11ModelConfigERKNS6_11WorldConfigERKNS6_9RawEngineEbRKNS0_25TrtGptModelOptionalParamsE+0x3c2)[0x7fb4b08c7c72]
[23] /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libtensorrt_llm.so(_ZN12tensorrt_llm8executor8Executor4Impl11createModelERKNS_7runtime9RawEngineERKNS3_11ModelConfigERKNS3_11WorldConfigERKNS0_14ExecutorConfigE+0x1a4)[0x7fb4b08eaaf4]
[24] /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libtensorrt_llm.so(_ZN12tensorrt_llm8executor8Executor4Impl9loadModelERKSt8optionalINSt10filesystem4pathEERKS3_ISt6vectorIhSaIhEEERKNS_7runtime13GptJsonConfigERKNS0_14ExecutorConfigEb+0x518)[0x7fb4b08eb388]
[25] /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libtensorrt_llm.so(_ZN12tensorrt_llm8executor8Executor4ImplC2ERKNSt10filesystem4pathERKSt8optionalIS4_ENS0_9ModelTypeERKNS0_14ExecutorConfigE+0x6e4)[0x7fb4b08f0b04]
[26] /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libtensorrt_llm.so(_ZN12tensorrt_llm8executor8ExecutorC2ERKNSt10filesystem4pathENS0_9ModelTypeERKNS0_14ExecutorConfigE+0x40)[0x7fb4b08e5ad0]
[27] /usr/local/lib/python3.10/dist-packages/tensorrt_llm/bindings.cpython-310-x86_64-linux-gnu.so(+0xb5df2)[0x7fb52bae8df2]
[28] /usr/local/lib/python3.10/dist-packages/tensorrt_llm/bindings.cpython-310-x86_64-linux-gnu.so(+0x58eac)[0x7fb52ba8beac]
[29] python3(+0x15a10e)[0x55d6c2c6010e]
*** End of error message ***
Aborted (core dumped)

additional notes

N/A

nv-guomingz commented 3 months ago

Please change python3 run.py --engine_dir /media/llama-13b-engines-fp16-4gpu --max_output_len 1024 --tokenizer_dir /media/Llama-2-13b-hf/ --input_text "hello"

to mpirun -n 4 --allow-run-as-root python3 run.py --engine_dir /media/llama-13b-engines-fp16-4gpu --max_output_len 1024 --tokenizer_dir /media/Llama-2-13b-hf/ --input_text "hello"
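
The engine above was built with --tp_size 4, so world_size = tp_size x pp_size = 4 and the runtime expects exactly four MPI ranks. As a quick sanity check (just a sketch reusing the paths from your commands), you can print the mapping that was baked into the engine:

python3 -c "import json; print(json.load(open('/media/llama-13b-engines-fp16-4gpu/config.json'))['pretrained_config']['mapping'])"

The value passed to mpirun -n should match the world_size reported there.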

QLinfeng commented 3 months ago

This is the problem I hit first; after I modified the parameters, I then got the MPI error reported above.

Traceback (most recent call last):
  File "/media/TensorRT-LLM-main/examples/run.py", line 503, in <module>
    main(args)
  File "/media/TensorRT-LLM-main/examples/run.py", line 340, in main
    runner = runner_cls.from_dir(**runner_kwargs)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/runtime/model_runner_cpp.py", line 202, in from_dir
    world_config = WorldConfig.mpi(tensor_parallelism=tp_size,
RuntimeError: [TensorRT-LLM][ERROR] Assertion failed: mpiSize == tp * pp (/home/jenkins/agent/workspace/LLM/main/L0_PostMerge/tensorrt_llm/cpp/tensorrt_llm/runtime/worldConfig.cpp:99)

"mapping": { "world_size": 1, "gpus_per_node": 8, "tp_size": 1, "pp_size": 1, "moe_tp_size": 4, "moe_ep_size": 1 }

I changed "tp_size": 4 to "tp_size": 1 and "world_size": 4 to "world_size": 1.

QLinfeng commented 3 months ago

@nv-guomingz I tried your command. With

"mapping": { "world_size": 1, "gpus_per_node": 8, "tp_size": 1, "pp_size": 1, "moe_tp_size": 4, "moe_ep_size": 1 }

I got this error:

Traceback (most recent call last):
  File "/media/TensorRT-LLM-main/examples/run.py", line 503, in <module>
    main(args)
  File "/media/TensorRT-LLM-main/examples/run.py", line 340, in main
    runner = runner_cls.from_dir(**runner_kwargs)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/runtime/model_runner_cpp.py", line 202, in from_dir
    world_config = WorldConfig.mpi(tensor_parallelism=tp_size,
RuntimeError: [TensorRT-LLM][ERROR] Assertion failed: mpiSize == tp * pp (/home/jenkins/agent/workspace/LLM/main/L0_PostMerge/tensorrt_llm/cpp/tensorrt_llm/runtime/worldConfig.cpp:99)
1 0x7ff3a6bbeb71 tensorrt_llm::common::throwRuntimeError(char const*, int, std::string const&) + 82
2 0x7ff3a6bdf365 /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libtensorrt_llm.so(+0x748365) [0x7ff3a6bdf365]
3 0x7ff423aa37a6 /usr/local/lib/python3.10/dist-packages/tensorrt_llm/bindings.cpython-310-x86_64-linux-gnu.so(+0x707a6) [0x7ff423aa37a6]
4 0x7ff423a8beac /usr/local/lib/python3.10/dist-packages/tensorrt_llm/bindings.cpython-310-x86_64-linux-gnu.so(+0x58eac) [0x7ff423a8beac]
5 0x561ca0b4b10e python3(+0x15a10e) [0x561ca0b4b10e]
6 0x561ca0b41a7b _PyObject_MakeTpCall + 603
7 0x561ca0b3b150 _PyEval_EvalFrameDefault + 30112
8 0x561ca0b597f1 python3(+0x1687f1) [0x561ca0b597f1]
9 0x561ca0b5a492 PyObject_Call + 290
10 0x561ca0b365d7 _PyEval_EvalFrameDefault + 10791
11 0x561ca0b4b9fc _PyFunction_Vectorcall + 124
12 0x561ca0b3426d _PyEval_EvalFrameDefault + 1725
13 0x561ca0b309c6 python3(+0x13f9c6) [0x561ca0b309c6]
14 0x561ca0c26256 PyEval_EvalCode + 134
15 0x561ca0c51108 python3(+0x260108) [0x561ca0c51108]
16 0x561ca0c4a9cb python3(+0x2599cb) [0x561ca0c4a9cb]
17 0x561ca0c50e55 python3(+0x25fe55) [0x561ca0c50e55]
18 0x561ca0c50338 _PyRun_SimpleFileObject + 424
19 0x561ca0c4ff83 _PyRun_AnyFileObject + 67
20 0x561ca0c42a5e Py_RunMain + 702
21 0x561ca0c1902d Py_BytesMain + 45
22 0x7ff58de52d90 /lib/x86_64-linux-gnu/libc.so.6(+0x29d90) [0x7ff58de52d90]
23 0x7ff58de52e40 __libc_start_main + 128
24 0x561ca0c18f25 _start + 37

QLinfeng commented 3 months ago

@nv-guomingz

"mapping": { "world_size": 4, "gpus_per_node": 8, "tp_size": 4, "pp_size": 1, "moe_tp_size": 4, "moe_ep_size": 1 } In this case, I got this error: [TensorRT-LLM] TensorRT-LLM version: 0.11.0.dev2024061800 [TensorRT-LLM] TensorRT-LLM version: 0.11.0.dev2024061800 [TensorRT-LLM] TensorRT-LLM version: 0.11.0.dev2024061800 [TensorRT-LLM] TensorRT-LLM version: 0.11.0.dev2024061800 [TensorRT-LLM][WARNING] Using an engine plan file across different models of devices is not recommended and is likely to affect performance or even cause errors. [TensorRT-LLM][WARNING] Using an engine plan file across different models of devices is not recommended and is likely to affect performance or even cause errors. Failed, NCCL error /home/jenkins/agent/workspace/LLM/main/L0_PostMerge/tensorrt_llm/cpp/tensorrt_llm/plugins/common/plugin.cpp:93 'unhandled system error (run with NCCL_DEBUG=INFO for details)' Failed, NCCL error /home/jenkins/agent/workspace/LLM/main/L0_PostMerge/tensorrt_llm/cpp/tensorrt_llm/plugins/common/plugin.cpp:93 'unhandled system error (run with NCCL_DEBUG=INFO for details)' Failed, NCCL error /home/jenkins/agent/workspace/LLM/main/L0_PostMerge/tensorrt_llm/cpp/tensorrt_llm/plugins/common/plugin.cpp:93 'unhandled system error (run with NCCL_DEBUG=INFO for details)'

Primary job terminated normally, but 1 process returned a non-zero exit code. Per user-direction, the job has been aborted.


mpirun detected that one or more processes exited with non-zero status, thus causing the job to be terminated. The first process to do so was:

Process name: [[9389,1],3] Exit code: 1

nv-guomingz commented 3 months ago

Hi @QLinfeng, may I know the exact output of the commands below, without any modification to the configuration file?

python3 convert_checkpoint.py --model_dir /media/Llama-2-13b-hf/ --output_dir /media/tllm_checkpoint_4gpu_tp4 --dtype float16 --tp_size 4

trtllm-build --checkpoint_dir /media/tllm_checkpoint_4gpu_tp4 --output_dir /media/llama-13b-engines-fp16-4gpu --gemm_plugin float16 --multi_block_mode enable

mpirun -n 4 --allow-run-as-root python3 run.py --engine_dir /media/llama-13b-engines-fp16-4gpu --max_output_len 1024 --tokenizer_dir /media/Llama-2-13b-hf/ --input_text "hello"

QLinfeng commented 3 months ago

@nv-guomingz

python3 convert_checkpoint.py --model_dir /media/Llama-2-13b-hf/ --output_dir /media/tllm_checkpoint_4gpu_tp4 --dtype float16 --tp_size 4

Total time of converting checkpoints: 00:00:42

trtllm-build --checkpoint_dir /media/tllm_checkpoint_4gpu_tp4 --output_dir /media/llama-13b-engines-fp16-4gpu --gemm_plugin float16 --multi_block_mode enable [TensorRT-LLM] TensorRT-LLM version: 0.11.0.dev2024061800 [06/24/2024-05:45:38] [TRT-LLM] [I] Set bert_attention_plugin to auto. [06/24/2024-05:45:38] [TRT-LLM] [I] Set gpt_attention_plugin to auto. [06/24/2024-05:45:38] [TRT-LLM] [I] Set gemm_plugin to float16. [06/24/2024-05:45:38] [TRT-LLM] [I] Set gemm_swiglu_plugin to None. [06/24/2024-05:45:38] [TRT-LLM] [I] Set nccl_plugin to auto. [06/24/2024-05:45:38] [TRT-LLM] [I] Set lookup_plugin to None. [06/24/2024-05:45:38] [TRT-LLM] [I] Set lora_plugin to None. [06/24/2024-05:45:38] [TRT-LLM] [I] Set moe_plugin to auto. [06/24/2024-05:45:38] [TRT-LLM] [I] Set mamba_conv1d_plugin to auto. [06/24/2024-05:45:38] [TRT-LLM] [I] Set context_fmha to True. [06/24/2024-05:45:38] [TRT-LLM] [I] Set context_fmha_fp32_acc to False. [06/24/2024-05:45:38] [TRT-LLM] [I] Set paged_kv_cache to True. [06/24/2024-05:45:38] [TRT-LLM] [I] Set remove_input_padding to True. [06/24/2024-05:45:38] [TRT-LLM] [I] Set use_custom_all_reduce to True. [06/24/2024-05:45:38] [TRT-LLM] [I] Set reduce_fusion to False. [06/24/2024-05:45:38] [TRT-LLM] [I] Set multi_block_mode to True. [06/24/2024-05:45:38] [TRT-LLM] [I] Set enable_xqa to True. [06/24/2024-05:45:38] [TRT-LLM] [I] Set attention_qk_half_accumulation to False. [06/24/2024-05:45:38] [TRT-LLM] [I] Set tokens_per_block to 64. [06/24/2024-05:45:38] [TRT-LLM] [I] Set use_paged_context_fmha to False. [06/24/2024-05:45:38] [TRT-LLM] [I] Set use_fp8_context_fmha to False. [06/24/2024-05:45:38] [TRT-LLM] [I] Set multiple_profiles to False. [06/24/2024-05:45:38] [TRT-LLM] [I] Set paged_state to True. [06/24/2024-05:45:38] [TRT-LLM] [I] Set streamingllm to False. [06/24/2024-05:45:38] [TRT-LLM] [W] remove_input_padding is enabled, while opt_num_tokens is not set, setting to max_batch_size*max_beam_width.

[06/24/2024-05:45:38] [TRT-LLM] [I] Set dtype to float16. [06/24/2024-05:45:38] [TRT] [I] [MemUsageChange] Init CUDA: CPU +13, GPU +0, now: CPU 146, GPU 4011 (MiB) [06/24/2024-05:45:41] [TRT] [I] [MemUsageChange] Init builder kernel library: CPU +1622, GPU +290, now: CPU 1915, GPU 4301 (MiB) [06/24/2024-05:45:41] [TRT] [W] profileSharing0806 is on by default in TensorRT 10.0. This flag is deprecated and has no effect. [06/24/2024-05:45:41] [TRT-LLM] [I] Set nccl_plugin to float16. [06/24/2024-05:45:41] [TRT-LLM] [I] Set use_custom_all_reduce to True. [06/24/2024-05:45:41] [TRT-LLM] [I] Build TensorRT engine Unnamed Network 0 [06/24/2024-05:45:41] [TRT] [W] Unused Input: position_ids [06/24/2024-05:45:41] [TRT] [W] Detected layernorm nodes in FP16. [06/24/2024-05:45:41] [TRT] [W] Running layernorm after self-attention in FP16 may cause overflow. Exporting the model to the latest available ONNX opset (later than opset 17) to use the INormalizationLayer, or forcing layernorm layers to run in FP32 precision can help with preserving accuracy. [06/24/2024-05:45:41] [TRT] [W] [RemoveDeadLayers] Input Tensor position_ids is unused or used only at compile-time, but is not being removed. [06/24/2024-05:45:41] [TRT] [I] Global timing cache in use. Profiling results in this builder pass will be stored. [06/24/2024-05:45:45] [TRT] [I] [GraphReduction] The approximate region cut reduction algorithm is called. [06/24/2024-05:45:45] [TRT] [I] Detected 15 inputs and 1 output network tensors. [06/24/2024-05:45:52] [TRT] [I] Total Host Persistent Memory: 138784 [06/24/2024-05:45:52] [TRT] [I] Total Device Persistent Memory: 0 [06/24/2024-05:45:52] [TRT] [I] Total Scratch Memory: 167804928 [06/24/2024-05:45:52] [TRT] [I] [BlockAssignment] Started assigning block shifts. This will take 778 steps to complete. [06/24/2024-05:45:52] [TRT] [I] [BlockAssignment] Algorithm ShiftNTopDown took 54.2169ms to assign 17 blocks to 778 nodes requiring 453023744 bytes. [06/24/2024-05:45:52] [TRT] [I] Total Activation Memory: 453022720 [06/24/2024-05:45:52] [TRT] [I] Total Weights Memory: 6756411392 [06/24/2024-05:45:52] [TRT] [I] Engine generation completed in 10.2923 seconds. [06/24/2024-05:45:52] [TRT] [I] [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 2 MiB, GPU 6444 MiB [06/24/2024-05:45:54] [TRT] [I] [MemUsageStats] Peak memory usage during Engine building and serialization: CPU: 16366 MiB [06/24/2024-05:45:55] [TRT-LLM] [I] Total time of building Unnamed Network 0: 00:00:13 [06/24/2024-05:45:55] [TRT] [I] Serialized 26 bytes of code generator cache. [06/24/2024-05:45:55] [TRT] [I] Serialized 165354 bytes of compilation cache. [06/24/2024-05:45:55] [TRT] [I] Serialized 13 timing cache entries [06/24/2024-05:45:55] [TRT-LLM] [I] Timing cache serialized to model.cache [06/24/2024-05:45:55] [TRT-LLM] [I] Serializing engine to /media/llama-13b-engines-fp16-4gpu/rank0.engine... [06/24/2024-05:45:57] [TRT-LLM] [I] Engine serialized. Total time: 00:00:02 [06/24/2024-05:45:58] [TRT-LLM] [I] Set dtype to float16. [06/24/2024-05:45:58] [TRT] [I] [MemUsageChange] Init CUDA: CPU +0, GPU +0, now: CPU 2019, GPU 4323 (MiB) [06/24/2024-05:45:58] [TRT] [W] profileSharing0806 is on by default in TensorRT 10.0. This flag is deprecated and has no effect. [06/24/2024-05:45:58] [TRT-LLM] [I] Set nccl_plugin to float16. [06/24/2024-05:45:58] [TRT-LLM] [I] Set use_custom_all_reduce to True. 
[06/24/2024-05:45:58] [TRT-LLM] [I] Build TensorRT engine Unnamed Network 0 [06/24/2024-05:45:58] [TRT] [W] Unused Input: position_ids [06/24/2024-05:45:58] [TRT] [W] Detected layernorm nodes in FP16. [06/24/2024-05:45:58] [TRT] [W] Running layernorm after self-attention in FP16 may cause overflow. Exporting the model to the latest available ONNX opset (later than opset 17) to use the INormalizationLayer, or forcing layernorm layers to run in FP32 precision can help with preserving accuracy. [06/24/2024-05:45:58] [TRT] [W] [RemoveDeadLayers] Input Tensor position_ids is unused or used only at compile-time, but is not being removed. [06/24/2024-05:45:58] [TRT] [I] Global timing cache in use. Profiling results in this builder pass will be stored. [06/24/2024-05:46:02] [TRT] [I] [GraphReduction] The approximate region cut reduction algorithm is called. [06/24/2024-05:46:02] [TRT] [I] Detected 15 inputs and 1 output network tensors. [06/24/2024-05:46:06] [TRT] [I] Total Host Persistent Memory: 138784 [06/24/2024-05:46:06] [TRT] [I] Total Device Persistent Memory: 0 [06/24/2024-05:46:06] [TRT] [I] Total Scratch Memory: 167804928 [06/24/2024-05:46:06] [TRT] [I] [BlockAssignment] Started assigning block shifts. This will take 778 steps to complete. [06/24/2024-05:46:06] [TRT] [I] [BlockAssignment] Algorithm ShiftNTopDown took 54.1244ms to assign 17 blocks to 778 nodes requiring 453023744 bytes. [06/24/2024-05:46:06] [TRT] [I] Total Activation Memory: 453022720 [06/24/2024-05:46:06] [TRT] [I] Total Weights Memory: 6756411392 [06/24/2024-05:46:06] [TRT] [I] Engine generation completed in 7.34736 seconds. [06/24/2024-05:46:06] [TRT] [I] [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 2 MiB, GPU 6444 MiB [06/24/2024-05:46:08] [TRT] [I] [MemUsageStats] Peak memory usage during Engine building and serialization: CPU: 22904 MiB [06/24/2024-05:46:09] [TRT-LLM] [I] Total time of building Unnamed Network 0: 00:00:10 [06/24/2024-05:46:09] [TRT-LLM] [I] Serializing engine to /media/llama-13b-engines-fp16-4gpu/rank1.engine... [06/24/2024-05:46:11] [TRT-LLM] [I] Engine serialized. Total time: 00:00:02 [06/24/2024-05:46:12] [TRT-LLM] [I] Set dtype to float16. [06/24/2024-05:46:12] [TRT] [I] [MemUsageChange] Init CUDA: CPU +0, GPU +0, now: CPU 2020, GPU 4323 (MiB) [06/24/2024-05:46:12] [TRT] [W] profileSharing0806 is on by default in TensorRT 10.0. This flag is deprecated and has no effect. [06/24/2024-05:46:12] [TRT-LLM] [I] Set nccl_plugin to float16. [06/24/2024-05:46:12] [TRT-LLM] [I] Set use_custom_all_reduce to True. [06/24/2024-05:46:12] [TRT-LLM] [I] Build TensorRT engine Unnamed Network 0 [06/24/2024-05:46:12] [TRT] [W] Unused Input: position_ids [06/24/2024-05:46:12] [TRT] [W] Detected layernorm nodes in FP16. [06/24/2024-05:46:12] [TRT] [W] Running layernorm after self-attention in FP16 may cause overflow. Exporting the model to the latest available ONNX opset (later than opset 17) to use the INormalizationLayer, or forcing layernorm layers to run in FP32 precision can help with preserving accuracy. [06/24/2024-05:46:12] [TRT] [W] [RemoveDeadLayers] Input Tensor position_ids is unused or used only at compile-time, but is not being removed. [06/24/2024-05:46:12] [TRT] [I] Global timing cache in use. Profiling results in this builder pass will be stored. [06/24/2024-05:46:16] [TRT] [I] [GraphReduction] The approximate region cut reduction algorithm is called. [06/24/2024-05:46:16] [TRT] [I] Detected 15 inputs and 1 output network tensors. 
[06/24/2024-05:46:20] [TRT] [I] Total Host Persistent Memory: 138784 [06/24/2024-05:46:20] [TRT] [I] Total Device Persistent Memory: 0 [06/24/2024-05:46:20] [TRT] [I] Total Scratch Memory: 167804928 [06/24/2024-05:46:20] [TRT] [I] [BlockAssignment] Started assigning block shifts. This will take 778 steps to complete. [06/24/2024-05:46:20] [TRT] [I] [BlockAssignment] Algorithm ShiftNTopDown took 54.7622ms to assign 17 blocks to 778 nodes requiring 453023744 bytes. [06/24/2024-05:46:20] [TRT] [I] Total Activation Memory: 453022720 [06/24/2024-05:46:20] [TRT] [I] Total Weights Memory: 6756411392 [06/24/2024-05:46:20] [TRT] [I] Engine generation completed in 7.60425 seconds. [06/24/2024-05:46:20] [TRT] [I] [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 2 MiB, GPU 6444 MiB [06/24/2024-05:46:23] [TRT] [I] [MemUsageStats] Peak memory usage during Engine building and serialization: CPU: 22921 MiB [06/24/2024-05:46:23] [TRT-LLM] [I] Total time of building Unnamed Network 0: 00:00:10 [06/24/2024-05:46:23] [TRT-LLM] [I] Serializing engine to /media/llama-13b-engines-fp16-4gpu/rank2.engine... [06/24/2024-05:46:26] [TRT-LLM] [I] Engine serialized. Total time: 00:00:02 [06/24/2024-05:46:27] [TRT-LLM] [I] Set dtype to float16. [06/24/2024-05:46:27] [TRT] [I] [MemUsageChange] Init CUDA: CPU +0, GPU +0, now: CPU 2021, GPU 4323 (MiB) [06/24/2024-05:46:27] [TRT] [W] profileSharing0806 is on by default in TensorRT 10.0. This flag is deprecated and has no effect. [06/24/2024-05:46:27] [TRT-LLM] [I] Set nccl_plugin to float16. [06/24/2024-05:46:27] [TRT-LLM] [I] Set use_custom_all_reduce to True. [06/24/2024-05:46:27] [TRT-LLM] [I] Build TensorRT engine Unnamed Network 0 [06/24/2024-05:46:27] [TRT] [W] Unused Input: position_ids [06/24/2024-05:46:27] [TRT] [W] Detected layernorm nodes in FP16. [06/24/2024-05:46:27] [TRT] [W] Running layernorm after self-attention in FP16 may cause overflow. Exporting the model to the latest available ONNX opset (later than opset 17) to use the INormalizationLayer, or forcing layernorm layers to run in FP32 precision can help with preserving accuracy. [06/24/2024-05:46:27] [TRT] [W] [RemoveDeadLayers] Input Tensor position_ids is unused or used only at compile-time, but is not being removed. [06/24/2024-05:46:27] [TRT] [I] Global timing cache in use. Profiling results in this builder pass will be stored. [06/24/2024-05:46:31] [TRT] [I] [GraphReduction] The approximate region cut reduction algorithm is called. [06/24/2024-05:46:31] [TRT] [I] Detected 15 inputs and 1 output network tensors. [06/24/2024-05:46:35] [TRT] [I] Total Host Persistent Memory: 138784 [06/24/2024-05:46:35] [TRT] [I] Total Device Persistent Memory: 0 [06/24/2024-05:46:35] [TRT] [I] Total Scratch Memory: 167804928 [06/24/2024-05:46:35] [TRT] [I] [BlockAssignment] Started assigning block shifts. This will take 778 steps to complete. [06/24/2024-05:46:35] [TRT] [I] [BlockAssignment] Algorithm ShiftNTopDown took 55.4726ms to assign 17 blocks to 778 nodes requiring 453023744 bytes. [06/24/2024-05:46:35] [TRT] [I] Total Activation Memory: 453022720 [06/24/2024-05:46:35] [TRT] [I] Total Weights Memory: 6756411392 [06/24/2024-05:46:35] [TRT] [I] Engine generation completed in 7.62432 seconds. 
[06/24/2024-05:46:35] [TRT] [I] [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 2 MiB, GPU 6444 MiB [06/24/2024-05:46:38] [TRT] [I] [MemUsageStats] Peak memory usage during Engine building and serialization: CPU: 22925 MiB [06/24/2024-05:46:38] [TRT-LLM] [I] Total time of building Unnamed Network 0: 00:00:10 [06/24/2024-05:46:38] [TRT-LLM] [I] Serializing engine to /media/llama-13b-engines-fp16-4gpu/rank3.engine... [06/24/2024-05:46:41] [TRT-LLM] [I] Engine serialized. Total time: 00:00:02 [06/24/2024-05:46:41] [TRT-LLM] [I] Total time of building all engines: 00:01:03

mpirun -n 4 --allow-run-as-root python3 run.py --engine_dir /media/llama-13b-engines-fp16-4gpu --max_output_len 1024 --tokenizer_dir /media/Llama-2-13b-hf/ --input_text "hello" [TensorRT-LLM] TensorRT-LLM version: 0.11.0.dev2024061800 [TensorRT-LLM] TensorRT-LLM version: 0.11.0.dev2024061800 [TensorRT-LLM] TensorRT-LLM version: 0.11.0.dev2024061800 [TensorRT-LLM] TensorRT-LLM version: 0.11.0.dev2024061800 [TensorRT-LLM][WARNING] Using an engine plan file across different models of devices is not recommended and is likely to affect performance or even cause errors. [TensorRT-LLM][WARNING] Using an engine plan file across different models of devices is not recommended and is likely to affect performance or even cause errors. dc69d25252pa:21856:21856 [0] NCCL INFO Bootstrap : Using eth0:172.17.0.4<0> dc69d25252pa:21856:21856 [0] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation dc69d25252pa:21856:21856 [0] NCCL INFO cudaDriverVersion 12030 NCCL version 2.20.5+cuda12.4 dc69d25252pa:21857:21857 [1] NCCL INFO cudaDriverVersion 12030 dc69d25252pa:21857:21857 [1] NCCL INFO Bootstrap : Using eth0:172.17.0.4<0> dc69d25252pa:21859:21859 [3] NCCL INFO cudaDriverVersion 12030 dc69d25252pa:21859:21859 [3] NCCL INFO Bootstrap : Using eth0:172.17.0.4<0> dc69d25252pa:21857:21857 [1] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation dc69d25252pa:21859:21859 [3] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation dc69d25252pa:21857:21857 [1] NCCL INFO NET/IB : No device found. dc69d25252pa:21857:21857 [1] NCCL INFO NET/Socket : Using [0]eth0:172.17.0.4<0> dc69d25252pa:21857:21857 [1] NCCL INFO Using non-device net plugin version 0 dc69d25252pa:21857:21857 [1] NCCL INFO Using network Socket dc69d25252pa:21859:21859 [3] NCCL INFO NET/IB : No device found. dc69d25252pa:21859:21859 [3] NCCL INFO NET/Socket : Using [0]eth0:172.17.0.4<0> dc69d25252pa:21859:21859 [3] NCCL INFO Using non-device net plugin version 0 dc69d25252pa:21859:21859 [3] NCCL INFO Using network Socket dc69d25252pa:21856:21856 [0] NCCL INFO NET/IB : No device found. dc69d25252pa:21856:21856 [0] NCCL INFO NET/Socket : Using [0]eth0:172.17.0.4<0> dc69d25252pa:21856:21856 [0] NCCL INFO Using non-device net plugin version 0 dc69d25252pa:21856:21856 [0] NCCL INFO Using network Socket dc69d25252pa:21858:21858 [2] NCCL INFO cudaDriverVersion 12030 dc69d25252pa:21858:21858 [2] NCCL INFO Bootstrap : Using eth0:172.17.0.4<0> dc69d25252pa:21858:21858 [2] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation dc69d25252pa:21858:21858 [2] NCCL INFO NET/IB : No device found. 
dc69d25252pa:21858:21858 [2] NCCL INFO NET/Socket : Using [0]eth0:172.17.0.4<0> dc69d25252pa:21858:21858 [2] NCCL INFO Using non-device net plugin version 0 dc69d25252pa:21858:21858 [2] NCCL INFO Using network Socket dc69d25252pa:21857:21857 [1] NCCL INFO comm 0x5615da582d70 rank 1 nranks 4 cudaDev 1 nvmlDev 1 busId 65000 commId 0x4a851099a3e07480 - Init START dc69d25252pa:21859:21859 [3] NCCL INFO comm 0x55994c807510 rank 3 nranks 4 cudaDev 3 nvmlDev 3 busId e3000 commId 0x4a851099a3e07480 - Init START dc69d25252pa:21858:21858 [2] NCCL INFO comm 0x55e752e1ae10 rank 2 nranks 4 cudaDev 2 nvmlDev 2 busId b1000 commId 0x4a851099a3e07480 - Init START dc69d25252pa:21856:21856 [0] NCCL INFO comm 0x5576ed7e1f10 rank 0 nranks 4 cudaDev 0 nvmlDev 0 busId 4b000 commId 0x4a851099a3e07480 - Init START dc69d25252pa:21858:21858 [2] NCCL INFO NVLS multicast support is not available on dev 2 dc69d25252pa:21857:21857 [1] NCCL INFO NVLS multicast support is not available on dev 1 dc69d25252pa:21856:21856 [0] NCCL INFO Setting affinity for GPU 0 to ffff,0000ffff dc69d25252pa:21856:21856 [0] NCCL INFO NVLS multicast support is not available on dev 0 dc69d25252pa:21859:21859 [3] NCCL INFO Setting affinity for GPU 3 to ffff0000,ffff0000 dc69d25252pa:21859:21859 [3] NCCL INFO NVLS multicast support is not available on dev 3 dc69d25252pa:21857:21857 [1] NCCL INFO comm 0x5615da582d70 rank 1 nRanks 4 nNodes 1 localRanks 4 localRank 1 MNNVL 0 dc69d25252pa:21857:21857 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0 dc69d25252pa:21857:21857 [1] NCCL INFO P2P Chunksize set to 131072 dc69d25252pa:21856:21856 [0] NCCL INFO comm 0x5576ed7e1f10 rank 0 nRanks 4 nNodes 1 localRanks 4 localRank 0 MNNVL 0 dc69d25252pa:21856:21856 [0] NCCL INFO Channel 00/02 : 0 1 2 3 dc69d25252pa:21856:21856 [0] NCCL INFO Channel 01/02 : 0 1 2 3 dc69d25252pa:21858:21858 [2] NCCL INFO comm 0x55e752e1ae10 rank 2 nRanks 4 nNodes 1 localRanks 4 localRank 2 MNNVL 0 dc69d25252pa:21858:21858 [2] NCCL INFO Trees [0] 3/-1/-1->2->1 [1] 3/-1/-1->2->1 dc69d25252pa:21858:21858 [2] NCCL INFO P2P Chunksize set to 131072 dc69d25252pa:21859:21859 [3] NCCL INFO comm 0x55994c807510 rank 3 nRanks 4 nNodes 1 localRanks 4 localRank 3 MNNVL 0 dc69d25252pa:21859:21859 [3] NCCL INFO Trees [0] -1/-1/-1->3->2 [1] -1/-1/-1->3->2 dc69d25252pa:21859:21859 [3] NCCL INFO P2P Chunksize set to 131072 dc69d25252pa:21856:21856 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 dc69d25252pa:21856:21856 [0] NCCL INFO P2P Chunksize set to 131072

dc69d25252pa:21858:21858 [2] misc/shmutils.cc:72 NCCL WARN Error: failed to extend /dev/shm/nccl-Y4njUk to 9637892 bytes

dc69d25252pa:21858:21858 [2] misc/shmutils.cc:113 NCCL WARN Error while creating shared memory segment /dev/shm/nccl-Y4njUk (size 9637888) dc69d25252pa:21858:21858 [2] NCCL INFO transport/shm.cc:114 -> 2

dc69d25252pa:21856:21856 [0] misc/shmutils.cc:72 NCCL WARN Error: failed to extend /dev/shm/nccl-YNr3AB to 9637892 bytes

dc69d25252pa:21856:21856 [0] misc/shmutils.cc:113 NCCL WARN Error while creating shared memory segment /dev/shm/nccl-YNr3AB (size 9637888) dc69d25252pa:21856:21856 [0] NCCL INFO transport/shm.cc:114 -> 2 dc69d25252pa:21856:21856 [0] NCCL INFO transport.cc:33 -> 2

dc69d25252pa:21857:21857 [1] misc/shmutils.cc:72 NCCL WARN Error: failed to extend /dev/shm/nccl-8EoRmY to 9637892 bytes

dc69d25252pa:21857:21857 [1] misc/shmutils.cc:113 NCCL WARN Error while creating shared memory segment /dev/shm/nccl-8EoRmY (size 9637888) dc69d25252pa:21857:21857 [1] NCCL INFO transport/shm.cc:114 -> 2 dc69d25252pa:21857:21857 [1] NCCL INFO transport.cc:33 -> 2 dc69d25252pa:21857:21857 [1] NCCL INFO transport.cc:113 -> 2

dc69d25252pa:21859:21859 [3] misc/shmutils.cc:72 NCCL WARN Error: failed to extend /dev/shm/nccl-u71YZA to 9637892 bytes

dc69d25252pa:21859:21859 [3] misc/shmutils.cc:113 NCCL WARN Error while creating shared memory segment /dev/shm/nccl-u71YZA (size 9637888) dc69d25252pa:21859:21859 [3] NCCL INFO transport/shm.cc:114 -> 2 dc69d25252pa:21859:21859 [3] NCCL INFO transport.cc:33 -> 2 dc69d25252pa:21859:21859 [3] NCCL INFO transport.cc:113 -> 2 dc69d25252pa:21859:21859 [3] NCCL INFO init.cc:1222 -> 2 dc69d25252pa:21858:21858 [2] NCCL INFO transport.cc:33 -> 2 dc69d25252pa:21858:21858 [2] NCCL INFO transport.cc:113 -> 2 dc69d25252pa:21858:21858 [2] NCCL INFO init.cc:1222 -> 2 dc69d25252pa:21858:21858 [2] NCCL INFO init.cc:1501 -> 2 dc69d25252pa:21858:21858 [2] NCCL INFO init.cc:1746 -> 2 dc69d25252pa:21857:21857 [1] NCCL INFO init.cc:1222 -> 2 dc69d25252pa:21857:21857 [1] NCCL INFO init.cc:1501 -> 2 dc69d25252pa:21857:21857 [1] NCCL INFO init.cc:1746 -> 2 dc69d25252pa:21859:21859 [3] NCCL INFO init.cc:1501 -> 2 dc69d25252pa:21859:21859 [3] NCCL INFO init.cc:1746 -> 2 dc69d25252pa:21856:21856 [0] NCCL INFO transport.cc:113 -> 2 dc69d25252pa:21856:21856 [0] NCCL INFO init.cc:1222 -> 2 dc69d25252pa:21856:21856 [0] NCCL INFO init.cc:1501 -> 2 dc69d25252pa:21856:21856 [0] NCCL INFO init.cc:1746 -> 2 dc69d25252pa:21857:21857 [1] NCCL INFO init.cc:1784 -> 2 Failed, NCCL error /home/jenkins/agent/workspace/LLM/main/L0_PostMerge/tensorrt_llm/cpp/tensorrt_llm/plugins/common/plugin.cpp:93 'unhandled system error (run with NCCL_DEBUG=INFO for details)' dc69d25252pa:21858:21858 [2] NCCL INFO init.cc:1784 -> 2 Failed, NCCL error /home/jenkins/agent/workspace/LLM/main/L0_PostMerge/tensorrt_llm/cpp/tensorrt_llm/plugins/common/plugin.cpp:93 'unhandled system error (run with NCCL_DEBUG=INFO for details)' dc69d25252pa:21859:21859 [3] NCCL INFO init.cc:1784 -> 2 Failed, NCCL error /home/jenkins/agent/workspace/LLM/main/L0_PostMerge/tensorrt_llm/cpp/tensorrt_llm/plugins/common/plugin.cpp:93 'unhandled system error (run with NCCL_DEBUG=INFO for details)' dc69d25252pa:21856:21856 [0] NCCL INFO init.cc:1784 -> 2 Failed, NCCL error /home/jenkins/agent/workspace/LLM/main/L0_PostMerge/tensorrt_llm/cpp/tensorrt_llm/plugins/common/plugin.cpp:93 'unhandled system error (run with NCCL_DEBUG=INFO for details)'

Primary job terminated normally, but 1 process returned a non-zero exit code. Per user-direction, the job has been aborted.


mpirun detected that one or more processes exited with non-zero status, thus causing the job to be terminated. The first process to do so was:

Process name: [[16192,1],1] Exit code: 1

QLinfeng commented 3 months ago

@nv-guomingz config.json { "version": "0.11.0.dev2024061800", "pretrained_config": { "mlp_bias": false, "attn_bias": false, "rotary_base": 10000.0, "rotary_scaling": null, "residual_mlp": false, "disable_weight_only_quant_plugin": false, "moe": { "num_experts": 0, "top_k": 0, "normalization_mode": null }, "architecture": "LlamaForCausalLM", "dtype": "float16", "vocab_size": 32000, "hidden_size": 5120, "num_hidden_layers": 40, "num_attention_heads": 40, "hidden_act": "silu", "logits_dtype": "float32", "norm_epsilon": 1e-05, "position_embedding_type": "rope_gpt_neox", "max_position_embeddings": 4096, "num_key_value_heads": 40, "intermediate_size": 13824, "mapping": { "world_size": 4, "gpus_per_node": 8, "tp_size": 4, "pp_size": 1, "moe_tp_size": 4, "moe_ep_size": 1 }, "quantization": { "quant_algo": null, "kv_cache_quant_algo": null, "group_size": 128, "smoothquant_val": null, "has_zero_point": false, "pre_quant_scale": false, "exclude_modules": null }, "use_parallel_embedding": false, "embedding_sharding_dim": 0, "share_embedding_table": false, "head_size": 128, "qk_layernorm": false }, "build_config": { "max_input_len": 1024, "max_seq_len": 2048, "opt_batch_size": null, "max_batch_size": 256, "max_beam_width": 1, "max_num_tokens": 8192, "opt_num_tokens": 256, "max_prompt_embedding_table_size": 0, "gather_context_logits": false, "gather_generation_logits": false, "strongly_typed": true, "builder_opt": null, "profiling_verbosity": "layer_names_only", "enable_debug_output": false, "max_draft_len": 0, "speculative_decoding_mode": 1, "use_refit": false, "input_timing_cache": null, "output_timing_cache": "model.cache", "lora_config": { "lora_dir": [], "lora_ckpt_source": "hf", "max_lora_rank": 64, "lora_target_modules": [], "trtllm_modules_to_hf_modules": {} }, "auto_parallel_config": { "world_size": 1, "gpus_per_node": 8, "cluster_key": "A40", "cluster_info": null, "sharding_cost_model": "alpha_beta", "comm_cost_model": "alpha_beta", "enable_pipeline_parallelism": false, "enable_shard_unbalanced_shape": false, "enable_shard_dynamic_shape": false, "enable_reduce_scatter": true, "builder_flags": null, "debug_mode": false, "infer_shape": true, "validation_mode": false, "same_buffer_io": { "past_keyvalue(\d+)": "present_keyvalue\1" }, "same_spec_io": {}, "sharded_io_allowlist": [ "past_keyvalue\d+", "present_keyvalue\d*" ], "fill_weights": false, "parallel_config_cache": null, "profile_cache": null, "dump_path": null, "debug_outputs": [] }, "weight_sparsity": false, "weight_streaming": false, "plugin_config": { "dtype": "float16", "bert_attention_plugin": "auto", "gpt_attention_plugin": "auto", "gemm_plugin": "float16", "gemm_swiglu_plugin": null, "smooth_quant_gemm_plugin": null, "identity_plugin": null, "layernorm_quantization_plugin": null, "rmsnorm_quantization_plugin": null, "nccl_plugin": "float16", "lookup_plugin": null, "lora_plugin": null, "weight_only_groupwise_quant_matmul_plugin": null, "weight_only_quant_matmul_plugin": null, "quantize_per_token_plugin": false, "quantize_tensor_plugin": false, "moe_plugin": "auto", "mamba_conv1d_plugin": "auto", "context_fmha": true, "context_fmha_fp32_acc": false, "paged_kv_cache": true, "remove_input_padding": true, "use_custom_all_reduce": true, "reduce_fusion": false, "multi_block_mode": true, "enable_xqa": true, "attention_qk_half_accumulation": false, "tokens_per_block": 64, "use_paged_context_fmha": false, "use_fp8_context_fmha": false, "multiple_profiles": false, "paged_state": true, "streamingllm": false }, "use_strip_plan": false, 
"max_encoder_input_len": 1024, "use_fused_mlp": false } }

nv-guomingz commented 3 months ago

Please increase your shared memory size and try again.

Here is a similar issue, https://github.com/NVIDIA/TensorRT-LLM/issues/1702#issue-2325029617
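
For reference, the NCCL warnings above show it failing to extend /dev/shm/nccl-* to roughly 9.6 MB per rank, which usually means the container's /dev/shm is too small. A rough sketch of the check and the fix when running inside Docker (the image name is a placeholder, flags and size are illustrative):

df -h /dev/shm

docker run --gpus all --ipc=host --shm-size=16g -it <your-tensorrt-llm-image> bash

Either --ipc=host or a larger --shm-size normally gives NCCL enough shared memory for the 4-rank run.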

QLinfeng commented 3 months ago

> Please increase your shared memory size and try again.
>
> Here is a similar issue, #1702 (comment)

This problem has been solved, no more errors. Great, thank you!!!

Pareek-Yash commented 2 months ago

Hey, I tried increasing shm-size to 24g and it still won't run. Here is my issue, #1950; I've updated it at the end of that issue.