NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

RuntimeError: Can't enable access between nodes 1 and 0 #2066

Open · EASTERNTIGER opened this issue 3 months ago

EASTERNTIGER commented 3 months ago

Hi, I tried to convert a T5 model to TensorRT on a machine with 4 GPUs. In the `python convert_checkpoint.py` step I set `tp_size=4`, `pp_size=1`, and the TensorRT model was built successfully. However, when I ran `mpirun --allow-run-as-root -np 4 python3 run.py`, I got these errors:

(screenshot: `RuntimeError: Can't enable access between nodes 1 and 0`)

When I set `tp_size=1`, `pp_size=1` in the `python convert_checkpoint.py` step, I can run `python3 run.py` successfully. So how can I fix this problem? It seems to be related to the GPU setup, but I don't know what to change. I also found a similar issue (screenshot), but when I added `--use_custom_all_reduce disable` to `trtllm-build`, it reported unrecognized arguments (screenshot of the `unrecognized arguments` error).
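For reference, whether each GPU pair supports peer-to-peer access can be checked with a short script. This is a minimal diagnostic sketch, not part of TensorRT-LLM itself; it assumes the `cuda-python` package (a TensorRT-LLM dependency) is available:

```python
# Minimal diagnostic: report which local GPU pairs support peer-to-peer access.
# Assumes the cuda-python package (a TensorRT-LLM dependency) is installed.
from cuda import cudart

_, device_count = cudart.cudaGetDeviceCount()
for src in range(device_count):
    for dest in range(device_count):
        if src == dest:
            continue
        # cudaDeviceCanAccessPeer returns an (error, can_access) tuple in cuda-python
        _, can_access = cudart.cudaDeviceCanAccessPeer(src, dest)
        status = "supported" if can_access else "NOT supported"
        print(f"GPU {src} -> GPU {dest}: peer access {status}")
```

If any pair prints "NOT supported", the `RuntimeError` above is consistent with the hardware lacking P2P between those GPUs rather than with the model conversion itself.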

OptimusV5 commented 3 months ago

Same problem; it seems this argument has been removed (#2008).

Kefeng-Duan commented 2 months ago

Hi @OptimusV5 @EASTERNTIGER, could you try removing this line: https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/_ipc_utils.py#L42
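For context, the linked line is the `RuntimeError` raise inside the peer-access setup. A rough sketch of the surrounding logic, reconstructed from the error message rather than copied from the source (names and structure are approximate):

```python
# Rough reconstruction of the peer-access setup around _ipc_utils.py#L42
# (approximate, not the verbatim source). The suggestion above amounts to
# deleting the `raise` so setup can proceed on GPUs without P2P support.
from cuda import cudart

def set_peer_access(device_ids):
    for src_node in device_ids:
        for dest_node in device_ids:
            if src_node == dest_node:
                continue
            _, can_access = cudart.cudaDeviceCanAccessPeer(src_node, dest_node)
            if not can_access:
                # This is the kind of check the referenced line aborts on.
                raise RuntimeError(
                    f"Can't enable access between nodes {dest_node} and {src_node}")
```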

yuxianq commented 2 months ago

@EASTERNTIGER @OptimusV5 This bug is known and has already been fixed in both the main branch and v0.12. You can validate with the main branch now, or wait for the v0.12 release.
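A quick way to confirm which build is installed (the `tensorrt_llm.__version__` attribute is exposed by the package, though the exact version string of a main-branch build may vary):

```python
# Print the installed TensorRT-LLM version; the fix should be present in
# v0.12 or a sufficiently recent main-branch build.
import tensorrt_llm
print(tensorrt_llm.__version__)
```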

Kefeng-Duan commented 2 months ago

@EASTERNTIGER @OptimusV5 It seems to be fixed here: https://github.com/NVIDIA/TensorRT-LLM/blob/main/cpp/tensorrt_llm/runtime/ipcUtils.cpp#L47. Please update your code and verify.
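For readers who want the gist without reading the C++: the change appears to guard peer-access enablement with a capability check instead of failing outright. A hypothetical Python rendering of that shape (the actual fix lives in the linked `ipcUtils.cpp`; this sketch is only an approximation):

```python
# Hypothetical Python rendering of a guarded enable (the real fix is in the
# C++ runtime linked above): skip GPU pairs that lack P2P support instead of
# raising, so multi-GPU runs without NVLink/P2P can still proceed.
from cuda import cudart

def enable_peer_access_if_supported(src: int, dest: int) -> bool:
    """Enable src -> dest peer access only when the hardware supports it."""
    _, can_access = cudart.cudaDeviceCanAccessPeer(src, dest)
    if not can_access:
        return False  # fall back instead of raising
    cudart.cudaSetDevice(src)
    err, = cudart.cudaDeviceEnablePeerAccess(dest, 0)
    # Already-enabled peer access is not an error for our purposes.
    return err in (cudart.cudaError_t.cudaSuccess,
                   cudart.cudaError_t.cudaErrorPeerAccessAlreadyEnabled)
```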