NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

Encountered an error: peer access is not supported between these two devices in tensorrt_llm::runtime::IpcMemory::allocateIpcMemory() #1498

Open liu21yd opened 5 months ago

liu21yd commented 5 months ago

I built TensorRT-LLM 0.9.0 from source based on nvcr.io/nvidia/tritonserver:24.02-py3, following the scripts and commands from https://github.com/NVIDIA/TensorRT-LLM/blob/main/docker/Dockerfile.multi.

I converted the checkpoint and built the engine successfully with the following commands:

python3 convert_checkpoint.py \
                        --model_dir xxxx \
                        --output_dir xxxx \
                        --dtype bfloat16 \
                        --tp_size 2 \
                        --pp_size 1 \
                        --vocab_size 124792

trtllm-build --checkpoint_dir xxxx \
             --output_dir xxxx \
             --use_context_fmha_for_generation enable \
             --gemm_plugin bfloat16 \
             --gpt_attention_plugin bfloat16 \
             --context_fmha enable \
             --paged_kv_cache enable \
             --remove_input_padding enable \
             --tokens_per_block 1024 \
             --max_batch_size 1 \
             --max_input_len 8192 \
             --max_output_len 4096 \
             --max_num_tokens 8192 \
             --max_beam_width 1 \
             --tp_size 2 \
             --pp_size 1 

When I start the Triton server, I encounter this error:

E0425 10:43:31.433641 841 backend_model.cc:691] ERROR: Failed to create instance: unexpected error when creating modelInstanceState: [TensorRT-LLM][ERROR] CUDA runtime error in cudaIpcOpenMemHandle( reinterpret_cast<void*>(&foreignBuffer), handles[nodeId], cudaIpcMemLazyEnablePeerAccess): peer access is not supported between these two devices (/app/TensorRT-LLM/TensorRT-LLM/cpp/tensorrt_llm/runtime/ipcUtils.cpp:90)
1  0x7f6ad0647c05 /app/TensorRT-LLM/tensorrtllm_backend/inflight_batcher_llm/../tensorrt_llm/cpp/build/tensorrt_llm/libtensorrt_llm.so(+0xcbc05) [0x7f6ad0647c05]
2  0x7f6ad07824fd tensorrt_llm::runtime::IpcMemory::allocateIpcMemory() + 685
3  0x7f6ad07826eb tensorrt_llm::runtime::IpcMemory::IpcMemory(tensorrt_llm::runtime::WorldConfig const&, unsigned long) + 283
4  0x7f6ad083d3a3 tensorrt_llm::batch_manager::RuntimeBuffers::createCustomAllReduceWorkspace(int, int, int, int, tensorrt_llm::runtime::WorldConfig const&) + 323
5  0x7f6ad083e5c3 tensorrt_llm::batch_manager::RuntimeBuffers::create(int, int, int, int, int, tensorrt_llm::runtime::TllmRuntime&, tensorrt_llm::runtime::GptModelConfig const&, tensorrt_llm::runtime::WorldConfig const&, std::optional<std::vector<std::vector<int, std::allocator >, std::allocator<std::vector<int, std::allocator > > > > const&) + 3427
6  0x7f6ad084c693 tensorrt_llm::batch_manager::TrtGptModelInflightBatching::createBuffers(std::optional<std::vector<std::vector<int, std::allocator >, std::allocator<std::vector<int, std::allocator > > > > const&) + 163
7  0x7f6ad0850149 tensorrt_llm::batch_manager::TrtGptModelInflightBatching::TrtGptModelInflightBatching(int, std::shared_ptr, tensorrt_llm::runtime::GptModelConfig const&, tensorrt_llm::runtime::WorldConfig const&, std::vector<unsigned char, std::allocator > const&, bool, tensorrt_llm::batch_manager::batch_scheduler::SchedulerPolicy, tensorrt_llm::batch_manager::TrtGptModelOptionalParams const&) + 2153
8  0x7f6ad081333c tensorrt_llm::batch_manager::TrtGptModelFactory::create(std::vector<unsigned char, std::allocator > const&, tensorrt_llm::runtime::GptJsonConfig const&, tensorrt_llm::runtime::WorldConfig const&, tensorrt_llm::batch_manager::TrtGptModelType, int, tensorrt_llm::batch_manager::batch_scheduler::SchedulerPolicy, tensorrt_llm::batch_manager::TrtGptModelOptionalParams const&) + 796
9  0x7f6ad08142e2 tensorrt_llm::batch_manager::TrtGptModelFactory::create(std::filesystem::path const&, tensorrt_llm::batch_manager::TrtGptModelType, int, tensorrt_llm::batch_manager::batch_scheduler::SchedulerPolicy, tensorrt_llm::batch_manager::TrtGptModelOptionalParams const&) + 738
10 0x7f6ad0808c05 tensorrt_llm::batch_manager::GptManager::GptManager(std::filesystem::path const&, tensorrt_llm::batch_manager::TrtGptModelType, int, tensorrt_llm::batch_manager::batch_scheduler::SchedulerPolicy, std::function<std::list<std::shared_ptr, std::allocator<std::shared_ptr > > (int)>, std::function<void (unsigned long, std::list<tensorrt_llm::batch_manager::NamedTensor, std::allocator > const&, bool, std::string const&)>, std::function<std::unordered_set<unsigned long, std::hash, std::equal_to, std::allocator > ()>, std::function<void (std::string const&)>, tensorrt_llm::batch_manager::TrtGptModelOptionalParams const&, std::optional, std::optional, bool) + 309
11 0x7f6afcbf0439 triton::backend::inflight_batcher_llm::ModelInstanceState::ModelInstanceState(triton::backend::inflight_batcher_llm::ModelState*, TRITONBACKEND_ModelInstance*, ompi_communicator_t*) + 4761
12 0x7f6afcbf1409 triton::backend::inflight_batcher_llm::ModelInstanceState::Create(triton::backend::inflight_batcher_llm::ModelState*, TRITONBACKEND_ModelInstance*, triton::backend::inflight_batcher_llm::ModelInstanceState**) + 73
13 0x7f6b4003f447 TRITONBACKEND_ModelInstanceInitialize + 743
14 0x7f6b4dd32296 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1ad296) [0x7f6b4dd32296]
15 0x7f6b4dd334d6 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1ae4d6) [0x7f6b4dd334d6]
16 0x7f6b4dd16045 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x191045) [0x7f6b4dd16045]
17 0x7f6b4dd16686 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x191686) [0x7f6b4dd16686]
18 0x7f6b4dd22efd /opt/tritonserver/bin/../lib/libtritonserver.so(+0x19defd) [0x7f6b4dd22efd]
19 0x7f6b4d386ee8 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x99ee8) [0x7f6b4d386ee8]
20 0x7f6b4dd0cf0b /opt/tritonserver/bin/../lib/libtritonserver.so(+0x187f0b) [0x7f6b4dd0cf0b]
21 0x7f6b4dd1dc65 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x198c65) [0x7f6b4dd1dc65]
22 0x7f6b4dd2231e /opt/tritonserver/bin/../lib/libtritonserver.so(+0x19d31e) [0x7f6b4dd2231e]
23 0x7f6b4de140c8 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x28f0c8) [0x7f6b4de140c8]
24 0x7f6b4de179ac /opt/tritonserver/bin/../lib/libtritonserver.so(+0x2929ac) [0x7f6b4de179ac]
25 0x7f6b4df6b6c2 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x3e66c2) [0x7f6b4df6b6c2]
26 0x7f6b4d5f2253 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7f6b4d5f2253]
27 0x7f6b4d381ac3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7f6b4d381ac3]
28 0x7f6b4d412a04 clone + 68

E0425 10:43:31.433756 841 model_lifecycle.cc:638] failed to load 'tensorrt_llm' version 1: Internal: unexpected error when creating modelInstanceState: [TensorRT-LLM][ERROR] CUDA runtime error in cudaIpcOpenMemHandle( reinterpret_cast<void*>(&foreignBuffer), handles[nodeId], cudaIpcMemLazyEnablePeerAccess): peer access is not supported between these two devices (/app/TensorRT-LLM/TensorRT-LLM/cpp/tensorrt_llm/runtime/ipcUtils.cpp:90)
[stack trace identical to the one above]

I0425 10:43:31.433823 841 model_lifecycle.cc:773] failed to load 'tensorrt_llm'
E0425 10:43:31.433970 841 model_repository_manager.cc:579] Invalid argument: ensemble 'ensemble' depends on 'tensorrt_llm' which has no loaded version.

Model 'tensorrt_llm' loading failed with error: version 1 is at UNAVAILABLE state: Internal: unexpected error when creating modelInstanceState: [TensorRT-LLM][ERROR] CUDA runtime error in cudaIpcOpenMemHandle( reinterpret_cast<void*>(&foreignBuffer), handles[nodeId], cudaIpcMemLazyEnablePeerAccess): peer access is not supported between these two devices (/app/TensorRT-LLM/TensorRT-LLM/cpp/tensorrt_llm/runtime/ipcUtils.cpp:90)
[stack trace identical to the one above]

The NCCL INFO log is:

9d31476b19d7:840:858 [0] NCCL INFO Bootstrap : Using eth0:172.17.0.4<0>
9d31476b19d7:840:858 [0] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v7 symbol.
9d31476b19d7:840:858 [0] NCCL INFO NET/Plugin: Loaded net plugin NCCL RDMA Plugin v6 (v6)
9d31476b19d7:840:858 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v7 symbol.
9d31476b19d7:840:858 [0] NCCL INFO NET/Plugin: Loaded coll plugin SHARP (v6)
9d31476b19d7:840:858 [0] NCCL INFO cudaDriverVersion 12000
NCCL version 2.19.3+cuda12.3
9d31476b19d7:840:858 [0] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
9d31476b19d7:840:858 [0] NCCL INFO P2P plugin IBext
9d31476b19d7:840:858 [0] NCCL INFO NET/IB : No device found.
9d31476b19d7:840:858 [0] NCCL INFO NET/IB : No device found.
9d31476b19d7:840:858 [0] NCCL INFO NET/Socket : Using [0]eth0:172.17.0.4<0>
9d31476b19d7:840:858 [0] NCCL INFO Using non-device net plugin version 0
9d31476b19d7:840:858 [0] NCCL INFO Using network Socket
9d31476b19d7:841:854 [1] NCCL INFO cudaDriverVersion 12000
9d31476b19d7:841:854 [1] NCCL INFO Bootstrap : Using eth0:172.17.0.4<0>
9d31476b19d7:841:854 [1] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v7 symbol.
9d31476b19d7:841:854 [1] NCCL INFO NET/Plugin: Loaded net plugin NCCL RDMA Plugin v6 (v6)
9d31476b19d7:841:854 [1] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v7 symbol.
9d31476b19d7:841:854 [1] NCCL INFO NET/Plugin: Loaded coll plugin SHARP (v6)
9d31476b19d7:841:854 [1] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
9d31476b19d7:841:854 [1] NCCL INFO P2P plugin IBext
9d31476b19d7:841:854 [1] NCCL INFO NET/IB : No device found.
9d31476b19d7:841:854 [1] NCCL INFO NET/IB : No device found.
9d31476b19d7:841:854 [1] NCCL INFO NET/Socket : Using [0]eth0:172.17.0.4<0>
9d31476b19d7:841:854 [1] NCCL INFO Using non-device net plugin version 0
9d31476b19d7:841:854 [1] NCCL INFO Using network Socket
9d31476b19d7:841:854 [1] NCCL INFO comm 0x7f5f5fc72b30 rank 1 nranks 2 cudaDev 1 nvmlDev 1 busId 1e000 commId 0x7422ec99aa7ed400 - Init START
9d31476b19d7:840:858 [0] NCCL INFO comm 0x7f436bcfecb0 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 1d000 commId 0x7422ec99aa7ed400 - Init START
9d31476b19d7:841:854 [1] NCCL INFO P2P is disabled between connected GPUs 1 and 0. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
9d31476b19d7:841:854 [1] NCCL INFO P2P is disabled between connected GPUs 0 and 1. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
9d31476b19d7:841:854 [1] NCCL INFO P2P is disabled between connected GPUs 1 and 0. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
9d31476b19d7:841:854 [1] NCCL INFO P2P is disabled between connected GPUs 0 and 1. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
9d31476b19d7:841:854 [1] NCCL INFO Setting affinity for GPU 1 to ffff,0000ffff
9d31476b19d7:840:858 [0] NCCL INFO P2P is disabled between connected GPUs 1 and 0. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
9d31476b19d7:840:858 [0] NCCL INFO P2P is disabled between connected GPUs 0 and 1. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
9d31476b19d7:840:858 [0] NCCL INFO P2P is disabled between connected GPUs 1 and 0. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
9d31476b19d7:840:858 [0] NCCL INFO P2P is disabled between connected GPUs 0 and 1. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
9d31476b19d7:840:858 [0] NCCL INFO Setting affinity for GPU 0 to ffff,0000ffff
9d31476b19d7:840:858 [0] NCCL INFO Channel 00/02 : 0 1
9d31476b19d7:840:858 [0] NCCL INFO Channel 01/02 : 0 1
9d31476b19d7:840:858 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1
9d31476b19d7:840:858 [0] NCCL INFO P2P Chunksize set to 131072
9d31476b19d7:841:854 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0
9d31476b19d7:841:854 [1] NCCL INFO P2P Chunksize set to 131072
9d31476b19d7:841:854 [1] NCCL INFO Channel 00 : 1[1] -> 0[0] via SHM/direct/direct
9d31476b19d7:841:854 [1] NCCL INFO Channel 01 : 1[1] -> 0[0] via SHM/direct/direct
9d31476b19d7:840:858 [0] NCCL INFO Channel 00 : 0[0] -> 1[1] via SHM/direct/direct
9d31476b19d7:840:858 [0] NCCL INFO Channel 01 : 0[0] -> 1[1] via SHM/direct/direct
9d31476b19d7:841:854 [1] NCCL INFO Connected all rings
9d31476b19d7:841:854 [1] NCCL INFO Connected all trees
9d31476b19d7:840:858 [0] NCCL INFO Connected all rings
9d31476b19d7:840:858 [0] NCCL INFO Connected all trees
9d31476b19d7:841:854 [1] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
9d31476b19d7:841:854 [1] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
9d31476b19d7:840:858 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
9d31476b19d7:840:858 [0] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
9d31476b19d7:841:854 [1] NCCL INFO comm 0x7f5f5fc72b30 rank 1 nranks 2 cudaDev 1 nvmlDev 1 busId 1e000 commId 0x7422ec99aa7ed400 - Init COMPLETE
9d31476b19d7:840:858 [0] NCCL INFO comm 0x7f436bcfecb0 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 1d000 commId 0x7422ec99aa7ed400 - Init COMPLETE

Who can help me?

byshiue commented 4 months ago

Could you try adding --use_custom_all_reduce disable when you build the engine?
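For reference, this is the build command from the issue with the suggested flag added (the xxxx placeholders and all other options are unchanged):

trtllm-build --checkpoint_dir xxxx \
             --output_dir xxxx \
             --use_custom_all_reduce disable \
             --use_context_fmha_for_generation enable \
             --gemm_plugin bfloat16 \
             --gpt_attention_plugin bfloat16 \
             --context_fmha enable \
             --paged_kv_cache enable \
             --remove_input_padding enable \
             --tokens_per_block 1024 \
             --max_batch_size 1 \
             --max_input_len 8192 \
             --max_output_len 4096 \
             --max_num_tokens 8192 \
             --max_beam_width 1 \
             --tp_size 2 \
             --pp_size 1

With the custom all-reduce disabled, the tensor-parallel all-reduce goes through NCCL instead of the custom peer-to-peer kernel, so it does not require CUDA peer access between the GPUs.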

dtlzhuangz commented 4 months ago

> Could you try adding --use_custom_all_reduce disable when you build the engine?

I face the same error and the solution works!

byshiue commented 4 months ago

The issue is caused by the network topology. If your topology does not support peer access between the GPUs, then custom_all_reduce is not supported and needs to be disabled.
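A quick way to check whether the GPUs on a machine support peer access (not part of the original reply; these are standard nvidia-smi queries):

# Print the GPU interconnect matrix (NV#, PIX, PXB, PHB, NODE, SYS)
nvidia-smi topo -m

# Print the peer-to-peer read capability between each pair of GPUs
nvidia-smi topo -p2p r

If the GPUs are connected only across the CPU interconnect (SYS), or P2P is disabled in the environment, as the "P2P is disabled between connected GPUs" lines in the NCCL log above indicate, peer access is unavailable and the custom all-reduce should be disabled at engine build time.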

liu21yd commented 4 months ago

> Could you try adding --use_custom_all_reduce disable when you build the engine?

It works, thank you very much.