Open anchorbob opened 10 months ago
Please set NCCL_DEBUG=INFO, run the tests again, and see whether we can get more detailed logs.
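For anyone scripting this, a minimal sketch of one way to do it: copy the current environment, set `NCCL_DEBUG=INFO`, and capture the combined output of the failing run to a file. The helper names and the log path are hypothetical, and the command to run is whatever launch line already reproduces the failure; none of this comes from the TensorRT-LLM code base.

```python
import os
import subprocess


def nccl_debug_env(base_env=None, level="INFO"):
    """Return a copy of the environment with NCCL_DEBUG set to `level`."""
    env = dict(os.environ if base_env is None else base_env)
    env["NCCL_DEBUG"] = level
    return env


def run_with_nccl_debug(cmd, log_path):
    """Run `cmd` with NCCL_DEBUG=INFO; write stdout and stderr to `log_path`.

    `cmd` is a list of arguments (hypothetical here: substitute the real
    mpirun/benchmark invocation). Returns the process exit code.
    """
    with open(log_path, "w") as log:
        proc = subprocess.run(
            cmd,
            env=nccl_debug_env(),
            stdout=log,
            stderr=subprocess.STDOUT,
            check=False,  # the run is expected to fail; keep the exit code
        )
    return proc.returncode
```

Calling `run_with_nccl_debug([...], "nccl_debug.log")` with the failing command would leave the NCCL INFO and WARN lines in `nccl_debug.log` for attaching to the issue.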
@PerkzZheng Please see my log with NCCL_DEBUG=INFO below:
dc:255754:255754 [1] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
dc:255754:255754 [1] NCCL INFO 24 coll channels, 16 nvls channels, 32 p2p channels, 32 p2p channels per peer
dc:255756:255756 [3] NCCL INFO comm 0x55e4ee26b070 rank 3 nranks 4 cudaDev 3 nvmlDev 3 busId 61000 commId 0x7849b58874b511a0 - Init COMPLETE
dc:255754:255754 [1] NCCL INFO comm 0x563c749eae90 rank 1 nranks 4 cudaDev 1 nvmlDev 1 busId 43000 commId 0x7849b58874b511a0 - Init COMPLETE
dc:255755:255755 [2] NCCL INFO comm 0x55f62569e9d0 rank 2 nranks 4 cudaDev 2 nvmlDev 2 busId 52000 commId 0x7849b58874b511a0 - Init COMPLETE
dc:255753:255753 [0] NCCL INFO comm 0x55c6f20b7fc0 rank 0 nranks 4 cudaDev 0 nvmlDev 0 busId 1b000 commId 0x7849b58874b511a0 - Init COMPLETE

dc:255756:255756 [3] enqueue.cc:1182 NCCL WARN Error : no algorithm/protocol available
dc:255756:255756 [3] NCCL INFO enqueue.cc:1283 -> 3
dc:255756:255756 [3] NCCL INFO enqueue.cc:569 -> 3
dc:255756:255756 [3] NCCL INFO enqueue.cc:945 -> 3
dc:255756:255756 [3] NCCL INFO group.cc:130 -> 3
dc:255756:255756 [3] NCCL INFO group.cc:325 -> 3
dc:255756:255756 [3] NCCL INFO group.cc:406 -> 3
dc:255756:255756 [3] NCCL INFO enqueue.cc:1594 -> 3
Failed, NCCL error /code/tensorrt_llm/cpp/tensorrt_llm/plugins/ncclPlugin/allreducePlugin.cpp:184 'internal error - please report this issue to the NCCL developers'

dc:255754:255754 [1] enqueue.cc:1182 NCCL WARN Error : no algorithm/protocol available
dc:255754:255754 [1] NCCL INFO enqueue.cc:1283 -> 3
dc:255754:255754 [1] NCCL INFO enqueue.cc:569 -> 3
dc:255754:255754 [1] NCCL INFO enqueue.cc:945 -> 3
dc:255754:255754 [1] NCCL INFO group.cc:130 -> 3
dc:255754:255754 [1] NCCL INFO group.cc:325 -> 3
dc:255754:255754 [1] NCCL INFO group.cc:406 -> 3
dc:255754:255754 [1] NCCL INFO enqueue.cc:1594 -> 3
Failed, NCCL error /code/tensorrt_llm/cpp/tensorrt_llm/plugins/ncclPlugin/allreducePlugin.cpp:184 'internal error - please report this issue to the NCCL developers'

dc:255753:255753 [0] enqueue.cc:1182 NCCL WARN Error : no algorithm/protocol available
dc:255753:255753 [0] NCCL INFO enqueue.cc:1283 -> 3
dc:255753:255753 [0] NCCL INFO enqueue.cc:569 -> 3
dc:255753:255753 [0] NCCL INFO enqueue.cc:945 -> 3
dc:255753:255753 [0] NCCL INFO group.cc:130 -> 3
dc:255753:255753 [0] NCCL INFO group.cc:325 -> 3
dc:255753:255753 [0] NCCL INFO group.cc:406 -> 3
dc:255753:255753 [0] NCCL INFO enqueue.cc:1594 -> 3
Failed, NCCL error /code/tensorrt_llm/cpp/tensorrt_llm/plugins/ncclPlugin/allreducePlugin.cpp:184 'internal error - please report this issue to the NCCL developers'
mpirun detected that one or more processes exited with non-zero status, thus causing the job to be terminated. The first process to do so was:
Process name: [[47368,1],2]
Exit code: 1
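With logs this long, it can help to pull out only the WARN records. The following is a rough sketch, assuming the `host:pid:tid [index] file:line NCCL WARN message` prefix layout seen in the log above; the function name `extract_warnings` is made up for illustration and is not part of NCCL or TensorRT-LLM. The lookahead ends each message at the next record prefix or at end of line, so it also copes with logs where several records were pasted onto one line.

```python
import re

# "<host>:<pid>:<tid> [<index>] <file>:<line> NCCL WARN <message>"
# The message stops at the next "<host>:<pid>:<tid> [<index>]" prefix
# or at the end of the line, whichever comes first.
WARN_RE = re.compile(
    r"(\S+):(\d+):(\d+) \[(\d+)\] (\S+:\d+) NCCL WARN (.*?)"
    r"(?=\s+\S+:\d+:\d+ \[\d+\]|\s*$)",
    re.MULTILINE,
)


def extract_warnings(log_text):
    """Return (bracketed_index, source_location, message) per WARN record."""
    return [
        (int(m.group(4)), m.group(5), m.group(6))
        for m in WARN_RE.finditer(log_text)
    ]
```

Applied to the log in this comment, each entry would carry the bracketed index from the prefix, the `enqueue.cc:1182` location, and the `no algorithm/protocol available` message.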
Did you set the ALGO explicitly? And could you try nccl-tests with the same environment?
@PerkzZheng I did not set ALGO explicitly. Could you please provide more info on how to run nccl-tests? Does nccl-tests also support different batch sizes and input lengths?
This error only happens when the batch size is greater than or equal to 128 and the input length is 2048, so I don't think it is a general NCCL issue.
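On the nccl-tests question: its binaries (for example `all_reduce_perf`) are driven by message size in bytes, not by batch size or input length, so a comparison needs a conversion. The sketch below assumes the allreduce payload is batch size × input length × hidden size elements; that is an assumption about the tensor-parallel layer, not something confirmed in this thread. `hidden_size` and `dtype_bytes` are placeholders because the model is not stated here, and `./build/all_reduce_perf` is assumed to be the default nccl-tests build output path.

```python
def allreduce_bytes(batch_size, input_len, hidden_size, dtype_bytes=2):
    """Estimated bytes in one allreduce over [batch*input_len, hidden] data.

    Assumption: one element per (token, hidden unit); dtype_bytes=2 is fp16.
    """
    return batch_size * input_len * hidden_size * dtype_bytes


def nccl_tests_command(nbytes, ngpus, binary="./build/all_reduce_perf"):
    """Build an all_reduce_perf command that tests a single message size.

    -b / -e are the min / max message sizes in bytes, -g the GPUs per process.
    """
    return f"{binary} -b {nbytes} -e {nbytes} -g {ngpus}"


if __name__ == "__main__":
    hidden = 4096  # hypothetical hidden size; replace with the model's value
    for batch in (64, 128):
        size = allreduce_bytes(batch, 2048, hidden)
        print(batch, size, nccl_tests_command(size, ngpus=4))
```

Running the working case (batch 64) and the failing case (batch 128) through nccl-tests at the corresponding sizes would show whether the failure tracks message size alone.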
@anchorbob Thanks. We have received similar reports from other users. We are tracking this issue and will keep you posted if we find a solution.
@PerkzZheng, is there any update on this issue? Thanks.
Would you please try our latest code base to see if the issue still exists?
Do you still have any further issues or questions? If not, we'll close this soon.
System Info
Who can help?
@kaiyux @byshiue
Information
Tasks
examples folder (such as GLUE/SQuAD, ...)
Reproduction
Expected behavior
Expected to get a valid perf number, as with other batch size and input length combinations, but it failed.
actual behavior
BS: 128, ISL/OSL: 2048,1
Failed, NCCL error /code/tensorrt_llm/cpp/tensorrt_llm/plugins/ncclPlugin/allreducePlugin.cpp:184 'internal error - please report this issue to the NCCL developers'
Failed, NCCL error /code/tensorrt_llm/cpp/tensorrt_llm/plugins/ncclPlugin/allreducePlugin.cpp:184 'internal error - please report this issue to the NCCL developers'
Failed, NCCL error /code/tensorrt_llm/cpp/tensorrt_llm/plugins/ncclPlugin/allreducePlugin.cpp:184 'internal error - please report this issue to the NCCL developers'
Failed, NCCL error /code/tensorrt_llm/cpp/tensorrt_llm/plugins/ncclPlugin/allreducePlugin.cpp:184 'internal error - please report this issue to the NCCL developers'
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
additional notes
This only happens for batch=128 and input=2048. Other combinations (like batch=64 and input=2048) work well.