NCCL errors while running LLAMA2 70b benchmark shmoo with batch size=128 and input length=2048 on 4 H100 GPUs

anchorbob commented 8 months ago

System Info

CPU Arch x86
4 H100 GPUs
using commit 6cc5e177ff2fb60b1aab3b03fa0534b5181cf0f1

Who can help?

@kaiyux @byshiue

Information

[ ] The official example scripts
[X] My own modified scripts

Tasks

[ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
[X] My own task or dataset (give details below)

Reproduction

python examples/llama/build.py \
    --remove_input_padding \
    --enable_context_fmha \
    --parallel_build \
    --output_dir /tmp/engines/llama/70b \
    --dtype float16 \
    --use_gpt_attention_plugin float16 \
    --world_size 4 \
    --tp_size 4 \
    --pp_size 1 \
    --max_batch_size 128 \
    --max_input_len 2048 \
    --max_output_len 2048 \
    --enable_fp8 \
    --fp8_kv_cache \
    --strongly_typed \
    --n_layer 80 \
    --n_head 64 \
    --n_kv_head 8 \
    --n_embd 8192 \
    --inter_size 28672 \
    --vocab_size 32000 \
    --n_positions 4096 \
    --hidden_act silu \
    --ffn_dim_multiplier 1.3 \
    --multiple_of 4096
mpirun -n 4 --allow-run-as-root --oversubscribe ./cpp/build/benchmarks/gptSessionBenchmark --model llama --engine_dir /tmp/engines/llama/70b --warm_up 1 --batch_size 128 --duration 0 --num_runs 5 --input_output_len 2048,1 done

Expected behavior

I expected valid performance numbers, as with the other batch size and input length combinations, but the run failed.

Actual behavior

BS: 128, ISL/OSL: 2048,1
Failed, NCCL error /code/tensorrt_llm/cpp/tensorrt_llm/plugins/ncclPlugin/allreducePlugin.cpp:184 'internal error - please report this issue to the NCCL developers'
Failed, NCCL error /code/tensorrt_llm/cpp/tensorrt_llm/plugins/ncclPlugin/allreducePlugin.cpp:184 'internal error - please report this issue to the NCCL developers'
Failed, NCCL error /code/tensorrt_llm/cpp/tensorrt_llm/plugins/ncclPlugin/allreducePlugin.cpp:184 'internal error - please report this issue to the NCCL developers'
Failed, NCCL error /code/tensorrt_llm/cpp/tensorrt_llm/plugins/ncclPlugin/allreducePlugin.cpp:184 'internal error - please report this issue to the NCCL developers'

Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.

mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

Process name:   [[39943,1],1]                                               
Exit    code:   1

Additional notes

This only happens for batch=128 and input=2048. Other combinations (like batch=64 and input=2048) work well.
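For reference, a quick size check on the two cases mentioned above. This is only a sketch under an assumption that the thread does not confirm: that the failing allreduce covers the packed context activations of shape [batch * input_len, n_embd] in float16 (2 bytes per element). Under that assumption, batch=128 with input=2048 is the first combination whose element count no longer fits in a signed 32-bit integer, which may be worth checking but is not an established root cause.

```shell
#!/bin/sh
# Assumption (not confirmed in the thread): the allreduce buffer holds
# batch * input_len * n_embd float16 values (2 bytes each).
n_embd=8192          # from --n_embd in the build command
input_len=2048       # from --input_output_len 2048,1
int32_max=2147483647 # largest signed 32-bit integer

for batch in 64 128; do
  elems=$(( batch * input_len * n_embd ))
  bytes=$(( elems * 2 ))
  if [ "$elems" -gt "$int32_max" ]; then fits="no"; else fits="yes"; fi
  echo "batch=$batch elems=$elems bytes=$bytes fits_int32=$fits"
done
```

With these inputs, batch=64 gives 1073741824 elements (2 GiB) and batch=128 gives 2147483648 elements (4 GiB), one past the signed 32-bit maximum.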

PerkzZheng commented 8 months ago

please set NCCL_DEBUG=INFO, run the tests again, and see if we can get more detailed logs.
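For anyone reproducing this, one way to capture those logs is sketched below. It reuses the benchmark command from the report and forwards NCCL_DEBUG=INFO to every rank with the `-x` option of Open MPI's mpirun (the launcher is assumed to be Open MPI, judging by the mpirun messages in the report). The `DRY_RUN` switch and the `run` helper are hypothetical conveniences added here, so the script only prints the command unless told to execute it.

```shell
#!/bin/sh
set -eu

# DRY_RUN is a hypothetical helper switch: 1 prints the command, 0 executes it.
DRY_RUN="${DRY_RUN:-1}"

run() {
  # Print the command; execute it only when DRY_RUN=0.
  echo "+ $*"
  if [ "$DRY_RUN" = "0" ]; then "$@"; fi
}

# Benchmark command from the report, with NCCL_DEBUG=INFO exported to all ranks.
# "-x VAR=value" is Open MPI syntax (assumed launcher).
run mpirun -n 4 --allow-run-as-root --oversubscribe \
  -x NCCL_DEBUG=INFO \
  ./cpp/build/benchmarks/gptSessionBenchmark \
  --model llama --engine_dir /tmp/engines/llama/70b \
  --warm_up 1 --batch_size 128 --duration 0 --num_runs 5 \
  --input_output_len 2048,1
```

Running the script with `DRY_RUN=0` executes the benchmark; redirecting its output to a file keeps the per-rank NCCL lines together for posting.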

anchorbob commented 8 months ago

@PerkzZheng Please see my log with NCCL_DEBUG=INFO below:

dc:255754:255754 [1] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
dc:255754:255754 [1] NCCL INFO 24 coll channels, 16 nvls channels, 32 p2p channels, 32 p2p channels per peer
dc:255756:255756 [3] NCCL INFO comm 0x55e4ee26b070 rank 3 nranks 4 cudaDev 3 nvmlDev 3 busId 61000 commId 0x7849b58874b511a0 - Init COMPLETE
dc:255754:255754 [1] NCCL INFO comm 0x563c749eae90 rank 1 nranks 4 cudaDev 1 nvmlDev 1 busId 43000 commId 0x7849b58874b511a0 - Init COMPLETE
dc:255755:255755 [2] NCCL INFO comm 0x55f62569e9d0 rank 2 nranks 4 cudaDev 2 nvmlDev 2 busId 52000 commId 0x7849b58874b511a0 - Init COMPLETE
dc:255753:255753 [0] NCCL INFO comm 0x55c6f20b7fc0 rank 0 nranks 4 cudaDev 0 nvmlDev 0 busId 1b000 commId 0x7849b58874b511a0 - Init COMPLETE

dc:255756:255756 [3] enqueue.cc:1182 NCCL WARN Error : no algorithm/protocol available
dc:255756:255756 [3] NCCL INFO enqueue.cc:1283 -> 3
dc:255756:255756 [3] NCCL INFO enqueue.cc:569 -> 3
dc:255756:255756 [3] NCCL INFO enqueue.cc:945 -> 3
dc:255756:255756 [3] NCCL INFO group.cc:130 -> 3
dc:255756:255756 [3] NCCL INFO group.cc:325 -> 3
dc:255756:255756 [3] NCCL INFO group.cc:406 -> 3
dc:255756:255756 [3] NCCL INFO enqueue.cc:1594 -> 3
Failed, NCCL error /code/tensorrt_llm/cpp/tensorrt_llm/plugins/ncclPlugin/allreducePlugin.cpp:184 'internal error - please report this issue to the NCCL developers'

dc:255754:255754 [1] enqueue.cc:1182 NCCL WARN Error : no algorithm/protocol available
dc:255754:255754 [1] NCCL INFO enqueue.cc:1283 -> 3
dc:255754:255754 [1] NCCL INFO enqueue.cc:569 -> 3
dc:255754:255754 [1] NCCL INFO enqueue.cc:945 -> 3
dc:255754:255754 [1] NCCL INFO group.cc:130 -> 3
dc:255754:255754 [1] NCCL INFO group.cc:325 -> 3
dc:255754:255754 [1] NCCL INFO group.cc:406 -> 3
dc:255754:255754 [1] NCCL INFO enqueue.cc:1594 -> 3
Failed, NCCL error /code/tensorrt_llm/cpp/tensorrt_llm/plugins/ncclPlugin/allreducePlugin.cpp:184 'internal error - please report this issue to the NCCL developers'

dc:255753:255753 [0] enqueue.cc:1182 NCCL WARN Error : no algorithm/protocol available
dc:255753:255753 [0] NCCL INFO enqueue.cc:1283 -> 3
dc:255753:255753 [0] NCCL INFO enqueue.cc:569 -> 3
dc:255753:255753 [0] NCCL INFO enqueue.cc:945 -> 3
dc:255753:255753 [0] NCCL INFO group.cc:130 -> 3
dc:255753:255753 [0] NCCL INFO group.cc:325 -> 3
dc:255753:255753 [0] NCCL INFO group.cc:406 -> 3
dc:255753:255753 [0] NCCL INFO enqueue.cc:1594 -> 3
Failed, NCCL error /code/tensorrt_llm/cpp/tensorrt_llm/plugins/ncclPlugin/allreducePlugin.cpp:184 'internal error - please report this issue to the NCCL developers'

dc:255755:255755 [2] enqueue.cc:1182 NCCL WARN Error : no algorithm/protocol available
dc:255755:255755 [2] NCCL INFO enqueue.cc:1283 -> 3
dc:255755:255755 [2] NCCL INFO enqueue.cc:569 -> 3
dc:255755:255755 [2] NCCL INFO enqueue.cc:945 -> 3
dc:255755:255755 [2] NCCL INFO group.cc:130 -> 3
dc:255755:255755 [2] NCCL INFO group.cc:325 -> 3
dc:255755:255755 [2] NCCL INFO group.cc:406 -> 3
dc:255755:255755 [2] NCCL INFO enqueue.cc:1594 -> 3
Failed, NCCL error /code/tensorrt_llm/cpp/tensorrt_llm/plugins/ncclPlugin/allreducePlugin.cpp:184 'internal error - please report this issue to the NCCL developers'

Primary job terminated normally, but 1 process returned a non-zero exit code. Per user-direction, the job has been aborted.

mpirun detected that one or more processes exited with non-zero status, thus causing the job to be terminated. The first process to do so was:

Process name: [[47368,1],2]
Exit code: 1

PerkzZheng commented 8 months ago

Did you set the ALGO explicitly? And could you try nccl-tests in the same environment?
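A rough sketch of what such an nccl-tests run could look like on this 4-GPU setup is below. It builds nccl-tests with MPI support and sweeps all-reduce sizes from 1 GiB to 8 GiB in float16 with one GPU per rank. The `MPI_HOME` path is a placeholder, and the `DRY_RUN` switch and `run` helper are hypothetical additions so the script only prints the commands by default; CUDA and NCCL install locations may also need to be passed to `make` on some systems.

```shell
#!/bin/sh
set -eu

# DRY_RUN is a hypothetical helper switch: 1 prints the commands, 0 executes them.
DRY_RUN="${DRY_RUN:-1}"
# Placeholder MPI install location (assumption); adjust for the actual system.
MPI_HOME="${MPI_HOME:-/usr/local/mpi}"

run() {
  # Print the command; execute it only when DRY_RUN=0.
  echo "+ $*"
  if [ "$DRY_RUN" = "0" ]; then "$@"; fi
}

# Fetch and build nccl-tests with MPI support so that each rank drives one GPU.
run git clone https://github.com/NVIDIA/nccl-tests.git
run make -C nccl-tests MPI=1 MPI_HOME="$MPI_HOME"

# All-reduce sweep: -b/-e are the min/max message sizes, -f 2 doubles the size
# each step, -g 1 uses one GPU per rank, -d half selects float16.
run mpirun -n 4 --allow-run-as-root --oversubscribe \
  ./nccl-tests/build/all_reduce_perf -b 1G -e 8G -f 2 -g 1 -d half
```

nccl-tests works in terms of message sizes rather than batch size or input length, so the sweep range is chosen here to bracket multi-gigabyte buffers.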

anchorbob commented 7 months ago

@PerkzZheng I did not set ALGO explicitly. Could you please provide more info on how to run nccl-tests? Does nccl-tests also support different batch sizes and input lengths?

This error only happens when the batch size is greater than or equal to 128 and the input length is 2048. I don't think it is a general NCCL issue.

PerkzZheng commented 7 months ago

@anchorbob Thanks. We have received similar reports from other users. We are tracking this issue and will keep you posted if we find a solution.

anchorbob commented 7 months ago

@PerkzZheng, is there any update on this issue? Thanks.
