Closed XLzed closed 7 hours ago
I see the callstack includes c10d::ProcessGroupMPI::createProcessGroupMPI, which implies the error is happening when PyTorch is initializing MPI. The most likely culprit is https://github.com/NVIDIA/Megatron-LM/blob/0bc3547702464501feefeb5523b7a17e591b21fa/megatron/training/initialize.py#L207. This looks fine to me though, so we'll need to debug further.
The first thing to confirm is that MPI is properly configured on your system. Try running the following script with torchrun:
import torch
torch.distributed.init_process_group(backend="nccl")
torch.distributed.new_group(backend="mpi")
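As an extra sanity check (assuming the container's PyTorch build), you can also confirm that the MPI backend was compiled into PyTorch at all before creating the group:

import torch.distributed as dist

# The MPI backend is only present if PyTorch was built against an MPI library.
# If this prints False, creating an MPI process group will always fail,
# regardless of anything Megatron-LM or TransformerEngine does.
print("MPI backend available:", dist.is_mpi_available())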
If that works, my suspicion falls on the Userbuffers initialization at https://github.com/NVIDIA/Megatron-LM/blob/0bc3547702464501feefeb5523b7a17e591b21fa/megatron/training/initialize.py#L209. Could there be some race condition where running initialize_ub on one process causes MPI initialization to fail on another process? One thing to try is commenting out the call to initialize_ub in Megatron-LM and seeing if it gets past that point (it'll probably error out during the first forward pass). Pinging @denera. It may also be helpful to take a look at https://github.com/NVIDIA/TransformerEngine/issues/827.
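If editing the file directly is inconvenient, a rough way to do that is to guard the Userbuffers setup behind an environment variable. This is only a sketch: the actual call site in initialize.py may look slightly different, and SKIP_INITIALIZE_UB is a made-up variable name.

import os

# Hypothetical debugging guard around the comm+GEMM overlap setup in
# megatron/training/initialize.py -- not the actual upstream code.
if os.getenv("SKIP_INITIALIZE_UB", "0") != "1":
    _initialize_tp_communicators()  # the helper that eventually calls initialize_ub()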
Thanks for your reply. The script failed with the same error. I will re-check the torch MPI backend inside the container and try commenting out the MPI initialization in Megatron-LM.
I see a couple of things of concern here:
- We removed the MPI dependence in Userbuffers with PR #901 ([C/PyTorch] Removed MPI dependence in Userbuffers, merged to TE/main) and recently fixed an initialization hang for certain use cases in PR #986 ([PyTorch] Fixing hang in initialize_ub() for multi-node runs after PR901 removal of MPI-dependence, not merged yet). Consequently, Megatron-LM no longer needs to initialize or launch with MPI to do comm+GEMM overlap. You may need to update the initialization and test again with TE PR #986.
- I see that the CUDA driver version is 470. I may be mistaken here, but I recall that point-to-point comms via CUDA Multicast require driver 535+, so you may need to run with UB_SKIPMC=1 in order to fall back onto the older CUDA IPC based implementation (see the snippet below). The devices participating in the TP overlap also need to be on the same NVLink interconnect.
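If exporting the variable in your launch command is awkward, a rough equivalent (assuming it runs before the Userbuffers communicator is created) is to set it at the top of the training script:

import os

# UB_SKIPMC is read when the Userbuffers communicator is created, so this must
# run before initialize_ub(); exporting it in the launch environment also works.
os.environ.setdefault("UB_SKIPMC", "1")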
Thanks! I deleted the MPI initialization in Megatron-LM and used the code from the PR you mentioned, but the following error occurs whether or not I set UB_SKIPMC=1. Is it because of an NCCL version problem or something else? The NCCL version in my current test environment is 2.20.5.
setting number of micro-batches to constant 2
> building Llama2Tokenizer tokenizer ...
> padded vocab (size: 32000) with 0 dummy tokens (new size: 32000)
> initializing torch distributed ...
> initialized tensor model parallel with size 2
> initialized pipeline model parallel with size 1
> setting random seeds to 1234 ...
> compiling dataset index builder ...
make: Entering directory '/workspace/train_perf/third_party/Megatron-LM/megatron/core/datasets'
make: Nothing to be done for 'default'.
make: Leaving directory '/workspace/train_perf/third_party/Megatron-LM/megatron/core/datasets'
>>> done with dataset index builder. Compilation time: 0.317 seconds
WARNING: constraints for invoking optimized fused softmax kernel are not met. We default back to unfused kernel invocations.
> compiling and loading fused kernels ...
>>> done with compiling and loading fused kernels. Compilation time: 3.465 seconds
!!! [UB] Create UbufP2PCommOverlap Communicator
UB_TIMEOUT is set to 110 sec, 155100000000 cycles, freq: 1410000khz
MC NOT initialized and used
UB: warning region 1 size 32 MB registered without MC access
!!! [UBP2P] Register UBuf 1
UB: warning region 2 size 32 MB registered without MC access
!!! [UBP2P] Register UBuf 2
UB: warning region 3 size 32 MB registered without MC access
!!! [UBP2P] Register UBuf 3
UB: warning region 4 size 32 MB registered without MC access
!!! [UBP2P] Register UBuf 4
UB: warning region 5 size 32 MB registered without MC access
!!! [UB] Register UBuf 5
UB: warning region 6 size 32 MB registered without MC access
!!! [UB] Register UBuf 6
UB: warning region 7 size 32 MB registered without MC access
!!! [UB] Register UBuf 7
UB: warning region 8 size 32 MB registered without MC access
!!! [UB] Register UBuf 8
Failed, NCCL error /tmp/pip-req-build-7n8asrs4/transformer_engine/pytorch/csrc/userbuffers/userbuffers-host.cpp:501 ''
Failed, NCCL error /tmp/pip-req-build-7n8asrs4/transformer_engine/pytorch/csrc/userbuffers/userbuffers-host.cpp:501 ''
Failed, NCCL error /tmp/pip-req-build-7n8asrs4/transformer_engine/pytorch/csrc/userbuffers/userbuffers-host.cpp:501 ''
Failed, NCCL error /tmp/pip-req-build-7n8asrs4/transformer_engine/pytorch/csrc/userbuffers/userbuffers-host.cpp:501 ''
Failed, NCCL error /tmp/pip-req-build-7n8asrs4/transformer_engine/pytorch/csrc/userbuffers/userbuffers-host.cpp:501 ''
Failed, NCCL error /tmp/pip-req-build-7n8asrs4/transformer_engine/pytorch/csrc/userbuffers/userbuffers-host.cpp:501 ''
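For reference, this is how I read the NCCL version from the PyTorch build inside the container (assuming the same environment as the run above):

import torch

# Reports the NCCL version PyTorch was linked against, e.g. (2, 20, 5).
print(torch.cuda.nccl.version())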
@denera @timmoon10 Thanks a lot! After recompiling transformer-engine with MPI and using mpirun to launch the Megatron-LM training, tp-comm-overlap now works without any problems. I hope the version that does not depend on MPI will be released soon.
@XLzed We have already removed the MPI dependence for this in TE/main (see PR #901), and PR #986, which fixes a multi-node bug introduced by PR #901, should merge soon too, pending confirmation of the fix from NeMo/Megatron. Hopefully you'll be able to use this without MPI soon.
When using Megatron-LM with transformer_engine, training core dumps with the argument --tp-comm-overlap.
environment
container
FROM nvcr.io/nvidia/pytorch:24.03-py3
RUN pip install --no-cache-dir tinydb flask sentencepiece git+https://github.com/NVIDIA/TransformerEngine.git@main
code version
cmd
NCCL_IB_GID_INDEX=3 NCCL_NET_GDR_LEVEL=SYS torchrun --nproc_per_node 8 --nnodes 1 --node_rank 0 --master_addr localhost --master_port 6000 pretrain_gpt.py --seq-length 4096 --num-layers 32 --hidden-size 4096 --ffn-hidden-size 11008 --num-attention-heads 32 --max-position-embeddings 4096 --disable-bias-linear --init-method-std 0.01 --attention-dropout 0.0 --hidden-dropout 0.0 --normalization RMSNorm --position-embedding-type rope --swiglu --untie-embeddings-and-output-weights --no-masked-softmax-fusion --no-position-embedding --tokenizer-type Llama2Tokenizer --tokenizer-model /workspace/train_perf/train_scripts/data_and_tokenizer/llama2/tokenizer/tokenizer.model --data-path /workspace/train_perf/train_scripts/data_and_tokenizer/llama2/preprocessed_data/CC_text_document --split 97,2,1 --micro-batch-size 1 --global-batch-size 8 --lr 1e-4 --train-iters 20 --lr-decay-iters 15 --lr-warmup-iters 5 --lr-decay-style cosine --min-lr 1.0e-5 --weight-decay 0.1 --clip-grad 1.0 --bf16 --tensor-model-parallel-size 2 --pipeline-model-parallel-size 1 --sequence-parallel --use-distributed-optimizer --use-flash-attn --use-mcore-models --overlap-grad-reduce --overlap-param-gather --log-interval 1 --save-interval 10000 --eval-interval 1000 --eval-iters 1 --log-throughput --no-load-optim --no-load-rng --tp-comm-overlap
log
setting number of micro-batches to constant 2