Closed XLzed closed 7 hours ago
I see the callstack includes c10d::ProcessGroupMPI::createProcessGroupMPI, which implies the error is happening when PyTorch is initializing MPI. The most likely culprit is https://github.com/NVIDIA/Megatron-LM/blob/0bc3547702464501feefeb5523b7a17e591b21fa/megatron/training/initialize.py#L207. This looks fine to me though, so we'll need to debug further.
The first thing to confirm is that MPI is properly configured on your system. Try running the following script with torchrun:
import torch
torch.distributed.init_process_group(backend="nccl")
torch.distributed.new_group(backend="mpi")
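As an extra sanity check (assuming the container's PyTorch build), you can also confirm that the MPI backend was compiled into PyTorch at all before creating the group:

import torch.distributed as dist

# The MPI backend is only present if PyTorch was built against an MPI library.
# If this prints False, creating an MPI process group will always fail,
# regardless of anything Megatron-LM or TransformerEngine does.
print("MPI backend available:", dist.is_mpi_available())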
If that works, my suspicion falls on the Userbuffers initialization at https://github.com/NVIDIA/Megatron-LM/blob/0bc3547702464501feefeb5523b7a17e591b21fa/megatron/training/initialize.py#L209. Could there be some race condition where running initialize_ub on one process causes MPI initialization to fail on another process? One thing to try is commenting out the call to initialize_ub in Megatron-LM and seeing if it gets past that point (it'll probably error out during the first forward pass). Pinging @denera. It may also be helpful to take a look at https://github.com/NVIDIA/TransformerEngine/issues/827.
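If editing the file directly is inconvenient, a rough way to do that is to guard the Userbuffers setup behind an environment variable. This is only a sketch: the actual call site in initialize.py may look slightly different, and SKIP_INITIALIZE_UB is a made-up variable name.

import os

# Hypothetical debugging guard around the comm+GEMM overlap setup in
# megatron/training/initialize.py -- not the actual upstream code.
if os.getenv("SKIP_INITIALIZE_UB", "0") != "1":
    _initialize_tp_communicators()  # the helper that eventually calls initialize_ub()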
Thanks for your reply. The script failed with the same error. I will re-check the torch MPI backend inside the container and try commenting out the MPI initialization in Megatron-LM.
I see a couple of things of concern here:
- We removed the MPI dependence in Userbuffers with PR #901 ([C/PyTorch] Removed MPI dependence in Userbuffers, merged to TE/main) and recently fixed an initialization hang for certain use cases in PR #986 ([PyTorch] Fixing hang in initialize_ub() for multi-node runs after PR901 removal of MPI-dependence, not merged yet). Consequently, Megatron-LM no longer needs to initialize or launch with MPI to do comm+GEMM overlap. You may need to update the initialization and test again with TE PR #986.
- I see that the CUDA driver version is 470. I may be mistaken here, but I recall that point-to-point comms via CUDA Multicast require driver 535+, so you may need to run with UB_SKIPMC=1 in order to fall back onto the older CUDA IPC based implementation (see the snippet below). The devices participating in the TP overlap also need to be on the same NVLink interconnect.
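If exporting the variable in your launch command is awkward, a rough equivalent (assuming it runs before the Userbuffers communicator is created) is to set it at the top of the training script:

import os

# UB_SKIPMC is read when the Userbuffers communicator is created, so this must
# run before initialize_ub(); exporting it in the launch environment also works.
os.environ.setdefault("UB_SKIPMC", "1")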
Thanks! I deleted the MPI initialization in Megatron-LM and used the code from the PR you mentioned, but the following error occurs whether or not I set UB_SKIPMC=1. Is it because of an NCCL version problem or something else? The NCCL version in my current test environment is 2.20.5.
setting number of micro-batches to constant 2
> building Llama2Tokenizer tokenizer ...
> padded vocab (size: 32000) with 0 dummy tokens (new size: 32000)
> initializing torch distributed ...
> initialized tensor model parallel with size 2
> initialized pipeline model parallel with size 1
> setting random seeds to 1234 ...
> compiling dataset index builder ...
make: Entering directory '/workspace/train_perf/third_party/Megatron-LM/megatron/core/datasets'
make: Nothing to be done for 'default'.
make: Leaving directory '/workspace/train_perf/third_party/Megatron-LM/megatron/core/datasets'
>>> done with dataset index builder. Compilation time: 0.317 seconds
WARNING: constraints for invoking optimized fused softmax kernel are not met. We default back to unfused kernel invocations.
> compiling and loading fused kernels ...
>>> done with compiling and loading fused kernels. Compilation time: 3.465 seconds
!!! [UB] Create UbufP2PCommOverlap Communicator
UB_TIMEOUT is set to 110 sec, 155100000000 cycles, freq: 1410000khz
MC NOT initialized and used
UB: warning region 1 size 32 MB registered without MC access
!!! [UBP2P] Register UBuf 1
UB: warning region 2 size 32 MB registered without MC access
!!! [UBP2P] Register UBuf 2
UB: warning region 3 size 32 MB registered without MC access
!!! [UBP2P] Register UBuf 3
UB: warning region 4 size 32 MB registered without MC access
!!! [UBP2P] Register UBuf 4
UB: warning region 5 size 32 MB registered without MC access
!!! [UB] Register UBuf 5
UB: warning region 6 size 32 MB registered without MC access
!!! [UB] Register UBuf 6
UB: warning region 7 size 32 MB registered without MC access
!!! [UB] Register UBuf 7
UB: warning region 8 size 32 MB registered without MC access
!!! [UB] Register UBuf 8
Failed, NCCL error /tmp/pip-req-build-7n8asrs4/transformer_engine/pytorch/csrc/userbuffers/userbuffers-host.cpp:501 ''
Failed, NCCL error /tmp/pip-req-build-7n8asrs4/transformer_engine/pytorch/csrc/userbuffers/userbuffers-host.cpp:501 ''
Failed, NCCL error /tmp/pip-req-build-7n8asrs4/transformer_engine/pytorch/csrc/userbuffers/userbuffers-host.cpp:501 ''
Failed, NCCL error /tmp/pip-req-build-7n8asrs4/transformer_engine/pytorch/csrc/userbuffers/userbuffers-host.cpp:501 ''
Failed, NCCL error /tmp/pip-req-build-7n8asrs4/transformer_engine/pytorch/csrc/userbuffers/userbuffers-host.cpp:501 ''
Failed, NCCL error /tmp/pip-req-build-7n8asrs4/transformer_engine/pytorch/csrc/userbuffers/userbuffers-host.cpp:501 ''
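For reference, this is how I read the NCCL version from the PyTorch build inside the container (assuming the same environment as the run above):

import torch

# Reports the NCCL version PyTorch was linked against, e.g. (2, 20, 5).
print(torch.cuda.nccl.version())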
@denera @timmoon10 Thanks a lot! After recompiling transformer-engine with MPI and using mpirun to launch the Megatron-LM training, tp-comm-overlap now works without any problems. I hope the version that does not depend on MPI will be released soon.
@XLzed We have already removed the MPI dependence for this in TE/main (see PR #901), and PR #986, which fixes a multi-node bug introduced by PR #901, should merge soon too, pending confirmation of the fix from NeMo/Megatron. Hopefully you'll be able to use this without MPI soon.
When using Megatron-LM with transformer_engine, training core dumps with the argument --tp-comm-overlap.
environment
container
FROM nvcr.io/nvidia/pytorch:24.03-py3
RUN pip install --no-cache-dir tinydb flask sentencepiece git+https://github.com/NVIDIA/TransformerEngine.git@main
code version
cmd
NCCL_IB_GID_INDEX=3 NCCL_NET_GDR_LEVEL=SYS torchrun --nproc_per_node 8 --nnodes 1 --node_rank 0 --master_addr localhost --master_port 6000 pretrain_gpt.py --seq-length 4096 --num-layers 32 --hidden-size 4096 --ffn-hidden-size 11008 --num-attention-heads 32 --max-position-embeddings 4096 --disable-bias-linear --init-method-std 0.01 --attention-dropout 0.0 --hidden-dropout 0.0 --normalization RMSNorm --position-embedding-type rope --swiglu --untie-embeddings-and-output-weights --no-masked-softmax-fusion --no-position-embedding --tokenizer-type Llama2Tokenizer --tokenizer-model /workspace/train_perf/train_scripts/data_and_tokenizer/llama2/tokenizer/tokenizer.model --data-path /workspace/train_perf/train_scripts/data_and_tokenizer/llama2/preprocessed_data/CC_text_document --split 97,2,1 --micro-batch-size 1 --global-batch-size 8 --lr 1e-4 --train-iters 20 --lr-decay-iters 15 --lr-warmup-iters 5 --lr-decay-style cosine --min-lr 1.0e-5 --weight-decay 0.1 --clip-grad 1.0 --bf16 --tensor-model-parallel-size 2 --pipeline-model-parallel-size 1 --sequence-parallel --use-distributed-optimizer --use-flash-attn --use-mcore-models --overlap-grad-reduce --overlap-param-gather --log-interval 1 --save-interval 10000 --eval-interval 1000 --eval-iters 1 --log-throughput --no-load-optim --no-load-rng --tp-comm-overlap
log
setting number of micro-batches to constant 2