NVIDIA / Megatron-LM


[BUG] llava pipeline parallel initialization problem #998

Open KookHoiKim opened 1 month ago

KookHoiKim commented 1 month ago

Describe the bug
I am currently working with the LLaVA model in Megatron. Tensor parallelism works fine, but when I enable pipeline parallelism, the job gets stuck during initialization. I found that in initialize_model_parallel, the call group_gloo = torch.distributed.new_group(ranks, backend="gloo") never completes for the rank-1 GPU. I am using 2 A100 GPUs, so I set TP=1 (and PP=2). If anyone has any idea, please help. Thanks.
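
For context, here is a toy script (my own sketch, not Megatron code) illustrating the rule I think is relevant: torch.distributed.new_group() is a collective call over the default process group, so every rank must enter it, even ranks that will not be members of the new group. If the rank-1 process never reaches the backend="gloo" call, rank 0 blocks in it indefinitely, which matches the hang I see.

    # toy_new_group.py -- launch with: torchrun --nproc_per_node=2 toy_new_group.py
    # Illustrative only; the file name and structure are my own, not from Megatron-LM.
    import os
    import torch
    import torch.distributed as dist

    def main():
        rank = int(os.environ["RANK"])
        world_size = int(os.environ["WORLD_SIZE"])
        local_rank = int(os.environ["LOCAL_RANK"])
        torch.cuda.set_device(local_rank)
        dist.init_process_group(backend="nccl", rank=rank, world_size=world_size)

        # Correct usage: every rank calls new_group, even though only rank 0
        # will actually be a member of the gloo subgroup.
        gloo_group = dist.new_group(ranks=[0], backend="gloo")

        # Hang pattern (kept commented out on purpose): if only rank 0 entered
        # the call, it would wait forever for rank 1.
        # if rank == 0:
        #     gloo_group = dist.new_group(ranks=[0], backend="gloo")

        dist.barrier()
        dist.destroy_process_group()

    if __name__ == "__main__":
        main()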

FYI, I ran with NCCL_DEBUG=INFO; the log is below.

run297066-megatron:76841:76841 [0] NCCL INFO Bootstrap : Using eth0:10.15.140.244<0>
run297066-megatron:76841:76841 [0] NCCL INFO cudaDriverVersion 12050
run297066-megatron:76841:76841 [0] NCCL INFO NCCL version 2.22.3+cuda12.5
> setting tensorboard ...
WARNING: one_logger package is required to enable e2e metrics tracking. please go to https://confluence.nvidia.com/display/MLWFO/Package+Repositories for details to install it
run297066-megatron:76841:76991 [0] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
run297066-megatron:76841:76991 [0] NCCL INFO P2P plugin IBext_v8
run297066-megatron:76841:76991 [0] NCCL INFO NET/IB : No device found.
run297066-megatron:76841:76991 [0] NCCL INFO NET/IB : No device found.
run297066-megatron:76841:76991 [0] NCCL INFO NET/Socket : Using [0]eth0:10.15.140.244<0>
run297066-megatron:76841:76991 [0] NCCL INFO Using network Socket
run297066-megatron:76842:76842 [1] NCCL INFO cudaDriverVersion 12050
run297066-megatron:76842:76842 [1] NCCL INFO Bootstrap : Using eth0:10.15.140.244<0>
run297066-megatron:76842:76842 [1] NCCL INFO NCCL version 2.22.3+cuda12.5
run297066-megatron:76842:76996 [1] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
run297066-megatron:76842:76996 [1] NCCL INFO P2P plugin IBext_v8
run297066-megatron:76842:76996 [1] NCCL INFO NET/IB : No device found.
run297066-megatron:76842:76996 [1] NCCL INFO NET/IB : No device found.
run297066-megatron:76842:76996 [1] NCCL INFO NET/Socket : Using [0]eth0:10.15.140.244<0>
run297066-megatron:76842:76996 [1] NCCL INFO Using network Socket
run297066-megatron:76842:76996 [1] NCCL INFO ncclCommInitRank comm 0x55db1283efc0 rank 1 nranks 2 cudaDev 1 nvmlDev 1 busId cb000 commId 0x279e4fc0fff67dc0 - Init START
run297066-megatron:76841:76991 [0] NCCL INFO ncclCommInitRank comm 0x5577888c8650 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 7000 commId 0x279e4fc0fff67dc0 - Init START
run297066-megatron:76841:76991 [0] NCCL INFO Setting affinity for GPU 0 to ffff0000,00000000,00000000,00000000,ffff0000,00000000
run297066-megatron:76842:76996 [1] NCCL INFO Setting affinity for GPU 1 to ffff0000,00000000,00000000,00000000,ffff0000,00000000,00000000
run297066-megatron:76841:76991 [0] NCCL INFO comm 0x5577888c8650 rank 0 nRanks 2 nNodes 1 localRanks 2 localRank 0 MNNVL 0
run297066-megatron:76842:76996 [1] NCCL INFO comm 0x55db1283efc0 rank 1 nRanks 2 nNodes 1 localRanks 2 localRank 1 MNNVL 0
run297066-megatron:76841:76991 [0] NCCL INFO Channel 00/24 :    0   1
run297066-megatron:76842:76996 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0 [2] -1/-1/-1->1->0 [3] -1/-1/-1->1->0 [4] -1/-1/-1->1->0 [5] -1/-1/-1->1->0 [6] 0/-1/-1->1->-1 [7] 0/-1/-1->1->-1 [8] 0/-1/-1->1->-1 [9] 0/-1/-1->1->-1 [10] 0/-1/-1->1->-1 [11] 0/-1/-1->1->-1 [12] -1/-1/-1->1->0 [13] -1/-1/-1->1->0 [14] -1/-1/-1->1->0 [15] -1/-1/-1->1->0 [16] -1/-1/-1->1->0 [17] -1/-1/-1->1->0 [18] 0/-1/-1->1->-1 [19] 0/-1/-1->1->-1 [20] 0/-1/-1->1->-1 [21] 0/-1/-1->1->-1 [22] 0/-1/-1->1->-1 [23] 0/-1/-1->1->-1
run297066-megatron:76841:76991 [0] NCCL INFO Channel 01/24 :    0   1
run297066-megatron:76842:76996 [1] NCCL INFO P2P Chunksize set to 524288
run297066-megatron:76841:76991 [0] NCCL INFO Channel 02/24 :    0   1
run297066-megatron:76841:76991 [0] NCCL INFO Channel 03/24 :    0   1
run297066-megatron:76841:76991 [0] NCCL INFO Channel 04/24 :    0   1
run297066-megatron:76841:76991 [0] NCCL INFO Channel 05/24 :    0   1
run297066-megatron:76841:76991 [0] NCCL INFO Channel 06/24 :    0   1
run297066-megatron:76841:76991 [0] NCCL INFO Channel 07/24 :    0   1
run297066-megatron:76841:76991 [0] NCCL INFO Channel 08/24 :    0   1
run297066-megatron:76841:76991 [0] NCCL INFO Channel 09/24 :    0   1
run297066-megatron:76841:76991 [0] NCCL INFO Channel 10/24 :    0   1
run297066-megatron:76841:76991 [0] NCCL INFO Channel 11/24 :    0   1
run297066-megatron:76841:76991 [0] NCCL INFO Channel 12/24 :    0   1
run297066-megatron:76841:76991 [0] NCCL INFO Channel 13/24 :    0   1
run297066-megatron:76841:76991 [0] NCCL INFO Channel 14/24 :    0   1
run297066-megatron:76841:76991 [0] NCCL INFO Channel 15/24 :    0   1
run297066-megatron:76841:76991 [0] NCCL INFO Channel 16/24 :    0   1
run297066-megatron:76841:76991 [0] NCCL INFO Channel 17/24 :    0   1
run297066-megatron:76841:76991 [0] NCCL INFO Channel 18/24 :    0   1
run297066-megatron:76841:76991 [0] NCCL INFO Channel 19/24 :    0   1
run297066-megatron:76841:76991 [0] NCCL INFO Channel 20/24 :    0   1
run297066-megatron:76841:76991 [0] NCCL INFO Channel 21/24 :    0   1
run297066-megatron:76841:76991 [0] NCCL INFO Channel 22/24 :    0   1
run297066-megatron:76841:76991 [0] NCCL INFO Channel 23/24 :    0   1
run297066-megatron:76841:76991 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] 1/-1/-1->0->-1 [3] 1/-1/-1->0->-1 [4] 1/-1/-1->0->-1 [5] 1/-1/-1->0->-1 [6] -1/-1/-1->0->1 [7] -1/-1/-1->0->1 [8] -1/-1/-1->0->1 [9] -1/-1/-1->0->1 [10] -1/-1/-1->0->1 [11] -1/-1/-1->0->1 [12] 1/-1/-1->0->-1 [13] 1/-1/-1->0->-1 [14] 1/-1/-1->0->-1 [15] 1/-1/-1->0->-1 [16] 1/-1/-1->0->-1 [17] 1/-1/-1->0->-1 [18] -1/-1/-1->0->1 [19] -1/-1/-1->0->1 [20] -1/-1/-1->0->1 [21] -1/-1/-1->0->1 [22] -1/-1/-1->0->1 [23] -1/-1/-1->0->1
run297066-megatron:76841:76991 [0] NCCL INFO P2P Chunksize set to 524288
run297066-megatron:76842:76996 [1] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
run297066-megatron:76842:76996 [1] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
run297066-megatron:76841:76991 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
run297066-megatron:76841:76991 [0] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
run297066-megatron:76841:76991 [0] NCCL INFO CC Off, Multi-GPU CC Off, workFifoBytes 1048576
run297066-megatron:76842:76996 [1] NCCL INFO TUNER/Plugin: Failed to find ncclTunerPlugin_v3 symbol.
run297066-megatron:76842:76996 [1] NCCL INFO TUNER/Plugin: Failed to find ncclTunerPlugin_v2 symbol, using internal tuner instead.
run297066-megatron:76842:76996 [1] NCCL INFO ncclCommInitRank comm 0x55db1283efc0 rank 1 nranks 2 cudaDev 1 nvmlDev 1 busId cb000 commId 0x279e4fc0fff67dc0 - Init COMPLETE
run297066-megatron:76841:76991 [0] NCCL INFO TUNER/Plugin: Failed to find ncclTunerPlugin_v3 symbol.
run297066-megatron:76842:76996 [1] NCCL INFO Init timings: rank 1 nranks 2 total 0.21 (kernels 0.10, bootstrap 0.01, allgathers 0.00, topo 0.08, graphs 0.00, connections 0.02, rest 0.01)
run297066-megatron:76841:76991 [0] NCCL INFO TUNER/Plugin: Failed to find ncclTunerPlugin_v2 symbol, using internal tuner instead.
run297066-megatron:76841:76991 [0] NCCL INFO ncclCommInitRank comm 0x5577888c8650 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 7000 commId 0x279e4fc0fff67dc0 - Init COMPLETE
run297066-megatron:76841:76991 [0] NCCL INFO Init timings: rank 0 nranks 2 total 0.48 (kernels 0.11, bootstrap 0.26, allgathers 0.02, topo 0.07, graphs 0.00, connections 0.02, rest 0.01)
run297066-megatron:76842:76842 [1] NCCL INFO Rank 1 has color with NCCL_SPLIT_NOCOLOR, not creating a new communicator
run297066-megatron:76841:76841 [0] NCCL INFO Rank 0 has color with NCCL_SPLIT_NOCOLOR, not creating a new communicator
run297066-megatron:76842:76842 [1] NCCL INFO Rank 1 has color with NCCL_SPLIT_NOCOLOR, not creating a new communicator
run297066-megatron:76841:76841 [0] NCCL INFO Rank 0 has color with NCCL_SPLIT_NOCOLOR, not creating a new communicator
run297066-megatron:76842:76842 [1] NCCL INFO Rank 1 has color with NCCL_SPLIT_NOCOLOR, not creating a new communicator
run297066-megatron:76841:76841 [0] NCCL INFO Rank 0 has color with NCCL_SPLIT_NOCOLOR, not creating a new communicator
run297066-megatron:76841:76841 [0] NCCL INFO Rank 0 has color with NCCL_SPLIT_NOCOLOR, not creating a new communicator
run297066-megatron:76842:76842 [1] NCCL INFO Rank 1 has color with NCCL_SPLIT_NOCOLOR, not creating a new communicator
run297066-megatron:76841:76841 [0] NCCL INFO Rank 0 has color with NCCL_SPLIT_NOCOLOR, not creating a new communicator
run297066-megatron:76842:76842 [1] NCCL INFO Rank 1 has color with NCCL_SPLIT_NOCOLOR, not creating a new communicator
run297066-megatron:76842:76842 [1] NCCL INFO Rank 1 has color with NCCL_SPLIT_NOCOLOR, not creating a new communicator
run297066-megatron:76841:76841 [0] NCCL INFO Rank 0 has color with NCCL_SPLIT_NOCOLOR, not creating a new communicator
run297066-megatron:76842:76842 [1] NCCL INFO Rank 1 has color with NCCL_SPLIT_NOCOLOR, not creating a new communicator
run297066-megatron:76841:76841 [0] NCCL INFO Rank 0 has color with NCCL_SPLIT_NOCOLOR, not creating a new communicator
run297066-megatron:76842:76842 [1] NCCL INFO Rank 1 has color with NCCL_SPLIT_NOCOLOR, not creating a new communicator
run297066-megatron:76841:76841 [0] NCCL INFO Rank 0 has color with NCCL_SPLIT_NOCOLOR, not creating a new communicator
run297066-megatron:76841:76841 [0] NCCL INFO Rank 0 has color with NCCL_SPLIT_NOCOLOR, not creating a new communicator
run297066-megatron:76842:76842 [1] NCCL INFO Rank 1 has color with NCCL_SPLIT_NOCOLOR, not creating a new communicator
run297066-megatron:76841:76841 [0] NCCL INFO Rank 0 has color with NCCL_SPLIT_NOCOLOR, not creating a new communicator
run297066-megatron:76842:76842 [1] NCCL INFO Rank 1 has color with NCCL_SPLIT_NOCOLOR, not creating a new communicator
run297066-megatron:76842:76842 [1] NCCL INFO Rank 1 has color with NCCL_SPLIT_NOCOLOR, not creating a new communicator
run297066-megatron:76841:76841 [0] NCCL INFO Rank 0 has color with NCCL_SPLIT_NOCOLOR, not creating a new communicator
run297066-megatron:76842:76842 [1] NCCL INFO Rank 1 has color with NCCL_SPLIT_NOCOLOR, not creating a new communicator
run297066-megatron:76841:76841 [0] NCCL INFO Rank 0 has color with NCCL_SPLIT_NOCOLOR, not creating a new communicator
run297066-megatron:76841:76841 [0] NCCL INFO Rank 0 has color with NCCL_SPLIT_NOCOLOR, not creating a new communicator
run297066-megatron:76842:76842 [1] NCCL INFO Rank 1 has color with NCCL_SPLIT_NOCOLOR, not creating a new communicator
run297066-megatron:76842:76842 [1] NCCL INFO Rank 1 has color with NCCL_SPLIT_NOCOLOR, not creating a new communicator
> initialized tensor model parallel with size 1
> initialized pipeline model parallel with size 2
> setting random seeds to 1234 ...
> compiling dataset index builder ...
make: Entering directory '/khoi.kim/workspace/code/Megatron-LM/megatron/core/datasets'
make: Nothing to be done for 'default'.
make: Leaving directory '/khoi.kim/workspace/code/Megatron-LM/megatron/core/datasets'
>>> done with dataset index builder. Compilation time: 0.118 seconds
WARNING: constraints for invoking optimized fused softmax kernel are not met. We default back to unfused kernel invocations.
> compiling and loading fused kernels ...
run297066-megatron:76841:77141 [0] NCCL INFO Channel 00/0 : 0[0] -> 1[1] via P2P/CUMEM/read
run297066-megatron:76841:77141 [0] NCCL INFO Channel 01/0 : 0[0] -> 1[1] via P2P/CUMEM/read
run297066-megatron:76841:77141 [0] NCCL INFO Channel 02/0 : 0[0] -> 1[1] via P2P/CUMEM/read
run297066-megatron:76841:77141 [0] NCCL INFO Channel 03/0 : 0[0] -> 1[1] via P2P/CUMEM/read
run297066-megatron:76841:77141 [0] NCCL INFO Channel 04/0 : 0[0] -> 1[1] via P2P/CUMEM/read
run297066-megatron:76841:77141 [0] NCCL INFO Channel 05/0 : 0[0] -> 1[1] via P2P/CUMEM/read
run297066-megatron:76841:77141 [0] NCCL INFO Channel 06/0 : 0[0] -> 1[1] via P2P/CUMEM/read
run297066-megatron:76841:77141 [0] NCCL INFO Channel 07/0 : 0[0] -> 1[1] via P2P/CUMEM/read
run297066-megatron:76841:77141 [0] NCCL INFO Channel 08/0 : 0[0] -> 1[1] via P2P/CUMEM/read
run297066-megatron:76841:77141 [0] NCCL INFO Channel 09/0 : 0[0] -> 1[1] via P2P/CUMEM/read
run297066-megatron:76841:77141 [0] NCCL INFO Channel 10/0 : 0[0] -> 1[1] via P2P/CUMEM/read
run297066-megatron:76841:77141 [0] NCCL INFO Channel 11/0 : 0[0] -> 1[1] via P2P/CUMEM/read
run297066-megatron:76841:77141 [0] NCCL INFO Channel 12/0 : 0[0] -> 1[1] via P2P/CUMEM/read
run297066-megatron:76841:77141 [0] NCCL INFO Channel 13/0 : 0[0] -> 1[1] via P2P/CUMEM/read
run297066-megatron:76841:77141 [0] NCCL INFO Channel 14/0 : 0[0] -> 1[1] via P2P/CUMEM/read
run297066-megatron:76841:77141 [0] NCCL INFO Channel 15/0 : 0[0] -> 1[1] via P2P/CUMEM/read
run297066-megatron:76841:77141 [0] NCCL INFO Channel 16/0 : 0[0] -> 1[1] via P2P/CUMEM/read
run297066-megatron:76841:77141 [0] NCCL INFO Channel 17/0 : 0[0] -> 1[1] via P2P/CUMEM/read
run297066-megatron:76841:77141 [0] NCCL INFO Channel 18/0 : 0[0] -> 1[1] via P2P/CUMEM/read
run297066-megatron:76841:77141 [0] NCCL INFO Channel 19/0 : 0[0] -> 1[1] via P2P/CUMEM/read
run297066-megatron:76841:77141 [0] NCCL INFO Channel 20/0 : 0[0] -> 1[1] via P2P/CUMEM/read
run297066-megatron:76841:77141 [0] NCCL INFO Channel 21/0 : 0[0] -> 1[1] via P2P/CUMEM/read
run297066-megatron:76841:77141 [0] NCCL INFO Channel 22/0 : 0[0] -> 1[1] via P2P/CUMEM/read
run297066-megatron:76841:77141 [0] NCCL INFO Channel 23/0 : 0[0] -> 1[1] via P2P/CUMEM/read
KookHoiKim commented 1 month ago

I am working with the nvcr.io/nvidia/pytorch:24.07-py3 image, which comes with torch==2.4.0. When I comment out lines 256-257 in initialize.py (the two lines shown commented out below), initialization no longer gets stuck.

        # Call the init process
        init_process_group_kwargs = {
            'backend' : args.distributed_backend,
            'world_size': args.world_size,
            'rank': args.rank,
            'timeout': timedelta(minutes=args.distributed_timeout_minutes),
        }
        # if packaging.version.Version(torch.__version__) >= packaging.version.Version("2.3.0"):
        #     init_process_group_kwargs['device_id'] = device_id
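
If it helps, a variant that is easier to toggle than editing the file each time is to gate device_id behind an environment variable. This is only my own sketch of the same block (the MEGATRON_PASS_DEVICE_ID name is made up, not an upstream flag), and it assumes os, packaging, torch, timedelta, args, and device_id are already in scope in initialize.py:

        # My own sketch, not upstream code: only pass device_id when
        # MEGATRON_PASS_DEVICE_ID=1 (a made-up env var), so the workaround can be
        # switched on and off without commenting lines out.
        init_process_group_kwargs = {
            'backend': args.distributed_backend,
            'world_size': args.world_size,
            'rank': args.rank,
            'timeout': timedelta(minutes=args.distributed_timeout_minutes),
        }
        if (
            os.getenv('MEGATRON_PASS_DEVICE_ID', '0') == '1'
            and packaging.version.Version(torch.__version__) >= packaging.version.Version("2.3.0")
        ):
            init_process_group_kwargs['device_id'] = device_id
        torch.distributed.init_process_group(**init_process_group_kwargs)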