THUDM / CogCoM


Training hangs after adding the 'crop_and_zoomin' operation #25

Open terryII opened 2 months ago

terryII commented 2 months ago

As shown above, when the dataset contains no 'crop_and_zoomin' operation, training runs normally; but once that operation is added, training hangs at the torch.distributed.broadcast call under mpu.broadcast_data in the broadcast_auto_com function of fintune.py, and then the following output is produced:

```
[rank6]:[E ProcessGroupNCCL.cpp:523] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=3667, OpType=BROADCAST, NumelIn=2, NumelOut=2, Timeout(ms)=600000) ran for 600727 milliseconds before timing out.
[rank7]:[E ProcessGroupNCCL.cpp:523] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=3667, OpType=BROADCAST, NumelIn=2, NumelOut=2, Timeout(ms)=600000) ran for 600727 milliseconds before timing out.
[rank6]:[E ProcessGroupNCCL.cpp:537] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank7]:[E ProcessGroupNCCL.cpp:537] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank6]:[E ProcessGroupNCCL.cpp:543] To avoid data inconsistency, we are taking the entire process down.
[rank7]:[E ProcessGroupNCCL.cpp:543] To avoid data inconsistency, we are taking the entire process down.
[rank6]:[E ProcessGroupNCCL.cpp:1182] [Rank 2] NCCL watchdog thread terminated with exception: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=3667, OpType=BROADCAST, NumelIn=2, NumelOut=2, Timeout(ms)=600000) ran for 600727 milliseconds before timing out.
Exception raised from checkTimeout at /opt/conda/conda-bld/pytorch_1704987288773/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:525 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f0ea77ced87 in /home/lyk/anaconda3/envs/llm/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1e6 (0x7f0ea89934d6 in /home/lyk/anaconda3/envs/llm/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x19d (0x7f0ea8996a2d in /home/lyk/anaconda3/envs/llm/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x119 (0x7f0ea8997629 in /home/lyk/anaconda3/envs/llm/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0xdbbf4 (0x7f0ef424bbf4 in /home/lyk/anaconda3/envs/llm/lib/python3.11/site-packages/torch/lib/../../../.././libstdc++.so.6)
frame #5: + 0x8609 (0x7f0efdccf609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #6: clone + 0x43 (0x7f0efda9a353 in /lib/x86_64-linux-gnu/libc.so.6)

[rank7]:[E ProcessGroupNCCL.cpp:1182] [Rank 3] NCCL watchdog thread terminated with exception: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=3667, OpType=BROADCAST, NumelIn=2, NumelOut=2, Timeout(ms)=600000) ran for 600727 milliseconds before timing out.
Exception raised from checkTimeout at /opt/conda/conda-bld/pytorch_1704987288773/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:525 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fa40e725d87 in /home/lyk/anaconda3/envs/llm/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1e6 (0x7fa40f8ea4d6 in /home/lyk/anaconda3/envs/llm/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x19d (0x7fa40f8eda2d in /home/lyk/anaconda3/envs/llm/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x119 (0x7fa40f8ee629 in /home/lyk/anaconda3/envs/llm/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0xdbbf4 (0x7fa45b1a2bf4 in /home/lyk/anaconda3/envs/llm/lib/python3.11/site-packages/torch/lib/../../../.././libstdc++.so.6)
frame #5: + 0x8609 (0x7fa464c26609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #6: clone + 0x43 (0x7fa4649f1353 in /lib/x86_64-linux-gnu/libc.so.6)

[2024-07-22 14:13:38,511] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 10525
[2024-07-22 14:13:40,488] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 10526
[2024-07-22 14:13:42,465] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 10527
[2024-07-22 14:13:45,510] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 10528
[2024-07-22 14:13:47,367] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 10529
[2024-07-22 14:13:49,382] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 10530
[2024-07-22 14:13:51,357] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 10531
[2024-07-22 14:13:51,365] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 10532
```

How can this be resolved? @qijimrc
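
As an aside on the failure mode (an interpretation of the trace, not something confirmed by the report): torch.distributed.broadcast only completes when every rank in the group issues a matching call, so if the 'crop_and_zoomin' branch causes some ranks to issue an extra broadcast (or skip one) for the cropped sub-image, the collectives no longer line up and the NCCL watchdog kills the job after the 600000 ms timeout seen above. Below is a minimal, self-contained sketch of that kind of mismatch; the script name and two-rank setup are illustrative, not CogCoM code.

```python
# Illustrative only -- not CogCoM code. Shows how a collective mismatch between
# ranks produces exactly this kind of NCCL watchdog timeout.
# Launch with: torchrun --nproc_per_node=2 broadcast_mismatch_demo.py  (hypothetical file name)
import os
import torch
import torch.distributed as dist


def main():
    dist.init_process_group(backend="nccl")            # reads RANK/WORLD_SIZE set by torchrun
    rank = dist.get_rank()
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

    # Matched broadcast: every rank calls it, so it completes normally.
    shape = torch.zeros(2, dtype=torch.long, device="cuda")
    dist.broadcast(shape, src=0)

    # Hypothetical divergence: only rank 0 takes the "crop_and_zoomin" branch and
    # issues one more broadcast, while rank 1 goes straight to the barrier. The
    # collectives no longer match across ranks, the kernels never complete, and the
    # NCCL watchdog tears the processes down after Timeout(ms)=600000, as in the log.
    if rank == 0:
        extra = torch.zeros(2, dtype=torch.long, device="cuda")
        dist.broadcast(extra, src=0)

    dist.barrier()                                      # never completes cleanly on both ranks
    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```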

terryII commented 2 months ago

Moreover, the same problem also occurs with the official CoM dataset. Training hardware is 8x A10 (24 GB), with MP_SIZE=4, torch=2.2.0, cuda=12.1.
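
Not part of the original report, just a debugging sketch: the log above shows the job is started through DeepSpeed's launch.py, so the process group is created inside the training framework rather than in user code, and the settings below are assumptions about where one could hook in. Turning on distributed debug logging and raising the default 600 s watchdog timeout can help show which rank stops issuing the matching broadcast.

```python
# Debugging sketch (assumptions: environment variables can be set before launch, and
# the framework's call to init_process_group can be reached or configured).
import datetime
import os

import torch.distributed as dist

# Verbose collective logging: NCCL prints communicator/kernel info, and with DETAIL
# PyTorch wraps the process group to check that collectives match across ranks,
# which helps locate the rank that diverges.
os.environ["NCCL_DEBUG"] = "INFO"
os.environ["TORCH_DISTRIBUTED_DEBUG"] = "DETAIL"

# Raise the watchdog timeout above the default 600000 ms while investigating, so a
# slow-but-matched collective is not killed prematurely.
dist.init_process_group(backend="nccl", timeout=datetime.timedelta(hours=2))
```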