FoundationVision / VAR

[GPT beats diffusion🔥] [scaling laws in visual generation📈] Official impl. of "Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction". An *ultra-simple, user-friendly yet state-of-the-art* codebase for autoregressive image generation!
MIT License

training with multi-gpu but stuck #47

Closed: Erisura closed this issue 1 month ago

Erisura commented 2 months ago

I tried to run train.py on a single node with multiple GPUs using:

torchrun --standalone --nproc_per_node=4 --nnodes=1 --node_rank=0 train.py --depth=16 --bs=64 --ep=200 --fp16=1 --alng=1e-3 --wpe=0.1

but it fails with:

[E ProcessGroupNCCL.cpp:475] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=18, OpType=BROADCAST, NumelIn=1073741824, NumelOut=1073741824, Timeout(ms)=1800000) ran for 1800517 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:489] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:495] To avoid data inconsistency, we are taking the entire process down.

I tried increasing the 'timeout' parameter as suggested in https://github.com/huggingface/accelerate/issues/314#issuecomment-1260436540, but the problem is not really a timeout: the processes are stuck, all GPUs stay at 100% utilization the whole time, and GPU memory usage differs across ranks (gpu_0's usage is lower).

[screenshot: GPU utilization and memory usage across the four GPUs]

I've never run into this issue before. Maybe someone has already encountered and solved it? Hoping for a reply.
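(Editor's note: for readers who hit the plain slow-collective variant of this error rather than a real hang, the timeout workaround referenced above amounts to passing a longer timeout when the process group is initialized. A minimal sketch, assuming the standard torch.distributed NCCL setup used under torchrun; the two-hour value is only an example:)

```python
# Sketch: raising the NCCL collective timeout at process-group init.
# This only helps when a collective is genuinely slow; it cannot fix a real
# deadlock, where some rank never reaches the collective at all.
from datetime import timedelta
import torch.distributed as dist

dist.init_process_group(
    backend="nccl",
    timeout=timedelta(hours=2),  # default is 30 min (the 1800000 ms seen in the error above)
)
```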

keyu-tian commented 1 month ago

Hi @Erisura, did you hit this when running our latest code?

Erisura commented 1 month ago

No, it was with the original code. I found that the error is caused by a conditional in trainer.py, "if (g_it == 0 or (g_it + 1) % 500 == 0) and self.is_visualizer():", whose body contains an "allreduce" operation. That is a logic error: you cannot call "allreduce" inside a block that only the "visualizer" rank enters, because the other ranks never reach the collective and everyone hangs. I see this has already been corrected in your latest version, so there is no problem anymore. Thank you for your great work!
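(Editor's note: for anyone debugging a similar hang, the pattern described above, a collective call placed inside a branch that only one rank enters, is a classic distributed deadlock. A minimal sketch under assumed names; this simplified `log_step` is illustrative, not VAR's actual trainer code:)

```python
# Sketch of the deadlock pattern and its fix (not VAR's actual trainer code).
import torch
import torch.distributed as dist

def log_step_buggy(g_it: int, loss: torch.Tensor, is_visualizer: bool):
    # BUGGY: all_reduce is a collective, so EVERY rank must call it.
    # Gating it behind a single-rank condition leaves that rank waiting forever
    # while the others never enter the branch -> NCCL watchdog timeout.
    if (g_it == 0 or (g_it + 1) % 500 == 0) and is_visualizer:
        dist.all_reduce(loss)              # only the visualizer rank reaches this -> hang
        print(f"step {g_it}: loss={loss.item():.4f}")

def log_step_fixed(g_it: int, loss: torch.Tensor, is_visualizer: bool):
    # FIX: run the collective on all ranks; gate only the rank-local work
    # (printing/visualization) behind the visualizer check.
    if g_it == 0 or (g_it + 1) % 500 == 0:
        dist.all_reduce(loss)              # every rank participates
        if is_visualizer:
            print(f"step {g_it}: loss={loss.item():.4f}")
```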