Closed Erisura closed 1 month ago
Hi @Erisura, did you encounter this when running our latest code?
No, with the original one. I found that this error is caused by a conditional statement in trainer.py, `if (g_it == 0 or (g_it + 1) % 500 == 0) and self.is_visualizer():`, which is followed by an `allreduce` operation. There is a logic error here: the `allreduce` cannot be called inside a block that only the "visualizer" rank enters, because the other ranks never reach the collective. But I see that this has already been corrected in your latest version, so no problem anymore. Thank you for your great work!
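The deadlock pattern described above can be sketched in plain Python (no actual NCCL involved, and the `is_visualizer` check is modeled after the condition quoted from trainer.py): a collective like `allreduce` must be entered by every rank, so gating it behind a single-rank branch leaves the other ranks waiting until the watchdog kills the job.

```python
# Simulate which ranks would reach the collective under the buggy and the
# fixed control flow. NCCL collectives only complete when ALL ranks call them.
WORLD_SIZE = 4

def ranks_reaching_allreduce(collective_inside_gate: bool) -> int:
    """Count how many of the WORLD_SIZE ranks would call the collective."""
    reached = 0
    for rank in range(WORLD_SIZE):
        is_visualizer = (rank == 0)  # only rank 0 acts as the "visualizer"
        if collective_inside_gate:
            # Buggy pattern: the allreduce sits inside the visualizer-only
            # branch, so ranks 1..3 never enter it and rank 0 blocks forever.
            if is_visualizer:
                reached += 1
        else:
            # Fixed pattern: every rank calls the allreduce; only the
            # visualizer rank does the logging afterwards.
            reached += 1
    return reached

print(ranks_reaching_allreduce(True))   # only 1 of 4 ranks -> watchdog timeout
print(ranks_reaching_allreduce(False))  # all 4 ranks -> collective completes
```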
I tried to run train.py on a single node with multiple GPUs using this command: ![image](https://github.com/FoundationVision/VAR/assets/72057715/42fcab28-c58c-4c3e-8a1f-fa25f87dfc36)
```
torchrun --standalone --nproc_per_node=4 --nnodes=1 --node_rank=0 train.py --depth=16 --bs=64 --ep=200 --fp16=1 --alng=1e-3 --wpe=0.1
```
but I got this error:

```
[E ProcessGroupNCCL.cpp:475] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=18, OpType=BROADCAST, NumelIn=1073741824, NumelOut=1073741824, Timeout(ms)=1800000) ran for 1800517 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:489] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:495] To avoid data inconsistency, we are taking the entire process down.
```
After trying the solution of increasing the `timeout` parameter from https://github.com/huggingface/accelerate/issues/314#issuecomment-1260436540, I found the issue is not actually the timeout: the processes are stuck with all GPUs at 100% utilization the whole time, while GPU memory usage differs across devices (gpu_0's memory usage is lower). I've never run into this issue before; maybe someone has encountered it and already solved it? Hoping for a reply.
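For reference, the watchdog timeout mentioned above can be raised when initializing the process group (a configuration sketch using PyTorch's `torch.distributed` API; as the thread concludes, this only postpones the kill rather than fixing the root cause when a rank-gated collective is the real problem):

```python
from datetime import timedelta

import torch.distributed as dist

# Raise the NCCL collective timeout from the default 30 minutes to 2 hours.
# Note: if one rank never enters the collective (as in the trainer.py bug
# discussed above), the job still hangs -- it just gets killed later.
dist.init_process_group(
    backend="nccl",
    timeout=timedelta(hours=2),
)
```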