JarintotionDin / ZiRaGroundingDINO

Official PyTorch implementation of ZiRa, a method for incremental vision language object detection (IVLOD), which has been accepted by NeurIPS 2024.
https://arxiv.org/abs/2403.01680
Apache License 2.0

Error Running Program on 4 GPUs #4

Open witnessai opened 2 weeks ago

witnessai commented 2 weeks ago

When running the program on 4 GPUs, an error occurs at line 343 of train_multidatasets.py: it gets stuck at results = evaluator.evaluate() inside the inference_on_dataset function. The error message is as follows:

[rank0]:[E1023 05:18:20.595188032 ProcessGroupNCCL.cpp:607] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=102100, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800059 milliseconds before timing out.
[rank0]:[E1023 05:18:20.595800828 ProcessGroupNCCL.cpp:1664] [PG 0 (default_pg) Rank 0] Exception (either an error or timeout) detected by watchdog at work: 102100, last enqueued NCCL work: 102100, last completed NCCL work: 102099.
[rank0]:[E1023 05:18:20.595825114 ProcessGroupNCCL.cpp:1709] [PG 0 (default_pg) Rank 0] Timeout at NCCL work: 102100, last enqueued NCCL work: 102100, last completed NCCL work: 102099.
[rank0]:[E1023 05:18:20.595834763 ProcessGroupNCCL.cpp:621] [Rank 0] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank0]:[E1023 05:18:20.595845633 ProcessGroupNCCL.cpp:627] [Rank 0] To avoid data inconsistency, we are taking the entire process down.
[rank0]:[E1023 05:18:20.597074192 ProcessGroupNCCL.cpp:1515] [PG 0 (default_pg) Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=102100, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800059 milliseconds before timing out.

Running the program on 2 GPUs, however, does not produce this error. Do you know what the cause might be? Is it related to the shell script, which only runs the program on 2 GPUs?
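
As a temporary workaround I was thinking of raising the collective timeout so the evaluation sync gets more than the default 30 minutes. A minimal sketch, assuming the script is started through detectron2's launch helper (which inference_on_dataset and --num-gpus suggest); this only delays the hang, the real fix would be making sure every rank reaches the same collectives during evaluation:

    from datetime import timedelta

    from detectron2.engine import default_argument_parser, launch


    def main(args):
        ...  # build the model and run training / evaluation as in train_multidatasets.py


    if __name__ == "__main__":
        args = default_argument_parser().parse_args()
        launch(
            main,
            args.num_gpus,                   # e.g. 4
            num_machines=args.num_machines,
            machine_rank=args.machine_rank,
            dist_url=args.dist_url,
            args=(args,),
            timeout=timedelta(hours=2),      # default is 30 min, i.e. the 1800000 ms in the log above
        )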

JarintotionDin commented 2 weeks ago

Did you change '--num-gpus' to 4? I don't have 4 GPUs for testing, so I'm not sure yet what happens when running with 4 GPUs.

witnessai commented 2 weeks ago

Yes, I set --num-gpus 4

JarintotionDin commented 2 weeks ago

Did you change the batch size? It should be at least 4.

witnessai commented 2 weeks ago

Yes, the batch size is set to 8.
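
So 8 splits evenly across 4 GPUs, and the batch size itself should satisfy the requirement. A quick illustrative check of the divisibility rule that detectron2-style train loaders typically enforce (the config key in the comment is just for orientation, not taken from the repo):

    # Illustrative sanity check: detectron2-style loaders split the total batch
    # size evenly across ranks, so it must be divisible by the number of GPUs.
    total_batch_size = 8      # e.g. cfg.SOLVER.IMS_PER_BATCH
    num_gpus = 4
    assert total_batch_size % num_gpus == 0, (
        f"total batch size {total_batch_size} is not divisible by {num_gpus} GPUs"
    )
    images_per_gpu = total_batch_size // num_gpus   # 2 images per GPU here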