Open letdivedeep opened 2 months ago
In your train.py,
from datetime import timedelta
dist.init_process_group(backend="nccl" if dist.is_nccl_available() else "gloo", timeout=timedelta(seconds=18000))
Kindly let me know if the above solution works for you. (The default timeout for the NCCL backend is 30 minutes; the code above raises it to 5 hours.)
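For reference, here is roughly where that change would sit, as a minimal sketch assuming the YOLOv5-style DDP setup block that this repo's train.py follows (the LOCAL_RANK handling and surrounding structure are assumptions from that layout, not copied from the repo):

```python
# Sketch of the DDP init block in train.py with the increased NCCL timeout.
import os
from datetime import timedelta

import torch
import torch.distributed as dist

LOCAL_RANK = int(os.getenv('LOCAL_RANK', -1))  # set by torch.distributed.launch / torchrun

if LOCAL_RANK != -1:
    torch.cuda.set_device(LOCAL_RANK)
    # The default NCCL timeout is 30 minutes; 18000 s raises it to 5 hours so a long
    # rank-0 validation pass does not trip the ALLREDUCE watchdog on the other ranks.
    dist.init_process_group(
        backend="nccl" if dist.is_nccl_available() else "gloo",
        timeout=timedelta(seconds=18000),
    )
```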
@optimisticabouthumanity Thanks for your suggested changes; they helped us avoid the timeout error the next time.
But now, when we resume model training from the previously saved checkpoints, the training speed stays the same as before, yet the validation step takes about 10x longer to complete than it did previously (earlier it took only about 9.5 minutes). Do you have any idea why this might happen?
@optimisticabouthumanity @WongKinYiu To add more to this issue: when we try to resume model training with --noval set, the run resumes for 2 epochs and then gets stuck at epoch 3 and doesn't proceed, as seen in the screenshot below.
You can see that when the training stalls, CPU utilization drops drastically.
I also tried using the checkpoint from the previously stopped run as pre-trained weights via --weights rather than --resume, but the issue persists.
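If it would help with debugging, here is a minimal sketch of how more verbose collective logging could be turned on before the hang. NCCL_DEBUG, NCCL_DEBUG_SUBSYS, and TORCH_DISTRIBUTED_DEBUG are standard NCCL/PyTorch environment variables (available in PyTorch 1.10), not repo-specific; placing them at the very top of train.py, before the process group is created, is an assumption about the easiest place to set them:

```python
# Sketch: enable verbose NCCL / torch.distributed logging so the stuck rank and
# collective show up in the worker logs.  These variables must be set before
# dist.init_process_group() is called.
import os

os.environ.setdefault("NCCL_DEBUG", "INFO")                 # NCCL prints ring setup and errors
os.environ.setdefault("NCCL_DEBUG_SUBSYS", "ALL")           # include all NCCL subsystems
os.environ.setdefault("TORCH_DISTRIBUTED_DEBUG", "DETAIL")  # PyTorch reports mismatched/hung collectives
```

The same variables can also simply be exported in the shell before the torch.distributed.launch command.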
I am using the AWS p4d.24xlarge instance for model training. This is my environment:
CUDA version: 11.3
PyTorch version: 1.10.0
Python version: 3.7.11
@WongKinYiu @ws6125 @optimisticabouthumanity Is this resume-from-checkpoint behavior a known issue?
@WongKinYiu @optimisticabouthumanity YOLOv9 DDP model training getting stuck after a certain epoch appears to be a bug in the training code base. Requesting you to please look into this.
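As a purely diagnostic sketch (standard-library faulthandler, not repo code): registering a signal handler near the top of train.py would let us dump every rank's Python stack once training stalls, to see which collective each worker is actually blocked in:

```python
# Diagnostic sketch: dump the Python stack of all threads when the process receives
# SIGUSR1.  After the training hangs, run `kill -USR1 <pid>` against each worker
# process to see where it is blocked (e.g. inside an all_reduce or barrier).
import faulthandler
import signal

faulthandler.register(signal.SIGUSR1, all_threads=True)
```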
Hi @WongKinYiu @ws6125 Thanks for the wonderful work.
When starting model training on a custom dataset in a multi-GPU setup, after a few epochs I get the
WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1801678 milliseconds before timing out.
error. I am using an AWS g5.24xlarge instance (a 4-GPU A10 instance) for this training, with CUDA 11.4 and NCCL version 2.11.4-1. I also tried reducing the batch size and increasing the Docker shared memory, but the issue persists.
Below is the docker command I am using to start the container:
docker run -it --rm --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=0 -p 4446:8888 -p 4445:6006 --name dashcam-yolov9-de -it -v /home/ubuntu/dataset:/yolov9/dataset/ -v /home/ubuntu/saved_models:/yolov9/saved_models/ --shm-size=64g dashcam-aws-yolov9:v1
And then, to run the model training, I am using this command:
python -m torch.distributed.launch --nproc_per_node 4 --master_port 9527 segment/train.py --workers 88 --device 0,1,2,3 --sync-bn --batch 24 --data data/coco_road_signs_segment.yaml --img 1280 --cfg models/segment/gelan-c-seg.yaml --weights gelan-c-seg.pt --name gelan-c-seg --hyp hyp.scratch-high.yaml --epochs 60 --close-mosaic 15 --save-period 4 --label-smoothing 0.08 --noval --optimizer 'AdamW' --project saved_models/yolov9_seg_chk/
Below is a short version of the error stack. I also tried different instance types such as the AWS p4d.24xlarge, which is an 8-GPU (A100) instance, but still get the same error. Any pointers on this would be very helpful.