Open fengxiuyaun opened 5 years ago
@fengxiuyaun Hi,
There is a discussion about it here: https://github.com/AlexeyAB/darknet/issues/1165#issuecomment-421334900
@AlexeyAB Hi, Alexey. I found that only the printed output is wrong (https://github.com/AlexeyAB/darknet/blob/14ed6fcb6e31dd111fc5c35c31ffa6e45fe52737/src/detector.c#L204); it should be:
printf("Iters:%d, loss:%f, avg_loss:%f, rate:%f, time:%lf s, imgs:%d \n", i+(i%4)*(ngpus-1), loss, avg_loss, get_current_rate(net), what_time_is_it_now()-time, (i+(i%4)*(ngpus-1))*net->batch * net->subdivisions);
i+(i%4)*(ngpus-1) is the number of cfg-sized batches, in other words, the iteration count relative to single-GPU training. So max_batches can remain unchanged whether we train on a single GPU or on multiple GPUs, if we want to train on the same number of images.
@fengxiuyaun You should change this line, because get_current_batch(net)
gets the wrong current iteration number for multi-GPU training: https://github.com/AlexeyAB/darknet/blob/14ed6fcb6e31dd111fc5c35c31ffa6e45fe52737/src/detector.c#L133
Read more: https://github.com/AlexeyAB/darknet/issues/1165#issuecomment-421334900
Hi, Alexey. I have some questions about burn_in and max_batches in multi-GPU training. For example: when I train with a single GPU, burn_in = 1000 and max_batches = 20000. Why, when using 4 GPUs, do they become burn_in = 1000*4 and max_batches = 20000*4? I think they should stay the same. Here is my explanation: with a single GPU, batch_size = 64 and max_batches = 20000, so net.seen = 20000 * 64 = 1,280,000 images will be trained. With multiple GPUs (such as 4) and batch_size = 64, the same number of images should be trained, so max_batches should stay at 20000, because one multi-GPU iteration (with 4 GPUs) is equivalent to 4 single-GPU iterations, and net.seen increases by 4 * batch_size.