Open fengxiuyaun opened 5 years ago
@fengxiuyaun Hi,
There is a discussion about it here: https://github.com/AlexeyAB/darknet/issues/1165#issuecomment-421334900
@AlexeyAB Hi, Alexey. I found that only the printed output is wrong (https://github.com/AlexeyAB/darknet/blob/14ed6fcb6e31dd111fc5c35c31ffa6e45fe52737/src/detector.c#L204); it should be:
printf("Iters:%d, loss:%f, avg_loss:%f, rate:%f, time:%lf s, imgs:%d \n", i+(i%4)*(ngpus-1), loss, avg_loss, get_current_rate(net), what_time_is_it_now()-time, (i+(i%4)*(ngpus-1))*net->batch * net->subdivisions);
i+(i%4)*(ngpus-1) is the number of cfg-sized batches, in other words, the iteration count relative to single-GPU training. So max_batches can remain unchanged whether we train on a single GPU or on multiple GPUs, if we want to train on the same number of images.
@fengxiuyaun You should change this line, because get_current_batch(net)
gets the wrong current iteration number for multi-GPU training: https://github.com/AlexeyAB/darknet/blob/14ed6fcb6e31dd111fc5c35c31ffa6e45fe52737/src/detector.c#L133
Read more: https://github.com/AlexeyAB/darknet/issues/1165#issuecomment-421334900
Hi, Alexey. I have some questions about burn_in and max_batches in multi-GPU training. For example: when I train with a single GPU, burn_in = 1000 and max_batches = 20000. Why, when using 4 GPUs, do they become burn_in = 1000*4 and max_batches = 20000*4? I think they should stay the same. Here is my explanation: with a single GPU, batch_size = 64 and max_batches = 20000, so net.seen = 20000 * 64 = 1,280,000 images will be trained. With multiple GPUs (such as 4) and batch_size = 64, the same number of images should be trained, so max_batches should stay at 20000, because one multi-GPU iteration (with 4 GPUs) is equivalent to 4 single-GPU iterations, and net.seen increases by 4 * batch_size.