Open MARMOTatZJU opened 4 years ago
Empirically, I discovered that it is necessary to halve the learning rate in order to train correctly on a 2-GPU machine instead of a 4-GPU machine.
Hereby I provide the training log with 2 GPUs and 0.5x learning rate, whose loss values match the officially released training log.
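For what it's worth, this matches the usual linear scaling rule (LR proportional to the effective total batch size), assuming the effective batch halves with the GPU count. A minimal sketch; the numbers are illustrative, not this repo's actual config:

```python
# Linear scaling rule: LR scales in proportion to the effective total batch,
# which here is assumed to halve when going from 4 GPUs to 2 GPUs.
# All values are illustrative, not taken from this repo's configs.
reference_gpus = 4
reference_lr = 0.004                    # hypothetical 4-GPU SOLVER.BASE_LR
actual_gpus = 2
scaled_lr = reference_lr * actual_gpus / reference_gpus
print(scaled_lr)                        # 0.002, i.e. 0.5x the reference LR
```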
It's weird. I use detectron2's default GPU setting for training. Thank you for your advice!
[UPDATE] @fanq15 Halving the learning rate along with num_gpus reproduced the expected result. Hereby I provide the training logs with 2 GPUs & 0.5x SOLVER.BASE_LR for debugging usage.
fsod_train_log.txt fsod_finetune_train_log.txt fsod_finetune_test_log.txt
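For anyone reproducing this, the override can also be applied when building the config in Python; the config path and LR value below are placeholders, and only merge_from_list/SOLVER.BASE_LR are standard detectron2 (yacs) conventions:

```python
from detectron2.config import get_cfg

# Sketch of the 2-GPU override; path and value are illustrative placeholders.
cfg = get_cfg()
cfg.merge_from_file("configs/fsod/R_50_C4_1x.yaml")  # placeholder config path
cfg.merge_from_list(["SOLVER.BASE_LR", 0.002])       # 0.5x the 4-GPU LR
```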
With 4 GPUs, I reran the default settings in all.sh and the AP is correct (11.27 AP on the 20 VOC categories).
However, when I try another machine equipped with 2 GPUs, loss_cls becomes strange and the AP at the end of training is near 0.
Hereby I provide my training log for debugging. fsod_train_log.txt
Comparing the logs of the 2-GPU machine and the 4-GPU machine, loss_cls diverges before iteration 2999, as can be seen below:
2 GPUs: (loss_cls log excerpt)
4 GPUs: (loss_cls log excerpt)
From your code, I do not see anything related to num-gpus. Could it be that some extra code is needed by Detectron2 when --num-gpus changes?
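If I understand detectron2 correctly, it does not rescale anything automatically when --num-gpus changes unless SOLVER.REFERENCE_WORLD_SIZE is set in the config; in that case DefaultTrainer.auto_scale_workers can adjust the LR, batch size, and schedule. A sketch of that mechanism, with illustrative values:

```python
from detectron2.config import get_cfg
from detectron2.engine import DefaultTrainer

cfg = get_cfg()
# ... merge this repo's config file here ...
# Declare the world size the config's LR/batch/schedule were tuned for:
cfg.SOLVER.REFERENCE_WORLD_SIZE = 4
# Rescale BASE_LR, IMS_PER_BATCH, MAX_ITER, warmup, etc. for 2 GPUs:
cfg = DefaultTrainer.auto_scale_workers(cfg, num_workers=2)
print(cfg.SOLVER.BASE_LR)  # halved relative to the 4-GPU value
```

If the repo's configs don't set REFERENCE_WORLD_SIZE (it defaults to 0, i.e. disabled), that would explain why the LR had to be halved by hand.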