Open MARMOTatZJU opened 4 years ago
Empirically, I discovered that it is necessary to halve the learning rate in order to train correctly on a 2-GPU machine instead of a 4-GPU machine.
Hereby I provide the training log with 2 GPUs and 0.5x learning rate, whose loss values match the officially released training log.
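For what it's worth, this matches the usual linear scaling rule (LR proportional to the effective total batch size), assuming the effective batch halves with the GPU count. A minimal sketch; the numbers are illustrative, not this repo's actual config:

```python
# Linear scaling rule: LR scales in proportion to the effective total batch,
# which here is assumed to halve when going from 4 GPUs to 2 GPUs.
# All values are illustrative, not taken from this repo's configs.
reference_gpus = 4
reference_lr = 0.004                    # hypothetical 4-GPU SOLVER.BASE_LR
actual_gpus = 2
scaled_lr = reference_lr * actual_gpus / reference_gpus
print(scaled_lr)                        # 0.002, i.e. 0.5x the reference LR
```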
It's weird. I use detectron2's default GPU setting for training. Thank you for your advice!
[UPDATE] @fanq15 Halving the learning rate along with num_gpus reproduced the expected result. Hereby I provide the training logs with 2 GPUs & 0.5x SOLVER.BASE_LR for debugging usage.
fsod_train_log.txt fsod_finetune_train_log.txt fsod_finetune_test_log.txt
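For anyone reproducing this, the override can also be applied when building the config in Python; the config path and LR value below are placeholders, and only merge_from_list/SOLVER.BASE_LR are standard detectron2 (yacs) conventions:

```python
from detectron2.config import get_cfg

# Sketch of the 2-GPU override; path and value are illustrative placeholders.
cfg = get_cfg()
cfg.merge_from_file("configs/fsod/R_50_C4_1x.yaml")  # placeholder config path
cfg.merge_from_list(["SOLVER.BASE_LR", 0.002])       # 0.5x the 4-GPU LR
```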
With 4 GPUs, I reran the default settings in all.sh and the AP is correct (11.27 AP on the 20 VOC categories).
However, when I try another machine equipped with 2 GPUs, loss_cls becomes strange and the AP at the end of training is near 0.
Hereby I provide my training log for debugging. fsod_train_log.txt
Comparing the logs of the 2-GPU machine and the 4-GPU machine, loss_cls diverges before iteration 2999, as can be seen below:
2 GPUs: (loss_cls log excerpt)
4 GPUs: (loss_cls log excerpt)
From your code, I do not see anything related to num-gpus. Could it be that some extra code is needed by Detectron2 when --num-gpus changes?
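If I understand detectron2 correctly, it does not rescale anything automatically when --num-gpus changes unless SOLVER.REFERENCE_WORLD_SIZE is set in the config; in that case DefaultTrainer.auto_scale_workers can adjust the LR, batch size, and schedule. A sketch of that mechanism, with illustrative values:

```python
from detectron2.config import get_cfg
from detectron2.engine import DefaultTrainer

cfg = get_cfg()
# ... merge this repo's config file here ...
# Declare the world size the config's LR/batch/schedule were tuned for:
cfg.SOLVER.REFERENCE_WORLD_SIZE = 4
# Rescale BASE_LR, IMS_PER_BATCH, MAX_ITER, warmup, etc. for 2 GPUs:
cfg = DefaultTrainer.auto_scale_workers(cfg, num_workers=2)
print(cfg.SOLVER.BASE_LR)  # halved relative to the 4-GPU value
```

If the repo's configs don't set REFERENCE_WORLD_SIZE (it defaults to 0, i.e. disabled), that would explain why the LR had to be halved by hand.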