AmingWu closed this issue 3 years ago
It should take about 10 hours to pretrain and 18 hours to fine-tune on the MSCOCO dataset, using 8 V100 GPUs. Please refer to Section 4.2 (implementation details) of our paper for more details.
I am using four 2080 Ti GPUs to fine-tune on the MSCOCO dataset, but the code got stuck after 10,000 iterations. What happened? Recent logs are as follows:
The code has been stuck for over 18 hours and has not moved on to the next iteration. Should I keep waiting? By the way, do you train your model for only one epoch?
It's strange. I have no idea what might be wrong, given the information you provided. Did you try it again? Did it freeze again exactly at that step? Did you check the CPU and GPU utilization and memory while it was stuck? The only thing special about 10,000 is that it tests and saves the model at that step. Maybe change the TEST_PERIOD and CHECKPOINT_PERIOD in the config to two different numbers other than 10,000 to see if one of them is triggering the problem.
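To illustrate why two different periods help, here is a minimal sketch. It assumes the training loop fires each action whenever the step is a multiple of its period, which is a guess about how the repo schedules them. The function name and the values 3000 and 7000 are made up for illustration.

```python
def events_at(step, test_period, checkpoint_period):
    """Return which periodic actions would fire at `step`.

    Assumption: each action fires when `step` is a multiple of its period.
    """
    events = []
    if step % test_period == 0:
        events.append("test")
    if step % checkpoint_period == 0:
        events.append("checkpoint")
    return events


# Hypothetical periods: they differ from each other and from 10,000,
# so a freeze at step 3000 points at testing and one at 7000 at checkpointing.
for step in (3000, 7000, 10000):
    print(step, events_at(step, test_period=3000, checkpoint_period=7000))
```

If the run freezes at the first test step but gets past the first checkpoint step (or the other way round), that tells you which of the two is the trigger.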
Hi, I ran into the same problem. You can add 'SKIP_VAL_LOSS: True' to the cfg files to skip the validation process. I guess this is a bug in newer PyTorch versions, though it may be some other problem.
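If it is unclear whether the hang is really in validation, one option is to have Python dump its stack traces periodically with the standard-library faulthandler module. This is only a sketch: the helper names and the 30-minute timeout are my own choices, not something from this repo, and with multi-GPU training each worker process would need to call it.

```python
import faulthandler
import sys


def enable_hang_dump(timeout_s=1800, file=None):
    """Every `timeout_s` seconds, write all threads' Python stacks to `file`.

    Hypothetical helper: call it once near the start of the training script.
    `file` must be a real file object (it needs a file descriptor).
    """
    if file is None:
        file = sys.stderr
    faulthandler.enable(file=file)  # also dump on fatal signals such as SIGSEGV
    faulthandler.dump_traceback_later(timeout_s, repeat=True, file=file)


def disable_hang_dump():
    """Stop the periodic dumps and turn the fault handler off."""
    faulthandler.cancel_dump_traceback_later()
    faulthandler.disable()
```

When the job freezes, the most recent dump shows which Python line each thread is sitting in, for example inside the validation loop or inside the checkpoint save.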
Dear Authors,
How long does this project need to run?