yechenzhi closed this 3 years ago
It's strange. I have no idea what might be wrong, given the information you provided. Did you try it again? Did it freeze again exactly at that step? Did you check the CPU and GPU utilization and memory while it was stuck? The only thing special about 10,000 is that it tests and saves the model at that step. Maybe change the TEST_PERIOD and CHECKPOINT_PERIOD in the config to two different numbers other than 10,000 to see if one of them is triggering the problem.
I tried it several times, and every run got stuck during validation, so I simply set cfg.SOLVER.SKIP_VAL_LOSS to True, and now the fine-tuning process runs normally. So the problem might be caused by a DDP version conflict, since my fine-tuning got stuck at 'synchronize()'.
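For anyone hitting a similar hang, one way to confirm where each process is blocked (for example inside 'synchronize()') is to have Python dump its stack traces periodically. The following is only a minimal sketch using the standard-library faulthandler module; the helper names arm_hang_watchdog and disarm_hang_watchdog are hypothetical and are not part of this repository.

```python
import faulthandler
import sys


def arm_hang_watchdog(timeout_s=600.0, stream=None):
    """Dump the stack of every thread each `timeout_s` seconds until disarmed.

    Hypothetical helper (not from the ovr-cnn code base). `stream` must be a
    real file object with a file descriptor; it defaults to sys.stderr.
    """
    out = stream if stream is not None else sys.stderr
    faulthandler.dump_traceback_later(timeout_s, repeat=True, file=out)


def disarm_hang_watchdog():
    """Stop the periodic dumps, e.g. once training has finished."""
    faulthandler.cancel_dump_traceback_later()


# Assumed usage: call arm_hang_watchdog() once in every rank before the
# training loop starts, and disarm_hang_watchdog() after it returns.
```

If every rank is armed this way, the traces printed while the job is frozen should show which call each process is waiting in, which may help tell a validation-time synchronization problem apart from a data-loading or checkpointing stall.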
I am using four 2080 Ti GPUs to fine-tune on the MSCOCO dataset, but the code got stuck after 10,000 iterations. What happened? Recent logs are as follows:
ext.txt
The code has been stuck for over 18 hours and does not proceed to the next iteration. Should I keep waiting? By the way, do you only train your model for one epoch?
Originally posted by @yechenzhi in https://github.com/alirezazareian/ovr-cnn/issues/1#issuecomment-958642712