Finetuning got stuck after 10000 iterations.

alirezazareian / ovr-cnn

A new framework for open-vocabulary object detection, based on maskrcnn-benchmark

MIT License


Finetuning got stuck after 10000 iterations. #12

Closed yechenzhi closed 3 years ago

yechenzhi commented 3 years ago

I am using four 2080 Ti GPUs to finetune on the MSCOCO dataset, but training got stuck after 10,000 iterations. What happened? Recent logs are as follows:

ext.txt

The code has been stuck for over 18 hours and has not moved on to the next iteration. Should I keep waiting? By the way, do you train your model for only one epoch?

Originally posted by @yechenzhi in https://github.com/alirezazareian/ovr-cnn/issues/1#issuecomment-958642712

yechenzhi commented 3 years ago

log.txt

alirezazareian commented 3 years ago

It's strange. I have no idea what might be wrong, given the information you provided. Did you try it again? Did it freeze again exactly at that step? Did you check the CPU and GPU utilization and memory while it was stuck? The only thing special about 10,000 is that it tests and saves the model at that step. Maybe change the TEST_PERIOD and CHECKPOINT_PERIOD in the config to two different numbers other than 10,000 to see if one of them is triggering the problem.
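To illustrate the suggestion above, here is a minimal sketch of why two different periods help isolate the culprit. It assumes the training loop triggers testing and checkpointing with a simple modulo check on the iteration counter; that trigger logic and the `events_at` helper are assumptions for illustration, not code taken from the repository.

```python
def events_at(iteration, test_period, checkpoint_period):
    """Return the periodic events that would fire at this iteration.

    Assumes (hypothetically) that each event fires whenever the
    iteration is a multiple of its period.
    """
    events = []
    if test_period > 0 and iteration % test_period == 0:
        events.append("test")
    if checkpoint_period > 0 and iteration % checkpoint_period == 0:
        events.append("checkpoint")
    return events


# With both periods at 10,000, a freeze at that step is ambiguous.
print(events_at(10000, test_period=10000, checkpoint_period=10000))

# With distinct periods, the step at which it freezes identifies the cause.
print(events_at(3000, test_period=3000, checkpoint_period=7000))
print(events_at(7000, test_period=3000, checkpoint_period=7000))
```

If the run freezes at iteration 3,000 under these example values, testing is the likely trigger; if it survives until 7,000 and freezes there, checkpointing is.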

yechenzhi commented 3 years ago

I tried it several times, and every run got stuck during validation, so I simply set cfg.SOLVER.SKIP_VAL_LOSS to True, and now the finetuning process runs normally. The problem might be caused by a DDP version conflict, since my finetuning got stuck at 'synchronize()'.
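For anyone hitting a similar freeze, one way to confirm where a process is stuck (for example, inside 'synchronize()') is to have Python dump its stack traces when a step takes too long. The sketch below uses only the standard-library `faulthandler` module; the `run_with_watchdog` helper, the log path, and the timeout are hypothetical choices for illustration and are not part of this repository.

```python
import faulthandler
import os
import tempfile


def run_with_watchdog(fn, log_path, timeout_s=600.0):
    """Call fn(); while it runs, write all thread tracebacks to log_path
    every timeout_s seconds, so a hang leaves a record of where it is stuck."""
    with open(log_path, "w") as log:
        # The log file must stay open until the watchdog is cancelled.
        faulthandler.dump_traceback_later(timeout_s, repeat=True, file=log)
        try:
            return fn()
        finally:
            faulthandler.cancel_dump_traceback_later()


# Usage: wrap the call suspected of hanging (here a stand-in function).
log_path = os.path.join(tempfile.mkdtemp(), "hang_traceback.log")
result = run_with_watchdog(lambda: sum(range(1000)), log_path, timeout_s=30.0)
```

In a multi-GPU run, each process would need its own log file, and the traceback in the file of the stalled process shows which call it is waiting in.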