Training Time - Githubissues

alirezazareian / ovr-cnn

A new framework for open-vocabulary object detection, based on maskrcnn-benchmark

MIT License


Training Time #1

Closed: AmingWu closed this issue 3 years ago

AmingWu commented 3 years ago

Dear Authors,

How long does this project need to run?

alirezazareian commented 3 years ago

It should take about 10 hours to pretrain and 18 hours to fine-tune on the MSCOCO dataset, using 8 V100 GPUs. Please refer to Section 4.2 (implementation details) of our paper for more details.
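As a rough back-of-the-envelope reading of those figures, the two stages add up to about 28 hours of wall-clock time, or about 224 GPU-hours on 8 GPUs. The small sketch below only does that arithmetic; the helper name `gpu_hours` is made up for illustration, and the numbers say nothing about how the time scales on different hardware.

```python
def gpu_hours(wall_clock_hours, num_gpus):
    """Total GPU-hours consumed: wall-clock hours times number of GPUs."""
    return wall_clock_hours * num_gpus


# Figures quoted in the reply above (approximate).
pretrain_hours = 10
finetune_hours = 18
num_gpus = 8

total_wall_clock = pretrain_hours + finetune_hours
print(total_wall_clock)                       # 28
print(gpu_hours(total_wall_clock, num_gpus))  # 224
```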

yechenzhi commented 3 years ago

I am using 4 2080 Ti GPUs to fine-tune on the MSCOCO dataset, but the code got stuck after 10,000 iterations. What happened? The most recent logs are as follows:

ext.txt

The code has been stuck for over 18 hours and does not proceed to the next iteration. Should I keep waiting? By the way, did you train your model for only one epoch?

yechenzhi commented 3 years ago

The full log is as follows: log.txt

alirezazareian commented 3 years ago

It's strange. I have no idea what might be wrong, given the information you provided. Did you try it again? Did it freeze again exactly at that step? Did you check the CPU and GPU utilization and memory while it was stuck? The only thing special about 10,000 is that it tests and saves the model at that step. Maybe change the TEST_PERIOD and CHECKPOINT_PERIOD in the config to two different numbers other than 10,000 to see if one of them is triggering the problem.
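To illustrate why two different periods help, here is a minimal sketch. It assumes the trainer fires each periodic step when `iteration % period == 0`, which is a common pattern but has not been checked against this code base. The function names and the example values 3000 and 5000 are hypothetical.

```python
def events_at(iteration, test_period, checkpoint_period):
    """Return which periodic events would fire at a given iteration.

    Assumes an event fires when iteration % period == 0 (an assumption,
    not verified against the actual trainer).
    """
    events = []
    if test_period > 0 and iteration % test_period == 0:
        events.append("test")
    if checkpoint_period > 0 and iteration % checkpoint_period == 0:
        events.append("checkpoint")
    return events


def first_separate_events(test_period, checkpoint_period, max_iter):
    """Find the first iteration at which each event fires on its own."""
    first = {}
    for it in range(1, max_iter + 1):
        fired = events_at(it, test_period, checkpoint_period)
        if len(fired) == 1 and fired[0] not in first:
            first[fired[0]] = it
    return first


# With both periods at 10,000 the two steps always coincide,
# so a hang at 10,000 cannot be attributed to either one.
print(events_at(10000, 10000, 10000))

# With two different (illustrative) periods, each step first runs alone.
print(first_separate_events(3000, 5000, 20000))
```

If the run then freezes at the first test-only iteration, the evaluation step is the likely trigger; if it freezes at the first checkpoint-only iteration, saving the model is.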

Martin0401 commented 2 years ago

Hi, I ran into the same problem. You can add 'SKIP_VAL_LOSS: True' to the cfg files to skip the validation process. My guess is that this is a bug with newer PyTorch versions, though it could be some other problem.