ashkamath / mdetr

Apache License 2.0

Loss increases during pretraining #35

Closed mmaaz60 closed 3 years ago

mmaaz60 commented 3 years ago

Hi @alcinos, @ashkamath, @nguyeho7,

I hope you are doing well.

I was trying to pretrain MDETR using the provided instructions, and I noticed that the loss started increasing during the 20th epoch: it had decreased to around 39 by the 19th epoch, then jumped to around 77 after the 20th epoch. What could be the reason for this? Note that I am using the EfficientNet-B5 backbone. The log file (log.txt) is attached.

Thanks

log.txt
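For reference, a quick way to inspect the per-epoch training loss in the attached log is sketched below; it assumes MDETR keeps DETR's logging format, where each line of log.txt is a JSON dict containing "epoch" and "train_loss" keys.

```python
import json

# Minimal sketch, assuming DETR-style logging: one JSON dict per epoch in log.txt.
with open("log.txt") as f:
    stats = [json.loads(line) for line in f if line.strip()]

# Print the per-epoch training loss to spot where the divergence starts.
for entry in stats:
    print(entry.get("epoch"), entry.get("train_loss"))
```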

alcinos commented 3 years ago

Hi @mmaaz60, thank you for your interest in MDETR. It looks like your training diverged. Can I ask how many GPUs you used?

mmaaz60 commented 3 years ago

> Hi @mmaaz60, thank you for your interest in MDETR. It looks like your training diverged. Can I ask how many GPUs you used?

Thank You @alcinos,

I used 32 GPUs with a batch size of 2 per GPU.

alcinos commented 3 years ago

Hum, that's quite surprising then. Nothing fishy happened, like the job getting preempted and then restarted? Are you sure you have the correct transformers version? Otherwise, maybe try with a slightly smaller lr?
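As a quick sanity check for the version question, one can compare the installed transformers version against the one listed in the repository's requirements file; a minimal sketch:

```python
import transformers

# Print the installed transformers version so it can be compared with the
# version expected by MDETR (see the repository's requirements file).
print(transformers.__version__)
```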

mmaaz60 commented 3 years ago

Thank you.

> Hum, that's quite surprising then. Nothing fishy happened, like the job getting preempted and then restarted?

Nothing like that happened during training.

> Are you sure you have the correct transformers version?

I am using transformers version 4.5.1.

> Otherwise, maybe try with a slightly smaller lr?

I actually stopped and then resumed the training from the 19th epoch, and it has now reached the 25th epoch and seems to be converging. I am not sure what went wrong previously, as I didn't change anything when resuming.
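For anyone debugging a similar resume, a minimal sketch for checking which epoch a saved checkpoint will restart from is shown below; it assumes the DETR-style checkpoint layout (a dict with "model", "optimizer", "lr_scheduler", and "epoch" entries) and uses a hypothetical checkpoint path.

```python
import torch

# Hypothetical path; adjust to the actual output directory.
ckpt = torch.load("output/checkpoint.pth", map_location="cpu")

# Assumes the DETR-style checkpoint layout inherited by MDETR.
print(sorted(ckpt.keys()))
print("last completed epoch:", ckpt.get("epoch"))
```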