facebookresearch / unbiased-teacher

PyTorch code for ICLR 2021 paper Unbiased Teacher for Semi-Supervised Object Detection
https://arxiv.org/abs/2102.09480
MIT License

Performance reproduction : learning rate step #36

Open JongMokKim opened 3 years ago

JongMokKim commented 3 years ago

Thanks for the great work.

I have been training with your default config file as instructed in the README.md (faster_rcnn_R_50_FPN_sup1_run1.yaml).

I have a question about the learning rate decay.

The total number of training steps is 180k, but the learning rate decay is scheduled at (179990, 179995).

Is that intended, or should I change it to more typical values like (120k, 160k)?

I also couldn't find any information about the learning rate decay schedule in the paper.

ycliu93 commented 3 years ago

Hi @JongMokKim ,

Our model is trained with a constant learning rate (apart from a warm-up phase at the beginning); setting the decay steps to (179990, 179995) is just a trivial way to make the learning rate decay have essentially no effect.
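For concreteness, here is a minimal sketch (plain PyTorch `MultiStepLR`, ignoring the warm-up phase, with an illustrative base LR of 0.01) of why those decay steps leave the learning rate effectively constant: the two 10x drops only touch the last handful of the 180k iterations.

```python
import torch

# Illustrative values: base LR 0.01, 180k iterations, and the
# near-end decay milestones from the config discussed above.
params = [torch.nn.Parameter(torch.zeros(1))]
opt = torch.optim.SGD(params, lr=0.01)
sched = torch.optim.lr_scheduler.MultiStepLR(opt, milestones=[179990, 179995], gamma=0.1)

for it in range(180_000):
    opt.step()
    sched.step()

# The LR stays at 0.01 until iteration 179990, then drops to 1e-3 and 1e-4
# for only the final ~10 iterations, i.e. the schedule is effectively constant.
print(sched.get_last_lr()[0])  # ~1e-4 only at the very end
```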

We did actually try learning rate decay at (120k, 160k), and we found that it stops the teacher model from improving further.

If you find anything interesting with a new learning rate schedule, feel free to share your findings!

JongMokKim commented 3 years ago

Thank you for your clarification. I totally understand.

Divadi commented 3 years ago

@ycliu93 I had a follow-up question. Did you perchance find that teacher model performance declined when decreasing the learning rate, or did it just stagnate?

I believe I am encountering this situation, but I am unable to find much reference for the issue, since other works all decay the learning rate (step or cosine).

ycliu93 commented 3 years ago

Hi @Divadi ,

I'm happy to share more of my experience.

First finding: the teacher declines slightly, but not by very much. Note that the student can still improve under learning rate decay.

Second finding: one potential reason for the degradation is that the EMA keep rate of the teacher model is too high (0.9996). When I decreased the EMA rate to 0.99 and used learning rate decay at 120k and 160k, the teacher declined a bit right after each learning rate drop, but after a few hundred iterations it improved further (as the student keeps performing better).
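For reference, the EMA update being discussed is essentially the following (a minimal sketch, not the repo's exact code; the function name is illustrative, and `keep_rate` corresponds to the 0.9996 / 0.99 values above):

```python
import torch

@torch.no_grad()
def ema_update(teacher: torch.nn.Module, student: torch.nn.Module, keep_rate: float = 0.9996):
    """teacher <- keep_rate * teacher + (1 - keep_rate) * student, parameter-wise."""
    for t_param, s_param in zip(teacher.parameters(), student.parameters()):
        t_param.mul_(keep_rate).add_(s_param, alpha=1.0 - keep_rate)
```

With keep_rate = 0.9996 the teacher averages over roughly the last 2500 student states, so it reacts very slowly to the post-decay student; with 0.99 it tracks the student within about a hundred iterations, which matches the "declines a bit, then recovers" behaviour described above.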

Third finding: I also tried an EMA-rate scheduler (analogous to the learning rate decay scheduler), but it did not improve results either.

These are only empirical findings; I tried to find references, but it seems no one has discussed this before.
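As a thought experiment, the "EMA scheduler" mentioned above could look something like the sketch below: changing the keep rate at the same milestones as the learning rate decay. The milestone values and rates here are purely illustrative, not the settings used in the paper.

```python
def scheduled_keep_rate(iteration: int) -> float:
    # Purely illustrative: lower the EMA keep rate when the LR decays,
    # so the teacher can follow the (now smaller) student updates more quickly.
    if iteration < 120_000:
        return 0.9996
    elif iteration < 160_000:
        return 0.999
    return 0.99

# Inside the training loop one would then call, e.g.:
#   ema_update(teacher, student, keep_rate=scheduled_keep_rate(it))
```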

ycliu93 commented 3 years ago

I reopened this issue to start a discussion about the learning rate scheduler. If anyone figures out a better way to do learning rate scheduling, feel free to share your experience. 👍

Divadi commented 3 years ago

Thank you for sharing your experience! Do you have any intuition for why the teacher performance might degrade even as student performance improves after LR decay?

I will try different EMA decay rates, although I am working with a different dataset and scheduler.

hachreak commented 3 years ago

> Do you have any intuition for why the teacher performance might degrade even as student performance improves after LR decay?

Actually, it sometimes happens that the student has a great improvement while the teacher only has a small one, or vice versa. EMA sometimes shows strange behaviour that is not really "linear". Has anybody else had the same experience?