CuriousAI / mean-teacher

A state-of-the-art semi-supervised method for image recognition
https://arxiv.org/abs/1703.01780

About EMA #28

Open YilinLiu97 opened 5 years ago

YilinLiu97 commented 5 years ago

Hi, I found that the teacher model's weights don't seem to be updated: the teacher performs as badly as it did when first initialized.

```python
alpha = min(1 - 1 / (global_step + 1), alpha)
for ema_param, param in zip(ema_model.parameters(), model.parameters()):
    ema_param.data.mul_(alpha).add_(1 - alpha, param.data)
```

Shouldn't this be `ema_param.data.mul_(alpha).add_((1 - alpha) * param.data)`?

Here are the parameters printed out during training:

```
('teacher_p: ', Parameter containing:
tensor([ 0.0007, -0.0006,  0.0046, -0.0033,  0.0004,  0.0262,  0.0153, -0.0259,
        -0.0115, -0.0015, -0.0117, -0.0060,  0.0161,  0.0104,  0.0080, -0.0015,
        -0.0116, -0.0160,  0.0247, -0.0227,  0.0077,  0.0052,  0.0217,  0.0111,
        -0.0036, -0.0176, -0.0188,  0.0026, -0.0163,  0.0155], device='cuda:0'))
('student_p: ', Parameter containing:
tensor([-0.0322, -0.0153,  0.0206, -0.0212, -0.0274,  0.0293,  0.0225, -0.0279,
        -0.0272, -0.0282, -0.0272, -0.0261,  0.0275,  0.0261,  0.0274, -0.0251,
         0.0014, -0.0285,  0.0296, -0.0296,  0.0105, -0.0209,  0.0123,  0.0227,
        -0.0162, -0.0081, -0.0079, -0.0233, -0.0145,  0.0030], device='cuda:0',
       requires_grad=True))
('(after) teacher_p: ', Parameter containing:
tensor([ 0.0007, -0.0006,  0.0046, -0.0033,  0.0004,  0.0262,  0.0153, -0.0259,
        -0.0115, -0.0016, -0.0117, -0.0060,  0.0161,  0.0104,  0.0080, -0.0015,
        -0.0116, -0.0160,  0.0247, -0.0227,  0.0077,  0.0052,  0.0217,  0.0111,
        -0.0036, -0.0176, -0.0187,  0.0026, -0.0163,  0.0155], device='cuda:0'))
```

YilinLiu97 commented 5 years ago

Is the implementation wrong?

tarvaina commented 5 years ago

Those two add_ lines are equivalent, aren’t they?

https://pytorch.org/docs/stable/torch.html#torch.add
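The equivalence can be checked without PyTorch: the two-argument form `add_(scalar, tensor)` computes `self += scalar * tensor`, which is exactly what `add_((1 - alpha) * param.data)` computes with one extra temporary. A minimal pure-Python sketch of the update (function and variable names here are illustrative, not from the repo):

```python
def ema_update(teacher, student, alpha):
    """One EMA step: teacher <- alpha * teacher + (1 - alpha) * student.

    Mirrors ema_param.data.mul_(alpha).add_(1 - alpha, param.data):
    the two-argument add_(scalar, tensor) means self += scalar * tensor,
    so it matches add_((1 - alpha) * param.data) element for element.
    """
    return [alpha * t + (1 - alpha) * s for t, s in zip(teacher, student)]

# With alpha close to 1 the teacher moves only slightly toward the student,
# which is why the printed teacher weights barely change between steps.
teacher = [0.0007, -0.0006, 0.0046]
student = [-0.0322, -0.0153, 0.0206]
print(ema_update(teacher, student, 0.999))
```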

XiaoYunZhou27 commented 3 years ago

I realise that alpha is 0 at the beginning, since `alpha = min(1 - 1 / (global_step + 1), 0.9)`, hence no update on the teacher at the beginning. The code differs from what is stated in the paper. A coding consistent with the paper would be `alpha = max(1 - 1 / (global_step + 1), 0.9)`.