Closed by sudhakaranjain 1 year ago
Hi @sudhakaranjain, I'm not sure I understand your issue, but according to the official implementation, the EMA model must be a deep copy of the student: https://github.com/facebookresearch/fairseq/blob/16538a0bff1b9f32e89aa915f2e8b57193f33109/examples/data2vec/models/data2vec_text.py#L346 https://github.com/facebookresearch/fairseq/blob/16538a0bff1b9f32e89aa915f2e8b57193f33109/fairseq/modules/ema_module.py#L41
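To make the point concrete, here is a minimal sketch of the idea in plain Python. It is not the fairseq code: `TinyModel`, `make_ema_teacher`, `ema_step`, and the decay value of 0.5 are hypothetical names and numbers chosen only for illustration. It assumes the usual EMA rule, teacher = decay * teacher + (1 - decay) * student, and shows the teacher starting as a deep copy of the student and then drifting toward it.

```python
import copy


class TinyModel:
    """Hypothetical stand-in for a network: parameters kept as a dict of floats."""

    def __init__(self, params):
        self.params = dict(params)


def make_ema_teacher(student):
    # Deep copy: the teacher gets the same architecture AND the same weights,
    # but as an independent object, so later student updates do not alias it.
    return copy.deepcopy(student)


def ema_step(teacher, student, decay):
    # teacher <- decay * teacher + (1 - decay) * student, parameter by parameter.
    for name, value in student.params.items():
        teacher.params[name] = decay * teacher.params[name] + (1.0 - decay) * value


student = TinyModel({"w": 1.0, "b": -2.0})
teacher = make_ema_teacher(student)
identical_at_init = teacher.params == student.params  # same weights at start

student.params["w"] = 3.0  # pretend an optimizer step changed the student
ema_step(teacher, student, decay=0.5)  # illustrative decay, not a recommended value
```

With a deep copy, the teacher and student agree exactly at step 0, and the teacher only moves when the student does. With a randomly initialized teacher, the first targets would come from weights unrelated to the student.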
Sorry for the confusion. You are right!
According to the paper, the EMA teacher model is initialized randomly with the same architecture as the student model. So deep-copying the student model to create the teacher model should be avoided, as it copies the weight parameters as well.