facebookresearch / dinov2

PyTorch code and models for the DINOv2 self-supervised learning method.
Apache License 2.0

Some Questions about model distillation #114

Open · MMY1994 opened this issue 1 year ago

MMY1994 commented 1 year ago

Thanks for your work! I have some questions about model distillation. The paper says: "we leverage the same training loop with a few exceptions: we use a larger model as a frozen teacher, keep a spare EMA of the student that we use as our final model, remove the masking and stochastic depth, and apply the iBOT loss on the two global crops."

  1. I can only get the ViT-g backbone pretrained model. Does the "frozen teacher" also include the "dino head" and "ibot head"?
  2. What does "keep a spare EMA of the student" mean? Are the student model parameters updated with EMA? The student and teacher are not the same model.
usryokousha commented 1 year ago
usryokousha commented 1 year ago
  1. If you look at the default training config for ViT-G, a separate head is used for iBOT (two heads: a DINO head and an iBOT head). The frozen teacher should include both of these frozen heads, since this is distillation and you want the joint embedding to stay unchanged.
  2. "Keep a spare EMA of the student" essentially means creating a copy of the student and updating it by exponential moving average at a certain frequency. This can be updated in the same way that the teacher is updated in the training code dinov2/train/ssl_meta_arch.py (see the sketch after the note below).

Note: The distillation code was not included in this repository. You cannot use ssl_meta_arch.py to do distillation as is. You would need to modify it to include the student EMA, and to load different models for teacher and student. You would also need to create a method to update the EMA similar to the way the frozen teacher is updated (this deals with collecting all the gradients in FSDP). I have created a fork of this repository with some distillation code which can be found here
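For illustration, here is a minimal sketch of what "keep a spare EMA of the student" could look like. The function name update_student_ema, the momentum value, and the toy nn.Linear module are assumptions made for the example, not code from this repository; the real FSDP training loop has to consolidate sharded parameters before updating, as noted above.

```python
import copy
import torch
from torch import nn

@torch.no_grad()
def update_student_ema(student: nn.Module, student_ema: nn.Module, momentum: float = 0.999) -> None:
    """EMA update: ema_param = momentum * ema_param + (1 - momentum) * student_param.

    Illustrative only: the repository trains with FSDP, so parameters are sharded
    and the actual update must gather/consolidate them first (see how the teacher
    EMA is handled in dinov2/train/ssl_meta_arch.py).
    """
    for p_ema, p_student in zip(student_ema.parameters(), student.parameters()):
        p_ema.mul_(momentum).add_(p_student, alpha=1.0 - momentum)

# Hypothetical usage: copy the student once at the start of training,
# then call update_student_ema(...) every iteration (or every N iterations).
student = nn.Linear(8, 8)             # stand-in for the student backbone + heads
student_ema = copy.deepcopy(student)  # the "spare EMA of the student"
update_student_ema(student, student_ema)
```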

leng-yue commented 1 year ago

Thanks for your great work! Did you reproduce their distillation result?

nemonameless commented 1 year ago

@usryokousha Thanks for your great work! Have you reproduced their distillation result?

MarioAvolio commented 7 months ago

@usryokousha your code runs into an error when using copy.deepcopy on an nn.ModuleDict of PyTorch models:

```
(<class 'RuntimeError'>, RuntimeError('Only Tensors created explicitly by the user (graph leaves) support the deepcopy protocol at the moment. If you were attempting to deepcopy a module, this may be because of a torch.nn.utils.weight_norm usage, see https://github.com/pytorch/pytorch/pull/103001'), <traceback object at 0x7f11a4044480>)
```

```python
self.student = nn.ModuleDict(student_model_dict)
self.teacher = nn.ModuleDict(teacher_model_dict)
self.student_shadow = copy.deepcopy(self.student)  # This line causes the error
```

How can we fix this?

ChenweiLyu commented 2 months ago

As for the bug RuntimeError: "Only Tensors created explicitly by the user (graph leaves) support the deepcopy protocol at the moment":

I referred to this link: https://github.com/pytorch/pytorch/issues/102981

I changed dino_head.py: from torch.nn.utils import weight_norm -> from torch.nn.utils.parametrizations import weight_norm

This requires torch version >= 2.1.0. In addition, I commented out the line self.last_layer.weight_g.data.fill_(1).
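To make the fix concrete, here is a small self-contained sketch of the change; the layer dimensions are illustrative, not the actual DINO head sizes, and the surrounding code in dino_head.py may differ.

```python
import copy
from torch import nn

# Old import, which makes copy.deepcopy fail (see pytorch/pytorch#102981):
# from torch.nn.utils import weight_norm
# New parametrization-based import, available in torch >= 2.1.0:
from torch.nn.utils.parametrizations import weight_norm

# Build the last layer the same way dino_head.py does (illustrative dimensions).
last_layer = weight_norm(nn.Linear(256, 1024, bias=False))

# With the parametrization API there is no `weight_g` attribute anymore
# (the magnitude tensor lives under last_layer.parametrizations.weight.original0),
# which is why the original `self.last_layer.weight_g.data.fill_(1)` line is
# commented out rather than ported.

copied = copy.deepcopy(last_layer)  # no longer raises the RuntimeError
print(copied.weight.shape)          # torch.Size([1024, 256])
```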