facebookresearch / dinov2

PyTorch code and models for the DINOv2 self-supervised learning method.
Apache License 2.0

Some Questions about model distillation #114

Open · MMY1994 opened this issue 1 year ago

MMY1994 commented 1 year ago

Thanks for your work! I have some questions about model distillation. The paper says: "we leverage the same training loop with a few exceptions: we use a larger model as a frozen teacher, keep a spare EMA of the student that we use as our final model, remove the masking and stochastic depth, and apply the iBOT loss on the two global crops."

  1. I can only get the ViT-g backbone pretrained model. Does the "frozen teacher" also include the "dino head" and "ibot head"?
  2. What does "keep a spare EMA of the student" mean? Are the student model parameters updated with EMA? The student and teacher are not the same model.
usryokousha commented 1 year ago
usryokousha commented 1 year ago
  1. If you look at the default training config for ViT-G, a separate head is used for iBOT (two heads: a DINO head and an iBOT head). The frozen teacher should include both of these frozen heads, since this is distillation and you want the joint embedding to stay unchanged.
  2. "Keep a spare EMA of the student" essentially means creating a copy of the student and updating it by exponential moving average at a certain frequency. This can be updated in the same way that the teacher is updated in the training code dinov2/train/ssl_meta_arch.py (see the sketch after the note below).

Note: The distillation code was not included in this repository. You cannot use ssl_meta_arch.py to do distillation as is. You would need to modify it to include the student EMA, and to load different models for teacher and student. You would also need to create a method to update the EMA similar to the way the frozen teacher is updated (this deals with collecting all the gradients in FSDP). I have created a fork of this repository with some distillation code which can be found here
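For illustration, here is a minimal sketch of what "keep a spare EMA of the student" could look like. The function name update_student_ema, the momentum value, and the toy nn.Linear module are assumptions made for the example, not code from this repository; the real FSDP training loop has to consolidate sharded parameters before updating, as noted above.

```python
import copy
import torch
from torch import nn

@torch.no_grad()
def update_student_ema(student: nn.Module, student_ema: nn.Module, momentum: float = 0.999) -> None:
    """EMA update: ema_param = momentum * ema_param + (1 - momentum) * student_param.

    Illustrative only: the repository trains with FSDP, so parameters are sharded
    and the actual update must gather/consolidate them first (see how the teacher
    EMA is handled in dinov2/train/ssl_meta_arch.py).
    """
    for p_ema, p_student in zip(student_ema.parameters(), student.parameters()):
        p_ema.mul_(momentum).add_(p_student, alpha=1.0 - momentum)

# Hypothetical usage: copy the student once at the start of training,
# then call update_student_ema(...) every iteration (or every N iterations).
student = nn.Linear(8, 8)             # stand-in for the student backbone + heads
student_ema = copy.deepcopy(student)  # the "spare EMA of the student"
update_student_ema(student, student_ema)
```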

leng-yue commented 1 year ago

Thanks for your great work! Did you reproduce their distillation result?

nemonameless commented 1 year ago

@usryokousha Thanks for your great work! Have you reproduced their distillation result?

MarioAvolio commented 7 months ago

@usryokousha your code runs into an error when using copy.deepcopy on an nn.ModuleDict of PyTorch models:

```
(<class 'RuntimeError'>, RuntimeError('Only Tensors created explicitly by the user (graph leaves) support the deepcopy protocol at the moment. If you were attempting to deepcopy a module, this may be because of a torch.nn.utils.weight_norm usage, see https://github.com/pytorch/pytorch/pull/103001'), <traceback object at 0x7f11a4044480>)
```

```python
self.student = nn.ModuleDict(student_model_dict)
self.teacher = nn.ModuleDict(teacher_model_dict)
self.student_shadow = copy.deepcopy(self.student)  # This line causes the error
```

How can we fix this?

ChenweiLyu commented 2 months ago

As for the bug RuntimeError: "Only Tensors created explicitly by the user (graph leaves) support the deepcopy protocol at the moment":

I referred to this link: https://github.com/pytorch/pytorch/issues/102981

I changed dino_head.py: from torch.nn.utils import weight_norm -> from torch.nn.utils.parametrizations import weight_norm

This requires torch version >= 2.1.0. In addition, I commented out the line self.last_layer.weight_g.data.fill_(1).
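To make the fix concrete, here is a small self-contained sketch of the change; the layer dimensions are illustrative, not the actual DINO head sizes, and the surrounding code in dino_head.py may differ.

```python
import copy
from torch import nn

# Old import, which makes copy.deepcopy fail (see pytorch/pytorch#102981):
# from torch.nn.utils import weight_norm
# New parametrization-based import, available in torch >= 2.1.0:
from torch.nn.utils.parametrizations import weight_norm

# Build the last layer the same way dino_head.py does (illustrative dimensions).
last_layer = weight_norm(nn.Linear(256, 1024, bias=False))

# With the parametrization API there is no `weight_g` attribute anymore
# (the magnitude tensor lives under last_layer.parametrizations.weight.original0),
# which is why the original `self.last_layer.weight_g.data.fill_(1)` line is
# commented out rather than ported.

copied = copy.deepcopy(last_layer)  # no longer raises the RuntimeError
print(copied.weight.shape)          # torch.Size([1024, 256])
```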