facebookresearch / dinov2

PyTorch code and models for the DINOv2 self-supervised learning method.
Apache License 2.0

Clarifications on High-Resolution Adaptation #211

Open JihwanEom opened 1 year ago

JihwanEom commented 1 year ago

Hello,

I'd like to ask for clarification regarding the high-resolution adaptation described in the paper. As per Section 4 and Appendix B.2, the model was trained at a higher resolution (from 224 to 518) for 10k iterations. However, I couldn't find the related code in this repository.

  1. Code Availability: Is the high-resolution adaptation code not included in this repository's release?
  2. Details on "compressed to fit": Could you share the details of what "compressed to fit" means? (This may also answer the third question.)
  3. Batch Size & Learning Rate: It would be very helpful if you could provide the batch size and learning rate used during this high-resolution adaptation phase.

Thank you in advance!

usryokousha commented 1 year ago

I think you can just use the existing code as is. Note that the crop size for the ViT-g variant is 518, which means the position embedding is interpolated during training. You can just change the crop size in the configuration from 224 to 518 for the last 10k iterations, as sketched below. The configs included in this repository are for ImageNet-1k and ImageNet-22k, not for the larger internal dataset used in the paper. The schedulers will automatically compress the exponential decay / increase as a function of the number of steps.
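
For example, a minimal sketch of such an override, assuming the OmegaConf-based configs shipped under dinov2/configs/train (the file path and the key names `crops.global_crops_size`, `train.OFFICIAL_EPOCH_LENGTH`, and `optim.epochs` are taken from the released configs and may differ in other setups):

```python
from omegaconf import OmegaConf

# Load one of the released training configs (path as in this repo).
cfg = OmegaConf.load("dinov2/configs/train/vitg14.yaml")

# Raise the global crop size to 518 for the adaptation phase.
cfg.crops.global_crops_size = 518

# Shorten the run to ~10k iterations: with the default
# OFFICIAL_EPOCH_LENGTH of 1250, that is 8 "official" epochs; the
# schedulers then compress their decay/warmup to fit this length.
cfg.train.OFFICIAL_EPOCH_LENGTH = 1250
cfg.optim.epochs = 8

print(OmegaConf.to_yaml(cfg))
```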

zshn25 commented 9 months ago

Thank you @usryokousha for the clarification. Could you please point to where the interpolation of the position embedding happens? I cannot use the 224 model with a 518 input or vice versa.

I get the same error as #316 when I try to do that.

usryokousha commented 9 months ago

I am quite sure the distillation-based models are all trained with a base resolution of 224 × 224, i.e. a context length of (224/14)² + 1 = 257 tokens at patch size 14. Only the ViT-Giant model is trained at 518 × 518 ((518/14)² + 1 = 1370 tokens). You should encounter a weight mismatch when loading the other distillation-based models with a high-resolution input (assuming you are referring to fine-tuning). For pre-training you shouldn't have a problem. The interpolation of the position embedding is here: https://github.com/facebookresearch/dinov2/blob/2302b6bf46953431b969155307b9bed152754069/dinov2/models/vision_transformer.py#L179
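
For the weight-mismatch case (loading a 224-trained checkpoint at 518), one workaround is to resample the stored position embedding once, offline, the same way `interpolate_pos_encoding` does at runtime. A minimal sketch, assuming a state dict with a `pos_embed` entry of shape (1, 1 + N, dim) with the class token first; `resize_pos_embed` and the checkpoint path are hypothetical, not part of this repo:

```python
import torch
import torch.nn.functional as F

def resize_pos_embed(pos_embed, old_res=224, new_res=518, patch_size=14):
    # pos_embed: (1, 1 + old_grid**2, dim) -- class token first,
    # then one embedding per 14x14 patch.
    cls_tok, patch_tok = pos_embed[:, :1], pos_embed[:, 1:]
    old_grid, new_grid = old_res // patch_size, new_res // patch_size
    dim = patch_tok.shape[-1]
    # Reshape to a 2D grid and bicubic-resample, as done at runtime.
    patch_tok = patch_tok.reshape(1, old_grid, old_grid, dim).permute(0, 3, 1, 2)
    patch_tok = F.interpolate(patch_tok, size=(new_grid, new_grid),
                              mode="bicubic", antialias=True)
    patch_tok = patch_tok.permute(0, 2, 3, 1).reshape(1, new_grid * new_grid, dim)
    return torch.cat([cls_tok, patch_tok], dim=1)

# e.g. patch a 224-trained checkpoint before loading it at 518:
state = torch.load("checkpoint.pth", map_location="cpu")
state["pos_embed"] = resize_pos_embed(state["pos_embed"])  # (1, 257, d) -> (1, 1370, d)
```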

anadodik commented 8 months ago

+1 I would also like clarification on this if possible :)