JihwanEom opened 1 year ago
I think you can just use the existing code as is. Note that the crop size for the ViT-G variant is 518, which means the position embedding is interpolated during training. You can simply change the crop size in the configuration from 224 to 518 for the last 10k iterations. The configs included in this repository are for ImageNet-1k and ImageNet-22k, not for the larger internal dataset used in the paper. The schedulers will automatically compress or stretch the exponential decay/increase as a function of the total number of steps.
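For concreteness, the change described above might look roughly like the fragment below. This is a sketch, not an official recipe: the key names follow `dinov2/configs/ssl_default_config.yaml`, and the local crop size of 98 is my assumption (chosen to stay divisible by the patch size 14) — verify both against your checkout.

```yaml
# Hypothetical high-resolution adaptation overrides (verify key names
# against dinov2/configs/ssl_default_config.yaml in your checkout).
crops:
  global_crops_size: 518  # was 224; triggers position-embedding interpolation
  local_crops_size: 98    # was 96; assumption: 98 = 7 x 14 keeps it patch-aligned
```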
Thank you @usryokousha for the clarification. Could you please point to where the interpolation of the position embedding happens? I cannot use the 224 model with a 518 input or vice versa.
I get the same error as #316 when I try to do that.
I am quite sure the distillation-based models are all trained with a base context length of (224 x 224) + 1. It is only the ViT-Giant model that is trained with (518 x 518) + 1. You should encounter a weight mismatch when loading the other distillation-based models with a high-resolution input (assuming you are referring to fine-tuning). For pre-training you shouldn't have a problem. The interpolation of the position embedding is here: https://github.com/facebookresearch/dinov2/blob/2302b6bf46953431b969155307b9bed152754069/dinov2/models/vision_transformer.py#L179
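To illustrate what that linked code does, here is a minimal, self-contained sketch of the standard ViT position-embedding interpolation trick — not the repository's exact implementation. It assumes patch size 14, embedding dim 384, and a single [CLS] token (as in ViT-S/14); the function name `interpolate_pos_embed` is my own.

```python
import math
import torch
import torch.nn.functional as F

def interpolate_pos_embed(pos_embed: torch.Tensor, new_img_size: int,
                          patch_size: int = 14) -> torch.Tensor:
    """Resize patch position embeddings to a new input resolution.

    pos_embed: (1, 1 + N, D) -- [CLS] embedding followed by N patch embeddings.
    """
    cls_pos, patch_pos = pos_embed[:, :1], pos_embed[:, 1:]
    dim = pos_embed.shape[-1]
    old_grid = int(math.sqrt(patch_pos.shape[1]))  # e.g. 224 // 14 = 16
    new_grid = new_img_size // patch_size          # e.g. 518 // 14 = 37

    # (1, N, D) -> (1, D, old_grid, old_grid) so bicubic resizing applies
    patch_pos = patch_pos.reshape(1, old_grid, old_grid, dim).permute(0, 3, 1, 2)
    patch_pos = F.interpolate(patch_pos, size=(new_grid, new_grid), mode="bicubic")
    # back to (1, new_grid * new_grid, D), then re-attach the [CLS] embedding
    patch_pos = patch_pos.permute(0, 2, 3, 1).reshape(1, new_grid * new_grid, dim)
    return torch.cat([cls_pos, patch_pos], dim=1)

pos = torch.randn(1, 1 + 16 * 16, 384)    # 224px model: 16x16 grid + [CLS]
resized = interpolate_pos_embed(pos, 518)  # 518px model: 37x37 grid + [CLS]
print(resized.shape)                       # torch.Size([1, 1370, 384])
```

This is why a 224-trained checkpoint cannot be loaded directly at 518 resolution without such a resize step: the stored embedding has 257 rows but the model expects 1370.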
+1 I would also like clarification on this if possible :)
Hello,
I would like some clarification regarding the high-resolution adaptation described in the paper. Section 4 and Appendix B.2 mention that the model was trained at a higher resolution (from 224 to 518) for 10k iterations, but I could not find the corresponding code in this repository.
Section 4 states:
Appendix B.2 mentions:
Thank you in advance!