facebookresearch / dino

PyTorch code for Vision Transformers training with the Self-Supervised learning method DINO
Apache License 2.0
6.36k stars 907 forks

custom data normalisation for non-imagenet datasets #59

Open psteinb opened 3 years ago

psteinb commented 3 years ago

Hi, thanks to the authors of this paper and this code for making the effort to share their work with the community.

I am trying to use DINO on a non-ImageNet dataset and have started to alter the code accordingly. For details, see main_dino.py and visualize_attention.py in my fork. I am basically trying to remove any hard-coded ImageNet-specific magic numbers (where possible).

Drop me a :+1: if you like or need this work. If the feedback is in line with #1, I can send a PR if time permits. Other feedback on this is always welcome - feel free to send PRs to my fork.

woctezuma commented 3 years ago

For the rest of us, these are the differences: https://github.com/facebookresearch/dino/commit/afc323572b372e72bcb574c549013689f7b6a6b3 (and similarly https://github.com/facebookresearch/dino/commit/0a3b1b823a4a2f397e8bcc87ffdd6687df634f61)

mathildecaron31 commented 3 years ago

Hi @psteinb, do feel free to send a PR !

amandalucasp commented 2 years ago

Hi! Thanks for sharing your code. In order to fine-tune a pretrained model (as mentioned in https://github.com/facebookresearch/dino/issues/80), do you think it makes sense to change these values as well?

psteinb commented 2 years ago

@amandalucasp I'm not sure what you mean. Do you want to take a model pretrained on task A and run DINO on task B?

amandalucasp commented 2 years ago

I am trying to use one of the provided pretrained models and fine-tune it on a custom dataset. My question is whether the modifications you pointed out make sense for fine-tuning (considering the pretrained model I'm using was initially trained on ImageNet), or whether I should think about changing these values only when training from scratch.

psteinb commented 2 years ago

Ok, that makes sense. In that case, my PR #63 would definitely help. If task A was classification on ImageNet, training used the standard z-score normalisation per colour channel (with ImageNet's mean and standard deviation). But if you'd like to run DINO for task B, i.e. on a different dataset, then a normalisation fitted to that task B dataset is essential. Otherwise, the incoming images may be treated in an ill-posed fashion simply because they are not normalised correctly.
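For illustration, computing per-channel statistics for a custom dataset (the values you would then pass to transforms.Normalize instead of the hard-coded ImageNet mean/std) might look like the following sketch. It uses Welford's online algorithm in pure Python so it stays self-contained; in practice you would feed it pixels from a DataLoader over your task B dataset (the loop below over three hypothetical RGB pixels is just a stand-in):

```python
import math

class ChannelStats:
    """Running per-channel mean/std via Welford's online algorithm."""
    def __init__(self, channels=3):
        self.n = 0
        self.mean = [0.0] * channels
        self.m2 = [0.0] * channels

    def update(self, pixel):
        # pixel: one value per channel, scaled to [0, 1]
        self.n += 1
        for c, x in enumerate(pixel):
            delta = x - self.mean[c]
            self.mean[c] += delta / self.n
            self.m2[c] += delta * (x - self.mean[c])

    def std(self):
        # population standard deviation per channel
        return [math.sqrt(m / self.n) for m in self.m2]

stats = ChannelStats()
# stand-in data; in practice, iterate over all pixels of your dataset
for pixel in [(0.2, 0.4, 0.6), (0.4, 0.6, 0.8), (0.6, 0.8, 1.0)]:
    stats.update(pixel)
print(stats.mean)   # per-channel means, approx. [0.4, 0.6, 0.8] here
print(stats.std())  # per-channel stds for transforms.Normalize
```

The resulting mean/std tuples would replace the ImageNet constants (0.485, 0.456, 0.406) / (0.229, 0.224, 0.225) wherever they are hard-coded.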

amandalucasp commented 2 years ago

I really feel like normalization may be the bottleneck limiting the performance I'm getting. Thank you for the feedback!

Harry-KIT commented 2 years ago

Hi @psteinb. Thanks for sharing these nice tips; they work well. But one problem I am facing after the whole training is evaluation. When I try to use the checkpoints to run visualize_attention.py and video_generation.py, I get the following errors:

RuntimeError: Error(s) in loading state_dict for VisionTransformer:
    size mismatch for pos_embed: copying a param with shape torch.Size([1, 197, 384]) from checkpoint, the shape in current model is torch.Size([1, 785, 384]).
    size mismatch for patch_embed.proj.weight: copying a param with shape torch.Size([384, 3, 16, 16]) from checkpoint, the shape in current model is torch.Size([384, 3, 8, 8]).

Any suggestion is appreciated! Thank you~

woctezuma commented 2 years ago

Maybe create a new issue?

mathildecaron31 commented 2 years ago

Hi @Harry-KIT

Reading your error message, it seems that you are trying to load a ViT-S/16 checkpoint into a ViT-S/8 model. Can you try adding the flag --patch_size 16 when running visualize_attention.py and video_generation.py?
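The shape mismatch above can be explained by a quick back-of-the-envelope check: a ViT's positional embedding has one entry per patch plus one class token, so its length depends on the patch size. A minimal sketch, assuming DINO's default 224x224 input resolution:

```python
def num_tokens(img_size=224, patch_size=16):
    """Number of positional-embedding entries: patches + 1 class token."""
    return (img_size // patch_size) ** 2 + 1

print(num_tokens(patch_size=16))  # 197 -> matches the checkpoint's pos_embed
print(num_tokens(patch_size=8))   # 785 -> the model built with the default patch size 8
```

So 197 vs. 785 is exactly the ViT-S/16 vs. ViT-S/8 discrepancy, which is why passing --patch_size 16 fixes it.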

Harry-KIT commented 2 years ago

Hi @mathildecaron31 Thank you! done

mathildecaron31 commented 2 years ago

Has it solved your issue @Harry-KIT ?

Harry-KIT commented 2 years ago

Hi @mathildecaron31. Yes, it did. You were right: I changed the patch size from 8 to 16, and it works!