chmxu / eTT_TMLR2022


pre-training dino #1

Closed jimmyrick closed 1 year ago

chmxu commented 2 years ago

You can simply select the ImageNet data whose class indices belong to the meta-train classes. These are provided in the dataset_spec.json from meta-dataset.
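For concreteness, a minimal sketch of that selection, assuming the ilsvrc_2012 dataset_spec.json layout from meta-dataset (a "split_subgraphs" dict holding per-split WordNet subgraphs); this is not the repo's own loading code, so check your local file's schema:

```python
import json

# Assumed schema: spec["split_subgraphs"]["TRAIN"] is a list of WordNet
# synset nodes with "wn_id" and "children_ids" fields (meta-dataset's
# ilsvrc_2012 spec format). Verify against your local dataset_spec.json.
with open("ilsvrc_2012/dataset_spec.json") as f:
    spec = json.load(f)

train_nodes = spec["split_subgraphs"]["TRAIN"]
# Leaf synsets (no children) are the meta-train classes; keep only the
# ImageNet class folders whose WordNet id appears in this set when
# building the DINO pre-training data.
meta_train_wn_ids = {n["wn_id"] for n in train_nodes if not n["children_ids"]}
print(len(meta_train_wn_ids))  # should be the 712 meta-train classes
```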

pyjnqd commented 1 year ago

Hi, could you provide the pretrained models (84 & 224) for convenience in reproducing the results? Thanks a lot.

chmxu commented 1 year ago

You can download the pretrained weights here https://drive.google.com/drive/folders/1PEbH472Qr8ma9WGD6sB9wmBIfA05roKo?usp=share_link.

pyjnqd commented 1 year ago

Thank you!

pyjnqd commented 1 year ago

Excuse me, I find that the ["pos_embed"] tensor shape is [1, 785, 192] in the dino_vit_tiny pretrained model state_dict you provide on Google Drive. This indicates the vit_dino_tiny was trained on 224×224 input (with patch size 8, (224/8)² + 1 = 785 tokens). But from the paper, vit_dino_tiny is trained on 84×84 input. So, do you have a vit_dino_tiny trained on 84×84 input?

chmxu commented 1 year ago

Hi, I've just checked that. In fact, I did not include the inference script for ViT-tiny in the released version. If you are going to test ViT-tiny, you have to change the data config from meta_dataset_config_vit.gin to meta_dataset_config_vit_tiny.gin, and you can find that the code does use 84×84 images. As for the position embedding, I assume it will not cause errors on 84×84 images since the pos embs are interpolated before being attached to the patch embs (see the sketch below). I did forget to tune the patch size when I used the DINO and eTT code, and maybe choosing a patch size suited to 84×84 inputs from the start could lead to different performance. But I think this problem doesn't affect the conclusions in our paper or the efficacy of eTT. Hope this is helpful.
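For reference, a minimal sketch of that interpolation step, modeled on the `interpolate_pos_encoding` idea from DINO-style ViTs; the function name, shapes, and sizes here are illustrative, not the repo's exact code:

```python
import math
import torch
import torch.nn.functional as F

def interpolate_pos_embed(pos_embed: torch.Tensor, num_patches: int) -> torch.Tensor:
    """Resize a ViT position embedding to a new number of patch tokens.

    `pos_embed` has shape [1, 1 + N_old, dim]: one class token followed by
    N_old patch-position embeddings laid out on a square grid.
    """
    cls_pos = pos_embed[:, :1]    # class-token embedding, kept unchanged
    patch_pos = pos_embed[:, 1:]  # per-patch embeddings
    dim = pos_embed.shape[-1]
    old_size = int(math.sqrt(patch_pos.shape[1]))  # e.g. 28 (785 - 1 = 784 = 28²)
    new_size = int(math.sqrt(num_patches))         # e.g. 10 for 84×84, patch 8
    # Reshape the flat token sequence back to a 2D grid and resample it.
    patch_pos = patch_pos.reshape(1, old_size, old_size, dim).permute(0, 3, 1, 2)
    patch_pos = F.interpolate(patch_pos, size=(new_size, new_size),
                              mode="bicubic", align_corners=False)
    patch_pos = patch_pos.permute(0, 2, 3, 1).reshape(1, new_size * new_size, dim)
    return torch.cat([cls_pos, patch_pos], dim=1)

# E.g. adapting the released 224×224 ViT-tiny weights (785 tokens) to 84×84
# inputs: with patch size 8, an 84×84 image yields a 10×10 grid of patches
# (the patch embedding floor-divides), i.e. 100 patch tokens.
pos_embed = torch.randn(1, 785, 192)
print(interpolate_pos_embed(pos_embed, 100).shape)  # torch.Size([1, 101, 192])
```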

pyjnqd commented 1 year ago

I agree with you. Maybe it would give different results if dino_vit_tiny were trained on 84×84 input. I have tried it and it achieved good results.