MCG-NJU / VideoMAE

[NeurIPS 2022 Spotlight] VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training
https://arxiv.org/abs/2203.12602

VideoMAE ViT-H pre-train does not contain the decoder weights #89

Open sandstorm12 opened 1 year ago

sandstorm12 commented 1 year ago

Problem

The Kinetics pre-trained weights for VideoMAE ViT-H and ViT-S seem to have a problem. When loading the weights of the other pre-trained models, such as ViT-L or ViT-B, the state_dict contains the weights for the decoder layers, but this is not the case for ViT-H and ViT-S. As a result, these checkpoints cannot be loaded into an encoder/decoder setup.

How to reproduce

To reproduce, download the weights and load the state_dict. Comparing it with the other pre-trained weights shows that the decoder weights are missing.

import gdown
import torch

URL = "https://drive.google.com/file/d/1AJQR1Rsi2N1pDn9tLyJ8DQrUREiBA1bO/view?usp=sharing"
output_name = "checkpoint.pth"
gdown.cached_download(URL, output_name)

# Load on CPU so no GPU is needed just to inspect the keys
state_dict = torch.load(output_name, map_location="cpu")
print(state_dict["module"])

The state_dict is very large, so I don't include the output here.
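Rather than printing the whole state_dict, one way to make the comparison concrete is to count parameter tensors per top-level key prefix. This is a minimal sketch with a stand-in dictionary; the "encoder."/"decoder." prefixes are an assumption about how the VideoMAE checkpoints name their keys, not verified against every release:

```python
from collections import Counter

def summarize_prefixes(state_dict):
    """Count parameter tensors per top-level module prefix."""
    return Counter(key.split(".")[0] for key in state_dict)

# Stand-in dict instead of the real (very large) checkpoint:
fake_ckpt = {
    "encoder.blocks.0.attn.qkv.weight": None,
    "encoder.blocks.0.mlp.fc1.weight": None,
    "decoder.blocks.0.attn.qkv.weight": None,
}
print(summarize_prefixes(fake_ckpt))
# → Counter({'encoder': 2, 'decoder': 1})
```

Running the same summary on `state_dict["module"]` for ViT-B vs. ViT-H would show directly whether a "decoder" prefix is present in each checkpoint.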

innat commented 1 year ago

The linked pre-trained VideoMAE ViT-H checkpoint is indeed incomplete: it contains only the encoder part, not the decoder weights.
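If you only have the encoder-only checkpoint, a common workaround is to load it with `strict=False` so the missing decoder parameters are reported instead of raising an error. A minimal sketch with a toy model (the class and key names here are stand-ins, not the actual VideoMAE modules):

```python
import torch.nn as nn

# Toy encoder/decoder model; stands in for the real VideoMAE pre-training model.
class TinyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Linear(4, 4)
        self.decoder = nn.Linear(4, 4)

model = TinyModel()

# Pretend checkpoint containing only the encoder weights:
ckpt = {k: v for k, v in model.state_dict().items() if k.startswith("encoder")}

# strict=False loads what is present and reports what is missing;
# the decoder is left randomly initialized.
result = model.load_state_dict(ckpt, strict=False)
print(result.missing_keys)
# → ['decoder.weight', 'decoder.bias']
```

Note that a decoder initialized this way is untrained, so this only helps for fine-tuning the encoder, not for resuming masked-reconstruction pre-training.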

innat commented 1 year ago

cc. @yztongzhan