Problem
The VideoMAE ViT-H and VideoMAE ViT-S pre-trained Kinetics weights seem to have a problem. When loading the weights of other pre-trained models like ViT-L or ViT-B, the state_dict contains the weights for the decoder layers. But this is not true for ViT-H and ViT-S. As a result, it is not possible to load them into an encoder/decoder setup.
How to reproduce
To reproduce, just download the weights and load the state_dict. Comparing it to the other pre-trained weights you can see the decoder weights are missing.
The state_dict is very large, so I don't include the output here.
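Since the full output is too long to paste, a compact summary of the parameter names may be enough to show the difference. The following is only a minimal sketch: the container keys ("model", "module", "state_dict"), the "decoder." name prefix, and the file names in the usage note are assumptions for illustration and may not match the actual checkpoints.

```python
from collections import Counter


def unwrap_state_dict(checkpoint):
    """Return the parameter mapping, unwrapping a container key if present."""
    # Assumption: the weights may be nested under one of these keys.
    for key in ("model", "module", "state_dict"):
        if isinstance(checkpoint, dict) and key in checkpoint:
            return checkpoint[key]
    return checkpoint


def prefix_counts(state_dict):
    """Count parameters by the first component of their dotted name."""
    return Counter(name.split(".", 1)[0] for name in state_dict)


def has_decoder_weights(state_dict):
    """True if any parameter name starts with 'decoder.' (assumed naming)."""
    return any(name.startswith("decoder.") for name in state_dict)


def inspect_checkpoint(path):
    """Load a checkpoint on CPU and print a short summary of its keys."""
    import torch  # imported lazily so the helpers above work without PyTorch

    # Depending on the PyTorch version and the checkpoint contents,
    # torch.load may need weights_only=False (only for trusted files).
    checkpoint = torch.load(path, map_location="cpu")
    state_dict = unwrap_state_dict(checkpoint)
    print(path)
    print("  parameters per prefix:", dict(prefix_counts(state_dict)))
    print("  has decoder weights:", has_decoder_weights(state_dict))
```

Calling `inspect_checkpoint("vit_b_pretrain.pth")` and `inspect_checkpoint("vit_h_pretrain.pth")` (hypothetical local file names) on two downloaded checkpoints should make it easy to compare which top-level modules each file contains.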