Alpha-VL / ConvMAE

ConvMAE: Masked Convolution Meets Masked Autoencoders

output of FastConvMAE #14

Open lywang76 opened 2 years ago

lywang76 commented 2 years ago

I used your FastConvMAE to train on ImageNet data.

In your code, you said the output should be:

[screenshot of the expected output shape from the code]

However, when I used the pretrained model to predict, it gave me a prediction of size torch.Size([4, 196, 768]). I also tested MAE, and it gives a prediction of size torch.Size([1, 196, 768]).

Can you explain why?

stoneMo commented 2 years ago

Thanks for your interest in our FastConvMAE. Motivated by the information-density argument in MAE, we fix the number of complementary groups to 4. Each group keeps 25% of the tokens visible and reconstructs the remaining 75%, which are exactly the tokens kept visible by the other three groups. That is, one forward pass of FastConvMAE is equivalent to four forward passes of the original ConvMAE.
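To make the complementary masking concrete, here is a minimal sketch (not the repository's implementation; the function name and mask convention are made up for illustration) of how four disjoint 25% masks can be generated so that every patch is visible in exactly one of the four groups:

```python
import torch

def complementary_masks(batch_size, num_patches=196, num_groups=4):
    """Build num_groups disjoint masks so each patch is visible in exactly one group."""
    # One random permutation of patch indices per sample.
    ids_shuffle = torch.argsort(torch.rand(batch_size, num_patches), dim=1)
    keep = num_patches // num_groups  # 49 visible patches per group (25%)
    masks = []
    for g in range(num_groups):
        visible = ids_shuffle[:, g * keep:(g + 1) * keep]
        mask = torch.ones(batch_size, num_patches)   # 1 = masked
        mask.scatter_(1, visible, 0.0)               # 0 = visible
        masks.append(mask)
    return torch.stack(masks)                        # [num_groups, batch_size, num_patches]

masks = complementary_masks(batch_size=2)
print(masks.shape)             # torch.Size([4, 2, 196])
print(masks.sum(-1).unique())  # each group masks 147 of 196 patches (75%)
print(masks.sum(0).unique())   # every patch is masked in exactly 3 of the 4 groups
```

Because the four groups together cover every patch exactly once, running them in a single batched forward reconstructs the whole image per sample, which is what makes one FastConvMAE step worth four ConvMAE steps.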

Similarly, if you want to build a FastMAE, you need to carefully check the input to the decoder (the output of the encoder) and the mask tokens for the decoder. Also please check the shape of ids_restore in your implementation. Hope this helps you understand the complementary masking better. The prediction size of the fast version should be [4*bs, 196, 768].
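As a rough, shape-only sketch of that bookkeeping (the 512-dim decoder width, the absence of a cls token, and all variable names are assumptions for illustration, not the repo's actual code), the latent, the mask tokens, and ids_restore all have to be laid out per group, which is why the prediction comes out as [4*bs, 196, 768] (768 presumably being the 16×16×3 reconstructed pixels per patch):

```python
import torch

bs, num_patches, groups = 1, 196, 4
dec_dim = 512                                         # assumed decoder width
keep = num_patches // groups                          # 49 visible tokens per group

latent = torch.randn(groups * bs, keep, dec_dim)      # encoder output: one row per group
ids_restore = torch.argsort(torch.rand(bs, num_patches), dim=1)
ids_restore = ids_restore.repeat(groups, 1)           # must also be repeated for the 4 groups

mask_token = torch.zeros(1, 1, dec_dim)
mask_tokens = mask_token.expand(groups * bs, num_patches - keep, dec_dim)
x = torch.cat([latent, mask_tokens], dim=1)           # [4*bs, 196, dec_dim]
x = torch.gather(x, 1, ids_restore.unsqueeze(-1).expand(-1, -1, dec_dim))

print(x.shape)  # torch.Size([4, 196, 512]); after the prediction head this becomes [4, 196, 768]
```

If ids_restore or the mask tokens are not repeated per group, the gather will either fail or silently mix groups, which is the kind of mismatch that shows up as an unexpected prediction shape.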

Please feel free to contact us if you have any further questions. If possible, please open future issues in the FastConvMAE repo, which would be better for later readers.

lywang76 commented 2 years ago

Thanks!