Hey, thank you for providing the code for the paper. The paper is really interesting and the project page is very well done!

I was wondering whether you've tested the performance of linear probing on the RGB image when the model is trained with all 3 modalities. The linear probing results in the original MAE paper were not very good, so it would be interesting to understand whether the additional supervision creates better representations that translate into better linear probing scores.

Thanks, Eliahu
Hi!
Thank you, we are glad you like the project! We ran the IN-1K linear probing experiment on MultiMAE (pre-trained on all 3 modalities) and MAE baselines. All models use a ViT-B backbone and were pre-trained for 1600 epochs. The linear probes were trained using the code and settings of the official MAE codebase.
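For readers unfamiliar with that recipe, here is a minimal sketch of such a linear probe in PyTorch. It uses a timm ViT-B/16 as a stand-in for the pre-trained backbone and plain SGD for brevity (the official MAE recipe uses LARS with large batches); the parameter-free BatchNorm before the linear layer follows the MAE linear-probing setup. Names and hyperparameters here are illustrative assumptions, not the exact configuration used for the numbers above.

```python
import torch
import torch.nn as nn
import timm

# Frozen encoder; timm's ViT-B/16 stands in for the pre-trained
# MultiMAE/MAE ViT-B backbone (illustrative stand-in).
encoder = timm.create_model("vit_base_patch16_224", num_classes=0)
encoder.eval()
for p in encoder.parameters():
    p.requires_grad = False

# Probe head in the style of the MAE codebase: a BatchNorm layer
# without affine parameters, followed by a single linear classifier.
probe = nn.Sequential(
    nn.BatchNorm1d(encoder.num_features, affine=False, eps=1e-6),
    nn.Linear(encoder.num_features, 1000),  # IN-1K classes
)

optimizer = torch.optim.SGD(probe.parameters(), lr=0.1, momentum=0.9)
criterion = nn.CrossEntropyLoss()

def probe_step(images, labels):
    with torch.no_grad():            # encoder stays frozen
        features = encoder(images)   # (B, 768) pooled features
    logits = probe(features)
    loss = criterion(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```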
Since linear probing accuracy can depend strongly on the depth of the layer that is probed, we compare MultiMAE to an MAE with the same decoder settings (depth 2, dimension 256):

For reference, we also trained a linear probe on an MAE with 8 decoder layers and dimension 512:
MultiMAE pre-training shows a substantial increase in performance compared to the MAE with the same decoder settings (MAE-D2-256). Given the large difference in linear probing performance between MAEs with different decoder depths and dimensions, it stands to reason that MultiMAE linear probing accuracy could be further improved by using a deeper/wider decoder or by probing at different encoder layers (see the sketch below).
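To make the "probing at different encoder layers" option concrete, the sketch below hooks every transformer block of a timm ViT-B and collects pooled per-layer token features, so a separate linear probe could be trained at each depth. Probing every block and mean-pooling the patch tokens are illustrative assumptions on my part, not a protocol from the paper.

```python
import torch
import timm

encoder = timm.create_model("vit_base_patch16_224", num_classes=0)
encoder.eval()

features = {}

def make_hook(name):
    def hook(module, inputs, output):
        # output: (B, 1 + num_patches, 768); mean-pool the patch tokens
        features[name] = output[:, 1:, :].mean(dim=1)
    return hook

# Register a forward hook on every transformer block.
for i, block in enumerate(encoder.blocks):
    block.register_forward_hook(make_hook(f"block_{i}"))

with torch.no_grad():
    encoder(torch.randn(2, 3, 224, 224))  # dummy batch

for name, feat in features.items():
    print(name, feat.shape)  # each entry is (2, 768)
```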
Best, Roman
Thanks for the prompt response, and thank you for running the test. The MultiMAE results do seem encouraging! However, I am not sure I fully understand what you have done. Could you please elaborate on:
Thanks again, Eliahu
We do note that when fine-tuning on IN-1K, MultiMAE and the two PyTorch MAE models (with decoder depths 2 and 8) all reach the same top-1 accuracy of 83.3%.
Thanks for the detailed explanation, and apologies for the confusion; it all makes sense now.