EPFL-VILAB / MultiMAE

MultiMAE: Multi-modal Multi-task Masked Autoencoders, ECCV 2022
https://multimae.epfl.ch

Linear probing results #1

Closed eliahuhorwitz closed 2 years ago

eliahuhorwitz commented 2 years ago

Hey, thank you for providing the code for the paper. The paper is really interesting and the project page is very well done!

I was wondering whether you've tested the performance of linear probing on the RGB image when trained with all 3 modalities. The linear probing results of the original MAE paper were not very good, so it would be interesting to understand whether the additional supervision creates better representations that translate into better linear probing scores.

Thanks, Eliahu

roman-bachmann commented 2 years ago

Hi!

Thank you, we are glad you like the project! We ran the IN-1K linear probing experiment on MultiMAE (pre-trained on all 3 modalities) and MAE baselines. All models use a ViT-B backbone and were pre-trained for 1600 epochs. The linear probes were trained using the code and settings of the official MAE codebase.
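For anyone who wants to reproduce a setup like this, here is a minimal linear-probing sketch: freeze the pre-trained ViT-B encoder and train only a linear classifier on its features. It is not the exact recipe of the official MAE codebase (which, among other things, uses a LARS optimizer and an extra BatchNorm layer before the head); the timm model name, hyperparameters, and checkpoint path are illustrative assumptions.

```python
# Minimal linear-probing sketch (illustrative, not the official MAE recipe):
# freeze a pre-trained ViT-B encoder and train only a linear head on IN-1K.
import torch
import torch.nn as nn
import timm  # assumed available; any frozen ViT-B feature extractor would do

encoder = timm.create_model("vit_base_patch16_224", pretrained=False, num_classes=0)
# Load the MultiMAE / MAE pre-trained weights here (path is hypothetical):
# encoder.load_state_dict(torch.load("pretrained_vitb.pth"), strict=False)
encoder.eval()
for p in encoder.parameters():
    p.requires_grad = False  # the backbone stays frozen during linear probing

head = nn.Linear(encoder.num_features, 1000)  # 1000 IN-1K classes
optimizer = torch.optim.SGD(head.parameters(), lr=0.1, momentum=0.9, weight_decay=0.0)
criterion = nn.CrossEntropyLoss()

def probe_step(images: torch.Tensor, labels: torch.Tensor) -> float:
    with torch.no_grad():            # no gradients through the frozen encoder
        feats = encoder(images)      # (B, 768) pooled features for ViT-B
    loss = criterion(head(feats), labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```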

Since linear probing accuracy can depend strongly on the depth of the layer that is probed, we compare MultiMAE to an MAE with the same decoder settings (depth 2, dimension 256):

For reference, we also trained a linear probe on an MAE with 8 decoder layers and dimension 512:

MultiMAE pre-training sees a substantial increase in performance compared to the MAE with the same decoder settings (MAE-D2-256). Given the large difference in linear probing performance between MAEs with different decoder depths and dimensions, it stands to reason that MultiMAE linear probing accuracy could be further improved by using a deeper/wider decoder or by probing at different encoder layers.
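As an illustration of probing at different encoder layers, here is a minimal sketch that collects intermediate ViT-B block outputs with forward hooks and mean-pools them into per-layer features, so that a separate linear probe could be trained per depth. The timm dependency, layer indices, and pooling choice are illustrative assumptions, not the protocol used for the numbers above.

```python
# Sketch: collect features at several encoder depths via forward hooks,
# to train one linear probe per probed layer.
import torch
import timm  # assumed available

encoder = timm.create_model("vit_base_patch16_224", pretrained=False, num_classes=0)
encoder.eval()

features = {}

def make_hook(name):
    def hook(module, inputs, output):
        # output is the (B, N, 768) token sequence after this transformer block
        features[name] = output.mean(dim=1)  # mean-pool tokens into one feature vector
    return hook

probe_layers = [3, 6, 9, 11]  # hypothetical depths to probe in the 12-block ViT-B
for idx in probe_layers:
    encoder.blocks[idx].register_forward_hook(make_hook(f"block{idx}"))

with torch.no_grad():
    _ = encoder(torch.randn(2, 3, 224, 224))  # one forward pass fills `features`

for name, feat in features.items():
    print(name, feat.shape)  # each is (2, 768); train a linear probe on each
```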

Best, Roman

eliahuhorwitz commented 2 years ago

Thanks for the prompt response and thank you for running the test. The results of MultiMAE do seem encouraging! However, I am not sure I fully understand what you have done. Could you please elaborate on:

  1. Why is the decoder relevant here and not the encoder?
  2. What does the dimension stand for? Is that the latent dimension of the encoder or the decoder, and why did you use 256? As far as I can see, it should be 768 (for the encoder).
  3. Also, could you please explain why you chose to use a depth of 2 and not the one used by MAE (which, as far as I could tell, is 12 for the encoder)?

Thanks again, Eliahu

roman-bachmann commented 2 years ago
  1. Both the encoder and the decoder depth are relevant for linear probing. As shown in my reply above and also in the MAE paper, there are large differences in linear probing accuracy depending on the decoder depth. It is hypothesized that these differences stem from the fact that the last layers specialize in solving a pixel-wise reconstruction task and are not as relevant for recognition as earlier layers, which may process more global information. A network with 12+8 layers might therefore have features better suited for linear probing at the end of the encoder than a network with 12+2 layers.
  2. The dimensions specified in my reply above are the latent dimensions of each token in the decoder(s). For the encoder we do indeed always use 768, just as in a standard ViT-B. For the decoders we chose 256, both because the MAE paper showed that there is only a very small difference in IN-1K fine-tuning accuracy between different widths (Tab. 1b), and to keep the computational cost low.
  3. We also use a depth of 12 for the encoder. Regarding the decoder depth, we follow the same reasoning as above: the MAE paper showed that the IN-1K fine-tuning accuracy does not vary much between different choices of decoder depth (Tab. 1a). Choosing a smaller decoder depth also significantly reduces the computational cost, since we perform three dense reconstruction tasks (see the rough parameter-count sketch after this list).
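As a rough back-of-the-envelope illustration of that cost argument (my own estimate, using the standard ~12·d² parameters per transformer block and ignoring biases, norms, cross-attention, and embedding layers):

```python
# Rough parameter count per transformer block: ~12 * d^2
# (4*d^2 for attention qkv + projection, 8*d^2 for the 4x-expansion MLP).
def approx_params(depth: int, dim: int) -> int:
    return depth * 12 * dim * dim

print(f"D2-256 decoder: ~{approx_params(2, 256) / 1e6:.1f}M params")   # ~1.6M
print(f"D8-512 decoder: ~{approx_params(8, 512) / 1e6:.1f}M params")   # ~25.2M
print(f"ViT-B encoder:  ~{approx_params(12, 768) / 1e6:.1f}M params")  # ~84.9M, for scale
```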

We do note that when fine-tuning on IN-1K, MultiMAE and the two PyTorch MAE models (of decoder depths 2 and 8) all reach the same top-1 accuracy of 83.3%.

eliahuhorwitz commented 2 years ago

Thanks for the detailed explanation, and apologies for the confusion, it all makes sense now.