EPFL-VILAB / MultiMAE

MultiMAE: Multi-modal Multi-task Masked Autoencoders, ECCV 2022
https://multimae.epfl.ch

Query regarding the output adapter heads #5

Closed · AntiLibrary5 closed this issue 2 years ago

AntiLibrary5 commented 2 years ago

Hi, thank you for the interesting work and the extensive experiments. In the paper, your depth results are based on the DPT head, while in the Colab you use the spatial adapter head for inference. I was wondering whether your fine-tuning results with the spatial adapter head were better or worse than with the DPT head. Was the intention behind implementing this spatial head more to test a pure transformer-based head (compared to DPT's convolution-based, RefineNet-like approach)?

Thank you.

roman-bachmann commented 2 years ago

Hi!

The Colab notebook is mainly intended to visualize the pre-training objective and to demonstrate cross-modal interaction, so we show predictions using the spatial adapter head. There are several reasons why we didn't use DPT heads during pre-training; among them, better reconstruction quality does not necessarily result in better transfers, as the MAE paper suggests.

For all our depth fine-tuning runs, we discard the pre-trained spatial adapter head and add a DPT head instead. We do this because, as you might have noticed, the predictions of the spatial adapter head show some "patch artifacts" that become more noticeable the further out of distribution we go in terms of the number of input tokens (e.g. when using the full set of 196 RGB input tokens instead of just 98 as during pre-training). We therefore never fine-tuned the pre-trained spatial adapter head, but given that it already seems to predict depth quite well, this would be something to try in the future.
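For illustration, here is a minimal, hypothetical PyTorch sketch (not the actual MultiMAE API; `ToyDenseHead` and `load_pretrained_multimae_encoder` are stand-in names) of the idea described above: drop the pre-trained spatial output adapter and attach a freshly initialized dense head to the pre-trained encoder for depth fine-tuning, now feeding all 196 RGB tokens of a 224x224 image.

```python
import torch
import torch.nn as nn

# Hypothetical sketch, not the real MultiMAE code: a stand-in for a dense
# prediction head (e.g. a DPT-style head) that maps the encoder's 196 patch
# tokens (14x14 grid of 16x16 patches) back to a full-resolution depth map.
class ToyDenseHead(nn.Module):
    def __init__(self, dim=768, patch_size=16, img_size=224):
        super().__init__()
        self.grid = img_size // patch_size                    # 14 patches per side -> 196 tokens
        self.patch_size = patch_size
        self.proj = nn.Linear(dim, patch_size * patch_size)   # one depth value per pixel of a patch

    def forward(self, tokens):                                # tokens: (B, 196, dim)
        B = tokens.shape[0]
        x = self.proj(tokens)                                 # (B, 196, 16*16)
        x = x.view(B, self.grid, self.grid, self.patch_size, self.patch_size)
        x = x.permute(0, 1, 3, 2, 4).reshape(B, 1, self.grid * self.patch_size, -1)
        return x                                              # (B, 1, 224, 224) dense depth map

# encoder = load_pretrained_multimae_encoder(...)  # assumed helper: pre-trained ViT encoder
# encoder.output_adapters = None                   # discard the pre-trained spatial adapter
# depth_head = ToyDenseHead()                      # new head, trained from scratch while fine-tuning
# depth = depth_head(encoder_tokens)               # encoder_tokens: all 196 RGB tokens at fine-tuning time
```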

Best, Roman

DianCh commented 1 year ago

Hi @roman-bachmann ! May I ask which experiment in the MAE paper you are referring to that shows "better reconstruction quality does not necessarily result in better transfers"?

roman-bachmann commented 1 year ago

Hi @DianCh ,

I was mostly talking about a perceptual notion of reconstruction quality and referring to Table 1d in the MAE paper, which shows that predicting pixels with an MSE loss transfers just as well after fine-tuning as using dVAE tokens as targets. The former produces blurry outputs, while predicting tokens can yield visually more pleasing reconstructions. That said, as PeCo shows, using a tokenizer trained with a perceptual loss can perform better downstream.
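To make the comparison concrete, here is a minimal sketch (toy tensors only, not the actual MAE or MultiMAE code; all shapes are assumed for illustration) of the two reconstruction targets mentioned above: regressing masked-patch pixels with an MSE loss versus classifying discrete dVAE token ids with cross-entropy.

```python
import torch
import torch.nn.functional as F

# Assumed toy shapes: 98 masked patches of a 224x224 image,
# 16x16x3 pixels per patch, and an 8192-entry dVAE codebook.
B, n_masked, patch_dim, vocab = 8, 98, 16 * 16 * 3, 8192

# Pixel-regression target (MAE default): MSE on (optionally per-patch normalized) pixels.
pred_pixels   = torch.randn(B, n_masked, patch_dim)    # decoder output
target_pixels = torch.randn(B, n_masked, patch_dim)    # ground-truth patch pixels
mse_loss = F.mse_loss(pred_pixels, target_pixels)

# Token-classification target (BEiT-style): cross-entropy on dVAE token ids.
pred_logits = torch.randn(B, n_masked, vocab)          # decoder output over the codebook
target_ids  = torch.randint(0, vocab, (B, n_masked))   # ids from a frozen tokenizer
ce_loss = F.cross_entropy(pred_logits.flatten(0, 1), target_ids.flatten())
```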

Best, Roman