EPFL-VILAB / MultiMAE

MultiMAE: Multi-modal Multi-task Masked Autoencoders, ECCV 2022
https://multimae.epfl.ch

Query regarding the output adapter heads #5

Closed · AntiLibrary5 closed this issue 2 years ago

AntiLibrary5 commented 2 years ago

Hi, thank you for the interesting work and the extensive experiments. In the paper, your depth results are based on the DPT head, while in the Colab you use the spatial adapter head for inference. I was wondering whether your fine-tuning results with the spatial adapter head were better or worse than with the DPT head. Was the intention behind implementing this spatial head more to test a pure transformer-based head (compared to DPT's convolution-based, RefineNet-like approach)?

Thank you.

roman-bachmann commented 2 years ago

Hi!

The Colab notebook is mainly intended to visualize the pre-training objective and to demonstrate cross-modal interaction, so we show predictions using the spatial adapter head. There are several reasons why we didn't use DPT heads during pre-training; among them, better reconstruction quality does not necessarily result in better transfers, as the MAE paper suggests.

For all our depth fine-tuning runs, we discard the pre-trained spatial adapter head and add a DPT head instead. We do this because, as you might have noticed, the predictions of the spatial adapter head show some "patch artifacts" that become more noticeable the further out of distribution we go in terms of the number of input tokens (e.g. when using the full set of 196 RGB input tokens instead of just 98 as during pre-training). We therefore never fine-tuned the pre-trained spatial adapter head, but given that it already seems to predict depth quite well, this would be something to try in the future.
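For illustration, here is a minimal, hypothetical PyTorch sketch (not the actual MultiMAE API; `ToyDenseHead` and `load_pretrained_multimae_encoder` are stand-in names) of the idea described above: drop the pre-trained spatial output adapter and attach a freshly initialized dense head to the pre-trained encoder for depth fine-tuning, now feeding all 196 RGB tokens of a 224x224 image.

```python
import torch
import torch.nn as nn

# Hypothetical sketch, not the real MultiMAE code: a stand-in for a dense
# prediction head (e.g. a DPT-style head) that maps the encoder's 196 patch
# tokens (14x14 grid of 16x16 patches) back to a full-resolution depth map.
class ToyDenseHead(nn.Module):
    def __init__(self, dim=768, patch_size=16, img_size=224):
        super().__init__()
        self.grid = img_size // patch_size                    # 14 patches per side -> 196 tokens
        self.patch_size = patch_size
        self.proj = nn.Linear(dim, patch_size * patch_size)   # one depth value per pixel of a patch

    def forward(self, tokens):                                # tokens: (B, 196, dim)
        B = tokens.shape[0]
        x = self.proj(tokens)                                 # (B, 196, 16*16)
        x = x.view(B, self.grid, self.grid, self.patch_size, self.patch_size)
        x = x.permute(0, 1, 3, 2, 4).reshape(B, 1, self.grid * self.patch_size, -1)
        return x                                              # (B, 1, 224, 224) dense depth map

# encoder = load_pretrained_multimae_encoder(...)  # assumed helper: pre-trained ViT encoder
# encoder.output_adapters = None                   # discard the pre-trained spatial adapter
# depth_head = ToyDenseHead()                      # new head, trained from scratch while fine-tuning
# depth = depth_head(encoder_tokens)               # encoder_tokens: all 196 RGB tokens at fine-tuning time
```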

Best, Roman

DianCh commented 1 year ago

Hi @roman-bachmann ! May I ask which experiment in the MAE paper you are referring to that shows "better reconstruction quality does not necessarily result in better transfers"?

roman-bachmann commented 1 year ago

Hi @DianCh ,

I was mostly talking about a perceptual notion of reconstruction quality and referring to Table 1d in the MAE paper, which shows that predicting pixels with an MSE loss transfers just as well after fine-tuning as using dVAE tokens as targets. The former produces blurry outputs, while predicting tokens can yield visually more pleasing reconstructions. That said, as PeCo shows, using a tokenizer trained with a perceptual loss can perform better downstream.
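To make the comparison concrete, here is a minimal sketch (toy tensors only, not the actual MAE or MultiMAE code; all shapes are assumed for illustration) of the two reconstruction targets mentioned above: regressing masked-patch pixels with an MSE loss versus classifying discrete dVAE token ids with cross-entropy.

```python
import torch
import torch.nn.functional as F

# Assumed toy shapes: 98 masked patches of a 224x224 image,
# 16x16x3 pixels per patch, and an 8192-entry dVAE codebook.
B, n_masked, patch_dim, vocab = 8, 98, 16 * 16 * 3, 8192

# Pixel-regression target (MAE default): MSE on (optionally per-patch normalized) pixels.
pred_pixels   = torch.randn(B, n_masked, patch_dim)    # decoder output
target_pixels = torch.randn(B, n_masked, patch_dim)    # ground-truth patch pixels
mse_loss = F.mse_loss(pred_pixels, target_pixels)

# Token-classification target (BEiT-style): cross-entropy on dVAE token ids.
pred_logits = torch.randn(B, n_masked, vocab)          # decoder output over the codebook
target_ids  = torch.randint(0, vocab, (B, n_masked))   # ids from a frozen tokenizer
ce_loss = F.cross_entropy(pred_logits.flatten(0, 1), target_ids.flatten())
```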

Best, Roman