Question on ablation study

Hi authors, thank you for your awesome work.

I was going through the VFIformer paper and I got curious of something from the ablation study. It would have been great to attend CVPR and ask in person, but unfortunately I cannot do so, so I leave my question here.

In short, to my understanding, the main contribution of the paper is: use of transformer layers in VFI, with a novel cross-scale window attention, reaching a state-of-the-art performance.

So I assume the 'Model 1' configuration of Table 2 consists of Convolutional layers only, yet it still outperforms (36.27 in Vimeo90k) the best baseline (36.18). I came to wonder the reason for this. To me, the 'Model 1' configuration did not seem to have anything special (no offense) since it did not contain the proposed modules.

Can you give an explanation on this? What was the difference that lead to a strong base model (Model 1)? Or did I miss anything on the 'Model 1' configuration...?

dvlab-research / VFIformer

Question on ablation study #2