dvlab-research / VFIformer

Video Frame Interpolation with Transformer (CVPR2022)
MIT License

Question on ablation study #2

Open JHLew opened 2 years ago

JHLew commented 2 years ago

Hi authors, thank you for your awesome work.

I was going through the VFIformer paper and became curious about something in the ablation study. It would have been great to attend CVPR and ask in person, but unfortunately I cannot, so I am leaving my question here.

In short, to my understanding, the main contribution of the paper is the use of Transformer layers in VFI, together with a novel cross-scale window attention, reaching state-of-the-art performance.
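(For concreteness, my rough mental model of "cross-scale window attention" is the toy sketch below: queries come from fine-scale windows while keys/values come from a coarser, downsampled copy of the feature map, so each window sees a larger receptive field. This is my own simplified illustration; the names, shapes, and details are not the paper's actual implementation.)

```python
import torch
import torch.nn.functional as F
from torch import nn

class ToyCrossScaleWindowAttention(nn.Module):
    """Toy sketch only: queries from fine windows, keys/values from a coarse scale."""
    def __init__(self, dim, window=8, heads=4):
        super().__init__()
        self.window, self.heads = window, heads
        self.q = nn.Linear(dim, dim)
        self.kv = nn.Linear(dim, dim * 2)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):
        # x: (B, C, H, W); H and W assumed divisible by the window size.
        B, C, H, W = x.shape
        w = self.window
        # Coarse copy: 2x-downsample, then upsample back so fine windows can
        # attend to keys/values carrying a larger effective receptive field.
        coarse = F.interpolate(F.avg_pool2d(x, 2), size=(H, W),
                               mode='bilinear', align_corners=False)

        def windows(t):  # (B, C, H, W) -> (B * num_windows, w*w, C)
            t = t.view(B, C, H // w, w, W // w, w)
            return t.permute(0, 2, 4, 3, 5, 1).reshape(-1, w * w, C)

        q = self.q(windows(x))                        # queries: fine scale
        k, v = self.kv(windows(coarse)).chunk(2, -1)  # keys/values: coarse scale

        def split_heads(t):  # (N, L, C) -> (N, heads, L, C/heads)
            n, l, c = t.shape
            return t.view(n, l, self.heads, c // self.heads).transpose(1, 2)

        q, k, v = map(split_heads, (q, k, v))
        attn = (q @ k.transpose(-2, -1)) * (q.shape[-1] ** -0.5)
        out = (attn.softmax(-1) @ v).transpose(1, 2).reshape(-1, w * w, C)
        out = self.proj(out)
        # Fold windows back to (B, C, H, W).
        out = out.view(B, H // w, W // w, w, w, C).permute(0, 5, 1, 3, 2, 4)
        return out.reshape(B, C, H, W)

# usage: ToyCrossScaleWindowAttention(64)(torch.randn(1, 64, 32, 32)).shape == (1, 64, 32, 32)
```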

So I assume the 'Model 1' configuration in Table 2 consists of convolutional layers only, yet it still outperforms the best baseline (36.27 dB vs. 36.18 dB on Vimeo90K). I came to wonder why. To me, the 'Model 1' configuration did not seem to have anything special (no offense), since it does not contain the proposed modules.

Can you give an explanation for this? What was the difference that led to such a strong base model (Model 1)? Or did I miss something about the 'Model 1' configuration?

SkyeLu commented 2 years ago

Hi, thanks for your interest in our work. As mentioned in the appendix of our paper, the main difference between Model 1 and the best baseline model is the flow estimator with the proposed Bilateral Local Refinement Blocks (BLRBs, Fig. 9(b)), which in fact brings about a 0.1 dB improvement. However, we do not claim the BLRB as one of our key contributions, because once the model is equipped with transformer layers, the contribution of the BLRBs is limited.
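(For readers without the appendix at hand: the general pattern here is a block that predicts residual corrections to an initial pair of bilateral flows from both frames' features. The sketch below only illustrates this generic residual-refinement pattern; it is not the actual BLRB, whose design is given in Fig. 9(b) of the paper.)

```python
import torch
from torch import nn

class ToyFlowRefineBlock(nn.Module):
    """Generic residual refinement of bilateral flows; not the paper's BLRB."""
    def __init__(self, feat_ch=64):
        super().__init__()
        # Input: features of both frames plus the two 2-channel flows.
        self.body = nn.Sequential(
            nn.Conv2d(feat_ch * 2 + 4, feat_ch, 3, padding=1),
            nn.PReLU(feat_ch),
            nn.Conv2d(feat_ch, feat_ch, 3, padding=1),
            nn.PReLU(feat_ch),
            nn.Conv2d(feat_ch, 4, 3, padding=1),  # residuals for both flows
        )

    def forward(self, feat0, feat1, flow_01, flow_10):
        res = self.body(torch.cat([feat0, feat1, flow_01, flow_10], dim=1))
        return flow_01 + res[:, :2], flow_10 + res[:, 2:]
```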