MCG-NJU / VideoMAE

[NeurIPS 2022 Spotlight] VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training
https://arxiv.org/abs/2203.12602

about inference speed? #1

Closed gy20073 closed 2 years ago

gy20073 commented 2 years ago

Thanks for the great work! I have one question that was not discussed in the paper: the inference speed.

I understand that training is fast, since you mask out 90% of the patches. At test time, however, I assume you retain all the patches. Since attention in ViT has quadratic cost, does that mean you need 10 × 10 = 100 times the FLOPs compared to the pre-training phase?

If that is the case, how much time does it take to do a single forward pass?

yztongzhan commented 2 years ago

Hi @gy20073, thanks for your questions! All the patches are indeed retained during the fine-tuning phase. However, many more epochs are needed during the pre-training phase, and the shallow decoder also processes all the patches. Our encoder is exactly the vanilla Vision Transformer, and the FLOPs are all reported in Table 6 and Table 7 for comparison with other methods.
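
For concreteness, here is a back-of-the-envelope comparison of the encoder's input length in the two phases, assuming the paper's default configuration (16 frames, 2×16×16 cube embedding on 224×224 input, 90% tube masking). This is a minimal sketch, not code taken from the repository, and the exact visible-token count may be rounded differently in practice:

```python
# Illustrative token counts for VideoMAE's default setup:
# 16 frames, 2x16x16 cube embedding on 224x224 input, 90% tube masking.
frames, tube_size, patch_size, img_size = 16, 2, 16, 224

total_tokens = (frames // tube_size) * (img_size // patch_size) ** 2  # 8 * 14 * 14 = 1568
visible_tokens = round(total_tokens * (1 - 0.9))                      # ~157 tokens left after 90% masking

print(f"pre-training encoder sees ~{visible_tokens} tokens per clip")
print(f"fine-tuning / inference encoder sees {total_tokens} tokens per clip")
```

So the pre-training encoder only runs on roughly a tenth of the sequence, while fine-tuning and inference run on the full 1568 tokens.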

gy20073 commented 2 years ago

Thanks! That is very helpful.


gy20073 commented 2 years ago

Sorry, I am still confused about the test-time FLOPs. ViT-B/16 on a single 224×224 image should take 17.5 GFLOPs (see the Swin Transformer paper, Table 1a). In VideoMAE you use 8 frames, so that should be an increase of 64×, i.e. 1120 FLOPs; however, the paper reports 180 FLOPs in Table 6. Can you point out where I am wrong?

gy20073 commented 2 years ago

Sorry, those should be GFLOPs.

yztongzhan commented 2 years ago

Quadratic complexity only exists in the space-time joint attention layers, which are not the computational bottleneck in the Transformer when the input token sequence isn't too long. The bottleneck mainly lies in the FFN (MLP).

yztongzhan commented 2 years ago

Most operations have linear complexity in the number of tokens; only computing the pair-wise attention weights is quadratic.
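
For a rough sense of the breakdown, here is an illustrative per-block FLOP count for ViT-B (hidden size 768, MLP ratio 4, 12 blocks) at the fine-tuning sequence length of 1568 tokens. It counts one multiply-add as one FLOP and ignores layer norms, softmax, biases, and the classification head, so it is an estimate rather than the paper's exact accounting:

```python
# Back-of-the-envelope FLOPs for one ViT-B block (1 multiply-add counted as 1 FLOP).
# Illustrative only; layer norms, softmax, biases and the head are ignored.

def vit_block_flops(num_tokens, dim=768, mlp_ratio=4):
    qkv_and_proj = 4 * num_tokens * dim * dim        # Q/K/V projections + output projection (linear in N)
    attn_matmuls = 2 * num_tokens**2 * dim           # QK^T and attention-times-V: the only quadratic term
    ffn = 2 * mlp_ratio * num_tokens * dim * dim     # two MLP layers with hidden size 4*dim (linear in N)
    return qkv_and_proj, attn_matmuls, ffn

num_tokens = 8 * 14 * 14                             # 16 frames / tube size 2, 224x224 / patch 16 -> 1568
proj, quad, ffn = vit_block_flops(num_tokens)
total = 12 * (proj + quad + ffn)                     # 12 blocks in ViT-B

print(f"per block: projections {proj / 1e9:.2f}G, attention matmuls {quad / 1e9:.2f}G, FFN {ffn / 1e9:.2f}G")
print(f"12 blocks: {total / 1e9:.1f} GFLOPs")        # ~178 G, close to the ~180 G reported in Table 6
```

At this sequence length the quadratic term is only about a quarter of a block's cost, so going from one 224×224 image (~17.5 GFLOPs) to the full 1568-token clip costs roughly 10×, not 64×.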

gy20073 commented 2 years ago

Thanks! My understanding was incorrect.