Hi @gy20073, thanks for your questions! All the patches are indeed retained during the fine-tuning phase. However, many more epochs are needed during the pre-training phase, and the shallow decoder also retains all the patches. Our encoder is exactly the vanilla Vision Transformer, and the FLOPs are all reported in Table 6 and Table 7 for comparison with other methods.
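For intuition, here is a minimal sketch of that asymmetric token flow (the shapes, the 90% mask ratio, and the 1568-token count are illustrative assumptions, not the repository's actual code):

```python
import torch

# Hypothetical setup: 1568 space-time patch embeddings of width 768
# (e.g. 8 temporal slices x 14x14 spatial patches), mask ratio 0.9.
tokens = torch.randn(1, 1568, 768)
mask = torch.rand(1568) < 0.9        # True = masked out

# Pre-training: the encoder sees only the ~10% visible patches,
# while the shallow decoder later receives the full token set
# (encoded visible tokens + learnable [MASK] tokens).
visible = tokens[:, ~mask]           # shape (1, ~157, 768)

# Fine-tuning / inference: no masking, the encoder processes everything.
full_input = tokens                  # shape (1, 1568, 768)
```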
Thanks! That is very helpful.
Sorry, I am still confused about the test-time FLOPs. ViT-B/16 on a single 224×224 image should cost 17.5 GFLOPs (see the Swin Transformer paper, Table 1a). In VideoMAE you used 8 frames, so that should be a 64× increase, i.e. 1120 GFLOPs, yet the paper reports 180 GFLOPs in Table 6. Can you point out where I am going wrong?
Quadratic complexity exists only in the space-time joint attention layer, which is not the computational bottleneck of the Transformer when the input token sequence is not too long. The bottleneck mainly lies in the FFN (MLP). Most operations have linear complexity; only computing the pair-wise attention weights is quadratic.
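A back-of-the-envelope breakdown makes this concrete. The sketch below assumes ViT-B dimensions (width 768, 12 layers, MLP ratio 4), ignores the class token and head, and counts one multiply-add as one FLOP, which is the convention behind the 17.5 GFLOPs figure; the only term quadratic in the token count is the attention-weight computation:

```python
def vit_flops(n, d=768, layers=12, mlp_ratio=4):
    """Approximate multiply-adds for a plain ViT over n tokens."""
    qkv  = 3 * n * d * d                 # Q/K/V projections   (linear in n)
    attn = 2 * n * n * d                 # scores + weighted V (quadratic in n)
    proj = n * d * d                     # output projection   (linear in n)
    ffn  = 2 * n * d * (mlp_ratio * d)   # two FFN matmuls     (linear in n)
    return layers * (qkv + attn + proj + ffn)

image_tokens = 14 * 14               # 224/16 patches per side
video_tokens = 8 * image_tokens      # 8x as many space-time tokens

print(vit_flops(image_tokens) / 1e9)  # ~17.4  -> matches the ~17.5 GFLOPs figure
print(vit_flops(video_tokens) / 1e9)  # ~178   -> matches the ~180 GFLOPs figure
```

With 8× as many tokens, only the `attn` term grows 64×; the dominant linear terms grow 8×, so the total goes from roughly 17.5 to roughly 180 GFLOPs, about a 10× increase rather than 64×.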
Thanks! My understanding was incorrect.
Thanks for the great work! But I have one question that the paper does not discuss: the inference speed.
I understand that training is fast, since you mask out 90% of the patches. At test time, however, I assume you retain all the patches. Since attention in ViT has quadratic cost, does that mean you need 10 × 10 = 100 times the FLOPs compared to the pre-training phase?
If that is the case, how much time does it take to do a single forward pass?