ahmed-nady opened 1 week ago
Hi @ahmed-nady, thank you for your interest!
Thanks, @dominickrei, for your swift response. According to TimeSFormer paper, they used 1 temporal clip x 3 spatial crops. So, as you mentioned , the GFLOPs of TimeSFormer are 196.7 x 3 = 590 GFLOPs. But in your config file, e.g., PIViT_Smarthome.yaml, you specified the number of spatial crops to be 1. As a result, the #GFLOPs of the used TimeSFormer per clip, i think it should be 196.7. Hence the #GFLOPs of adopted TimeSFormer at inference is 196.7 x 10 = 1967 GFLOPs on Toyota-Smarthome dataset (8 RGB input frames) and 196.7x2x10= 3934 GFLOPs on NTU datasets (number of input frames 16) since you use 10 clips per video, right?
Hi @ahmed-nady. From what I recall, the TimeSformer paper does not compute FLOPs the way you describe: they are computed from a single forward pass on a single sample, independent of the number of temporal clips and spatial crops (i.e., the input shape is `1 x 3 x 224 x 224`, not `(spatial crops * temporal clips) x 3 x 224 x 224`).
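As an illustration of that convention, a minimal sketch of a single-sample FLOP count is below. It uses fvcore's `FlopCountAnalysis` and torchvision's `r3d_18` purely as a stand-in video backbone; it is not the exact script or model used in the paper.

```python
# Minimal sketch: per-clip FLOPs from one forward pass on a single sample,
# independent of how many clips/crops are later used at test time.
# r3d_18 is only a stand-in video backbone, not the model from this repo.
import torch
from fvcore.nn import FlopCountAnalysis
from torchvision.models.video import r3d_18

model = r3d_18().eval()                # substitute the TimeSformer / pi-vit model here
clip = torch.randn(1, 3, 8, 224, 224)  # one sample: batch 1, 3 channels, 8 frames, 224x224

with torch.no_grad():
    flops = FlopCountAnalysis(model, clip)
    print(f"GFLOPs per clip: {flops.total() / 1e9:.1f}")
```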
But you are correct that at inference we use more FLOPs than the base setting of TimeSformer. However, when we report results for TimeSformer, we use the same test settings that we use for pi-vit.
Thanks, @dominickrei, for sharing your code. I am a bit confused about the number of clips per video you use for the NTU and Smarthome datasets. Are you using 10 clips per video, since in the config file (PIViT_Smarthome.yaml) you specified NUM_ENSEMBLE_VIEWS: 10?
Also, the GFLOPs for one clip on the NTU dataset during testing are 196.7, since you use pose-augmented RGB, right?
Thanks in advance.
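For reference, if the config follows the PySlowFast/TimeSformer convention, NUM_ENSEMBLE_VIEWS is the number of temporal clips sampled per test video, and the per-video score is the average over all clip (and crop) predictions. A rough sketch of that test-time ensembling is below; `model` and `sample_clip` are placeholders, and the repo's actual test loop may differ.

```python
# Rough sketch of multi-view testing under the PySlowFast/TimeSformer convention:
# sample NUM_ENSEMBLE_VIEWS temporal clips (and NUM_SPATIAL_CROPS crops) per video,
# score each one, and average. `model` and `sample_clip` are placeholders.
import torch

def predict_video(model, sample_clip, num_ensemble_views=10, num_spatial_crops=1):
    scores = []
    with torch.no_grad():
        for view in range(num_ensemble_views):
            for crop in range(num_spatial_crops):
                clip = sample_clip(view, crop)               # tensor of shape (1, 3, T, 224, 224)
                scores.append(model(clip).softmax(dim=-1))   # per-clip class probabilities
    # Test-time cost therefore scales with num_ensemble_views x num_spatial_crops.
    return torch.stack(scores).mean(dim=0)
```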