dominickrei / pi-vit

[CVPR 2024] Code and models for pi-ViT, a video transformer for understanding activities of daily living

Number of clips per video #4

Open ahmed-nady opened 1 week ago

ahmed-nady commented 1 week ago

Thanks, @dominickrei, for sharing your code. I am a bit confused about the number of clips per video you use for the NTU and Smarthome datasets. Are you using 10 clips per video, since in the config file (PIViT_Smarthome.yaml) you specify `NUM_ENSEMBLE_VIEWS: 10`?

Also, the GFLOPs for one clip on the NTU dataset at test time are 196.7, since you use pose-augmented RGB, right?

Thanks in advance.

dominickrei commented 1 week ago

Hi @ahmed-nady, thank you for your interest!

  1. You are correct that the number of clips per video at inference is 10 (see the sketch below for what this typically means)
  2. At inference the # GFLOPs is identical to TimeSformer, so 590 GFLOPs
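
For readers unfamiliar with the `NUM_ENSEMBLE_VIEWS` convention, here is a minimal sketch of how multi-view inference typically works in SlowFast-style codebases: sample several temporal clips per video, run each through the model, and average the predictions. The function name and input shapes below are illustrative, not the repo's actual API.

```python
import torch

def predict_video(model, clips):
    # `clips` holds NUM_ENSEMBLE_VIEWS tensors, each a single model input,
    # e.g. shaped (1, 3, T, 224, 224). Names and shapes are illustrative.
    with torch.no_grad():
        logits = torch.stack([model(clip) for clip in clips])  # (views, 1, num_classes)
    return logits.mean(dim=0)  # average over views -> video-level prediction
```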
ahmed-nady commented 1 week ago

Thanks, @dominickrei, for your swift response. According to the TimeSformer paper, they use 1 temporal clip x 3 spatial crops, so, as you mentioned, the GFLOPs of TimeSformer are 196.7 x 3 = 590 GFLOPs. But in your config file, e.g., PIViT_Smarthome.yaml, you specify the number of spatial crops to be 1. As a result, I think the GFLOPs of the adopted TimeSformer per clip should be 196.7. Hence, since you use 10 clips per video, the GFLOPs of the adopted TimeSformer at inference are 196.7 x 10 = 1967 GFLOPs on the Toyota-Smarthome dataset (8 RGB input frames) and 196.7 x 2 x 10 = 3934 GFLOPs on the NTU datasets (16 input frames), right?
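For concreteness, the arithmetic above can be restated as a quick script. It assumes per-view GFLOPs scale linearly with the number of views and frames; the 196.7 per-view figure is the one quoted in this thread, not a number confirmed by the repo.

```python
# Sanity check of the inference-cost arithmetic discussed above.
PER_VIEW = 196.7  # GFLOPs for one view: 1 spatial crop, 8 frames (assumed)

print(PER_VIEW * 3)       # TimeSformer paper: 1 clip x 3 crops ~= 590 GFLOPs
print(PER_VIEW * 10)      # Smarthome: 10 clips x 1 crop, 8 frames = 1967 GFLOPs
print(PER_VIEW * 2 * 10)  # NTU: 16 frames (~2x the 8-frame cost), 10 clips = 3934 GFLOPs
```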

dominickrei commented 1 week ago

Hi @ahmed-nady. From what I recall, the TimeSformer paper does not compute FLOPs the way you describe: they are computed from a single forward pass on a single sample, independent of temporal clips and spatial crops (i.e., the input shape is 1 x 3 x 224 x 224, not (spatial x temporal crops) x 3 x 224 x 224).
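(For reference, a minimal sketch of this per-sample measurement style, using fvcore's FLOP counter; the model constructor and input shape are placeholders, not the repo's actual code.)

```python
import torch
from fvcore.nn import FlopCountAnalysis

model = build_model()                 # hypothetical: any TimeSformer-style video model
x = torch.randn(1, 3, 8, 224, 224)   # one sample: (batch, channels, frames, H, W)

flops = FlopCountAnalysis(model, x)   # single forward pass on a single sample
print(flops.total() / 1e9, "GFLOPs")  # per-sample cost, independent of clips/crops
```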

But you are correct that at inference we use more FLOPs than the base setting of TimeSformer. However, when we report results for TimeSformer, we use the same test settings that we do for pi-ViT.