facebookresearch / omnivore

Omnivore: A Single Model for Many Visual Modalities

Inconsistency in frame sampling between paper and YAML file for inference_k400_in1k_pretrained.yaml #38

Closed ofikodar closed 1 year ago

ofikodar commented 1 year ago

The paper states "At test time, we again sample a 32 frame clip with stride 2", but the YAML file uses the following settings for sampling and temporal cropping:

    frame_sampler:
      _target_: pytorchvideo.transforms.UniformTemporalSubsample
      num_samples: 160

and

    - _target_: omnivision.data.transforms.pytorchvideo.TemporalCrop
      frames_per_clip: 32
      stride: 40

These settings seem to be inconsistent with what was reported in the paper. Can the authors please clarify if this is a mistake or if there is a reason for this discrepancy?

rohitgirdhar commented 1 year ago

Hi @ofikodar, the stride in the second part (TemporalCrop) refers to the stride of the temporal cropping, not the stride between frames. When we sample 160 frames from a 10s video, we are effectively sampling at a frame stride of 2 (since a 10s video will have a total of 320 frames, assuming a typical frame rate of 32 FPS). We then split those 160 frames into shorter clips of 32 frames each, so every clip consists of frames sampled at a stride of 2.
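
The arithmetic in this reply can be checked with a small sketch. It is plain Python, not the pytorchvideo or omnivision code: the helper names `uniform_subsample_indices` and `temporal_crop_starts` are hypothetical, the evenly spaced index selection is an integer approximation of what a uniform subsampler does, and the sliding-window behaviour assumed for the crop (a new clip every `stride` sampled frames, keeping only clips that fit fully) is an assumption about TemporalCrop rather than something confirmed in this thread.

```python
def uniform_subsample_indices(total_frames: int, num_samples: int) -> list[int]:
    """Evenly spaced source-frame indices from 0 to total_frames - 1.

    Integer approximation of uniform temporal subsampling (assumption:
    the real transform may round slightly differently).
    """
    if num_samples == 1:
        return [0]
    return [(i * (total_frames - 1)) // (num_samples - 1) for i in range(num_samples)]


def temporal_crop_starts(num_frames: int, frames_per_clip: int, stride: int) -> list[int]:
    """Start offsets of sliding-window clips that fit fully inside num_frames.

    Assumed behaviour: one clip every `stride` sampled frames.
    """
    return list(range(0, num_frames - frames_per_clip + 1, stride))


# Assumed input: a 10 s video at 32 FPS (320 frames), as in the reply above.
indices = uniform_subsample_indices(total_frames=320, num_samples=160)
gaps = {b - a for a, b in zip(indices, indices[1:])}

# Values taken from the YAML in the question.
starts = temporal_crop_starts(num_frames=160, frames_per_clip=32, stride=40)

print(sorted(gaps))  # [2, 3]: a frame stride of 2, with one gap of 3 at the end
print(starts)        # [0, 40, 80, 120]: four clips of 32 sampled frames each
```

Under these assumptions, `stride: 40` only controls where each 32-frame clip starts within the 160 sampled frames, while the spacing between frames inside a clip stays at roughly 2 source frames, matching the paper's description.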

ofikodar commented 1 year ago

Thanks for your help!