facebookresearch / hiera

Hiera: A fast, powerful, and simple hierarchical vision transformer.
Apache License 2.0

In inference.ipynb, each video is extracted to a length of 5s, but how should videos of different lengths, especially ones shorter than 5s, be handled? #15

Closed sulizhi closed 1 year ago

dbolya commented 1 year ago

The basic unit for the video models is "16x4", i.e., we sample 16 frames, each 4 frames apart. Since we trained with 30 fps video, that means you need at least 64 frames @ 30 fps, or 2.133 s of video, to run the model. If you don't have that many seconds of video, perhaps you can duplicate frames, or sample with a gap of 2 frames instead, etc. (but I can't guarantee accuracy then).
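A minimal sketch of that 16x4 sampling, assuming the video is already decoded into a NumPy array of shape (T, H, W, C) at 30 fps; the frame-duplication fallback for short videos is just one of the options mentioned above, not something the repo provides:

```python
import numpy as np

def sample_16x4(frames: np.ndarray, num_frames: int = 16, stride: int = 4) -> np.ndarray:
    """Sample 16 frames, each 4 frames apart (64 frames @ 30 fps ~= 2.133 s)."""
    needed = num_frames * stride
    if frames.shape[0] < needed:
        # Option 1 from the comment above: duplicate frames (here: repeat the last one).
        pad = np.repeat(frames[-1:], needed - frames.shape[0], axis=0)
        frames = np.concatenate([frames, pad], axis=0)
        # Option 2 (alternative): reduce the stride, e.g. stride=2,
        # though accuracy is not guaranteed in that case.
    idx = np.arange(num_frames) * stride  # [0, 4, 8, ..., 60]
    return frames[idx]                    # shape (16, H, W, C)
```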

Then, for clips longer than 2.133 seconds, just do what I did in the notebook: pass multiple clips into the model and average the results. So in the notebook example, I take 5 s of video and extract the first 128 frames (or the first 4.266 seconds). If you have a longer video, just increase the transcoding duration and sample more clips.
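A sketch of that multi-clip averaging: split the video into consecutive 64-frame clips, run each through the model, and average the predictions. `model` and `preprocess` below are placeholders for the Hiera video model and its preprocessing in the notebook; the exact names and tensor layout there may differ.

```python
import torch

@torch.no_grad()
def predict_video(frames, model, preprocess, clip_len: int = 64):
    """Average model predictions over consecutive 64-frame clips."""
    logits = []
    for start in range(0, frames.shape[0] - clip_len + 1, clip_len):
        clip = frames[start:start + clip_len]
        clip_16x4 = clip[::4][:16]          # 16 frames, 4 frames apart
        x = preprocess(clip_16x4)           # assumed to return a (1, C, 16, H, W) tensor
        logits.append(model(x))
    return torch.stack(logits).mean(dim=0)  # average over clips
```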

sulizhi commented 1 year ago

The problem has been solved! Thank you very much!