facebookresearch / hiera

Hiera: A fast, powerful, and simple hierarchical vision transformer.
Apache License 2.0

In inference.ipynb, each video is extracted to a length of 5s, but how should videos of different lengths, especially ones shorter than 5s, be handled? #15

Closed sulizhi closed 1 year ago

dbolya commented 1 year ago

The basic unit for the video models is "16x4", i.e., we sample 16 frames, each 4 frames apart. Since we trained with 30 fps video, that means you need at least 64 frames @ 30 fps, or 2.133 s of video, to run the model. If you don't have that many seconds of video, perhaps you can duplicate frames, or sample with a gap of 2 frames instead, etc. (but I can't guarantee accuracy then).
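A minimal sketch of that 16x4 sampling, assuming the video is already decoded into a NumPy array of shape (T, H, W, C) at 30 fps; the frame-duplication fallback for short videos is just one of the options mentioned above, not something the repo provides:

```python
import numpy as np

def sample_16x4(frames: np.ndarray, num_frames: int = 16, stride: int = 4) -> np.ndarray:
    """Sample 16 frames, each 4 frames apart (64 frames @ 30 fps ~= 2.133 s)."""
    needed = num_frames * stride
    if frames.shape[0] < needed:
        # Option 1 from the comment above: duplicate frames (here: repeat the last one).
        pad = np.repeat(frames[-1:], needed - frames.shape[0], axis=0)
        frames = np.concatenate([frames, pad], axis=0)
        # Option 2 (alternative): reduce the stride, e.g. stride=2,
        # though accuracy is not guaranteed in that case.
    idx = np.arange(num_frames) * stride  # [0, 4, 8, ..., 60]
    return frames[idx]                    # shape (16, H, W, C)
```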

Then, for clips longer than 2.133 seconds, just do what I did in the notebook: pass multiple clips into the model and average the results. So in the notebook example, I take 5 s of video and extract the first 128 frames (or the first 4.266 seconds). If you have a longer video, just increase the transcoding duration and sample more clips.
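A sketch of that multi-clip averaging: split the video into consecutive 64-frame clips, run each through the model, and average the predictions. `model` and `preprocess` below are placeholders for the Hiera video model and its preprocessing in the notebook; the exact names and tensor layout there may differ.

```python
import torch

@torch.no_grad()
def predict_video(frames, model, preprocess, clip_len: int = 64):
    """Average model predictions over consecutive 64-frame clips."""
    logits = []
    for start in range(0, frames.shape[0] - clip_len + 1, clip_len):
        clip = frames[start:start + clip_len]
        clip_16x4 = clip[::4][:16]          # 16 frames, 4 frames apart
        x = preprocess(clip_16x4)           # assumed to return a (1, C, 16, H, W) tensor
        logits.append(model(x))
    return torch.stack(logits).mean(dim=0)  # average over clips
```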

sulizhi commented 1 year ago

The problem has been solved! Thank you very much!