MCG-NJU / VideoMAE

[NeurIPS 2022 Spotlight] VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training
https://arxiv.org/abs/2203.12602

BUG: Incorrect temporal indexing? #97

Open rosenfeldamir opened 1 year ago

rosenfeldamir commented 1 year ago

In this function (loadvideo_decord), frames are sampled from the video using the clip length and the frame_sample_rate, and the beginning of the clip is randomized. Let's say for simplicity that the first frame is 0, the clip length is 4, and the frame_sample_rate is 6. I expect to get frames 0, 6, 12, 18. However, I get frames 0, 8, 16, 24, which means the effective frame_sample_rate is 8!
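The arithmetic can be checked with a short sketch. It assumes the indices are drawn with np.linspace over a window of clip_len * frame_sample_rate frames starting at frame 0. The helper name linspace_indices is hypothetical, and any random offset or clipping in the real loader is left out.

```python
import numpy as np

def linspace_indices(clip_len, frame_sample_rate, start=0):
    """Hypothetical stand-in for the sampling step, not the repository code."""
    # Window length as described in the issue: clip_len * frame_sample_rate.
    converted_len = int(clip_len * frame_sample_rate)
    # linspace includes both endpoints, so clip_len points leave clip_len - 1 gaps.
    index = np.linspace(start, start + converted_len, num=clip_len)
    return index.astype(np.int64)

indices = linspace_indices(clip_len=4, frame_sample_rate=6)
# Average spacing is converted_len / (clip_len - 1), not frame_sample_rate.
effective_rate = (4 * 6) / (4 - 1)
print(indices.tolist(), effective_rate)
```

Under these assumptions the spacing is frame_sample_rate * clip_len / (clip_len - 1), which only approaches frame_sample_rate when clip_len is large.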

https://github.com/MCG-NJU/VideoMAE/blob/14ef8d856287c94ef1f985fe30f958eb4ec2c55d/kinetics.py#L222

This also happens for the more "conventional" example of frame_sample_rate = 4 and clip_len = 16, as used in the script for vit_large.

Here, np.diff(index) returns array([4, 4, 4, 5, 4, 4, 4, 5, 4, 4, 4, 5, 4, 4, 4]), because the code attempts to get 16 frames from a range of 64 frames, whereas it should really get them from a range of 60 frames.

I suggest fixing this by changing the line

converted_len = int(self.clip_len * self.frame_sample_rate)

to

converted_len = int((self.clip_len-1) * self.frame_sample_rate)

This is at the very core of VideoMAE. Please correct me if I'm wrong or have misunderstood something.
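The 64-versus-60 point can be illustrated with a small sketch. It assumes evenly spaced indices from np.linspace truncated to integers, starting at frame 0. The helper name spacing is hypothetical, and any clipping or random start in the real loader is omitted, so individual values near the end of the window may differ from what the loader returns.

```python
import numpy as np

def spacing(clip_len, window_len, start=0):
    """Gaps between integer frame indices drawn evenly from a window (hypothetical helper)."""
    index = np.linspace(start, start + window_len, num=clip_len).astype(np.int64)
    return np.diff(index)

clip_len, frame_sample_rate = 16, 4

# Window as described in the issue: clip_len * frame_sample_rate = 64 frames.
current = spacing(clip_len, clip_len * frame_sample_rate)
# Window with the suggested change: (clip_len - 1) * frame_sample_rate = 60 frames.
proposed = spacing(clip_len, (clip_len - 1) * frame_sample_rate)

print(np.unique(current).tolist())   # distinct gaps with the 64-frame window
print(np.unique(proposed).tolist())  # distinct gaps with the 60-frame window
```

In this sketch the 64-frame window mixes gaps of 4 and 5, while the 60-frame window gives a uniform gap of 4.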