The `loadvideo_decord` function samples frames from the video using the clip length and `frame_sample_rate`.
The start of the clip is randomized; for simplicity, let's say the first frame is 0.
Also, assume `clip_len` is 4 and `frame_sample_rate` is 6.
I expect to get frames 0, 6, 12, 18.
However, I get frames 0, 8, 16, 24, which means the effective `frame_sample_rate` is 8! The 4 frames span only 3 intervals, so a range of 4 * 6 = 24 frames is split into steps of 24 / 3 = 8.
Here, with `clip_len` = 16 and `frame_sample_rate` = 4, `np.diff(index)` returns `array([4, 4, 4, 5, 4, 4, 4, 5, 4, 4, 4, 5, 4, 4, 4])`, because the code tries to pick 16 frames from a range of 64 frames, whereas it should pick them from a range of 60 frames (15 intervals of 4).
I suggest fixing this by changing the line
`converted_len = int(self.clip_len * self.frame_sample_rate)`
to
`converted_len = int((self.clip_len - 1) * self.frame_sample_rate)`
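To make the arithmetic concrete, here is a minimal sketch of the two variants. It assumes the indices come from a single `np.linspace` call over `[start, start + converted_len]` with `num=clip_len`; `sample_indices` and its `fixed` flag are hypothetical names used only for illustration, and any clipping or bounds handling in the real function is left out.

```python
import numpy as np


def sample_indices(start, clip_len, frame_sample_rate, fixed=False):
    """Hypothetical helper that mirrors only the linspace step discussed above."""
    # clip_len frames are separated by clip_len - 1 intervals.
    # The current code multiplies by clip_len; the proposed fix uses clip_len - 1.
    n_intervals = clip_len - 1 if fixed else clip_len
    converted_len = int(n_intervals * frame_sample_rate)
    # Evenly spaced positions from start to start + converted_len (both inclusive).
    index = np.linspace(start, start + converted_len, num=clip_len)
    return index.astype(np.int64)


current = sample_indices(0, clip_len=4, frame_sample_rate=6)
proposed = sample_indices(0, clip_len=4, frame_sample_rate=6, fixed=True)

print(current)   # [ 0  8 16 24]
print(proposed)  # [ 0  6 12 18]
```

Under the same assumption, `np.diff(sample_indices(0, 16, 4, fixed=True))` is 4 everywhere, while the unfixed call yields a mix of 4s and 5s.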
This is at the very core of VideoMAE. Please correct me if I'm wrong or have misunderstood something.
Code reference: https://github.com/MCG-NJU/VideoMAE/blob/14ef8d856287c94ef1f985fe30f958eb4ec2c55d/kinetics.py#L222
This also happens for the more "conventional" setting of `frame_sample_rate` = 4 and `clip_len` = 16, as used in the script for vit_large; the `np.diff(index)` output above comes from that setting.