Alibaba-MIIL / STAM

Official implementation of "An Image is Worth 16x16 Words, What is a Video Worth?" (2021 paper)
Apache License 2.0

39.8% of the validation data is not used for performance test #7

Closed FaceAnalysis closed 3 years ago

FaceAnalysis commented 3 years ago

Hi researchers. Great work on getting rid of multi-view inference. I ran into some problems in my experiment:

Many recent methods use the non-local copies of the Kinetics-400 dataset for experiments, since more and more of the original YouTube videos are unavailable. When loading clips from the validation set of the non-local copies with the torchvision.datasets.Kinetics400 API (in src/utils/utils.py), around 39% of the validation data is discarded. In my experiment, top-1 accuracy matches the reported STAM_16 result, but fewer samples are evaluated. Printing len(valid_data) in utils.py shows around 11897 clips when using the non-local copies (19761 videos in total). I believe STAM uses one clip per video, as the paper describes, so it seems the torchvision.datasets.Kinetics400 API drops some videos due to its parameter settings. I also changed extensions=('avi', 'mp4') to extensions=('avi', 'mp4', 'mkv', 'webm') to cover all formats, but 11.5% of the data is still discarded.

Could you explain more about your experiment settings, e.g. the dataset source (official Kinetics download links or non-local copies), how many samples are in the validation set, and the list of validation file names, or make your validation data public if convenient? Thank you.
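For reference, a minimal sketch of the check described above (a reproduction aid, not code from this repo); the root path, frame_rate and frames_per_clip values are assumptions:

```python
# Minimal reproduction sketch (assumed paths and parameter values, not code from this repo):
# load the validation split with torchvision's Kinetics400 reader, extend the accepted
# extensions, and print how many clips survive.
import torchvision

valid_data = torchvision.datasets.Kinetics400(
    root="kinetics400/val",                    # assumed location of the non-local validation copy
    frames_per_clip=16,                        # assumed: one 16-frame clip, as in STAM_16
    frame_rate=1.6,                            # assumed: 16 frames / 1.6 fps spans a 10 s video,
                                               # so each video yields at most one clip
    extensions=("avi", "mp4", "mkv", "webm"),  # extended list mentioned above
)
print(len(valid_data))  # number of clips kept; the report above sees ~11897 of 19761 videos
```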

giladsharir commented 3 years ago

Hi, the issue you raised stems from the torchvision Kinetics400 reader. Since the frames_per_clip and frame_rate args are set, some videos that don't have sufficient frames at the required frame rate will be skipped by the Kinetics400 dataset reader. In our internal version, I fixed this by modifying the torchvision Kinetics400 code to prevent it from skipping videos. I can provide this code.
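In the meantime, here is one hedged sketch of a workaround (not the maintainers' internal fix): bypass VideoClips entirely and sample a fixed number of frame indices per video, repeating frames when a video is too short, so no video is dropped. The paths, folder layout, and parameter values below are assumptions:

```python
# Sketch of a "no-skip" reader (assumptions throughout; not the repo's dataset code).
# Each video is decoded with torchvision.io.read_video and 16 frame indices are
# spread uniformly over it, repeating frames for short videos instead of skipping them.
import os
import torch
from torch.utils.data import Dataset
from torchvision.io import read_video

class NoSkipVideoDataset(Dataset):
    def __init__(self, root, frames_per_clip=16,
                 extensions=(".avi", ".mp4", ".mkv", ".webm"), transform=None):
        self.frames_per_clip = frames_per_clip
        self.transform = transform
        self.samples = []  # (video_path, class_index), assuming a class-per-folder layout
        classes = sorted(d for d in os.listdir(root) if os.path.isdir(os.path.join(root, d)))
        self.class_to_idx = {c: i for i, c in enumerate(classes)}
        for c in classes:
            cdir = os.path.join(root, c)
            for fname in sorted(os.listdir(cdir)):
                if fname.lower().endswith(extensions):
                    self.samples.append((os.path.join(cdir, fname), self.class_to_idx[c]))

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        path, label = self.samples[idx]
        video, _, _ = read_video(path, pts_unit="sec")  # (T, H, W, C) uint8 frames
        t = video.shape[0]
        if t == 0:
            raise RuntimeError(f"failed to decode any frames from {path}")
        # Uniformly spread indices over the whole video; short videos repeat frames
        # rather than being discarded.
        indices = torch.linspace(0, t - 1, self.frames_per_clip).long()
        clip = video[indices]
        if self.transform is not None:
            clip = self.transform(clip)
        return clip, label
```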

mmazeika commented 2 years ago

Hello,

Could you provide the modified torchvision.Kinetics400 code? Was the modified code used for training STAM, or did you use the standard Kinetics400 dataset (i.e., a VideoClips object with frame_rate set to 1.6)?
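For reference, the standard setup being asked about would look roughly like this (the path, folder layout, and glob pattern below are assumptions; 1.6 fps corresponds to 16 frames spanning a 10-second Kinetics video):

```python
# Rough illustration of the "standard" VideoClips usage referred to above (assumed paths).
import glob
from torchvision.datasets.video_utils import VideoClips

video_paths = sorted(glob.glob("kinetics400/val/*/*.mp4"))  # assumed directory layout
video_clips = VideoClips(
    video_paths,
    clip_length_in_frames=16,
    frames_between_clips=1,
    frame_rate=1.6,  # 16 frames / 1.6 fps = 10 s, i.e. one clip covering a full-length video
)
print(video_clips.num_clips())  # videos too short at 1.6 fps contribute no clip
```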