OpenGVLab / unmasked_teacher

[ICCV2023 Oral] Unmasked Teacher: Towards Training-Efficient Video Foundation Models
https://arxiv.org/abs/2303.16058
MIT License

Question about sparse sampling #17

Closed yuu2704 closed 8 months ago

yuu2704 commented 9 months ago

Thank you for sharing great work.

I understand that sparse sampling means sampling N frames from the entire video at equal intervals. Am I correct in understanding that in this case, even for relatively long video datasets such as ActivityNet-QA and ActivityNet Captions, only 8 to 16 frames from the entire video are sampled and used, just like any other dataset? Sorry to ask something so elementary.

Thanks again for your great work.

Andy1621 commented 8 months ago

Sorry for the late response. During training, sparse sampling splits the video into N segments and randomly samples one frame from each segment. During testing, it works as you say.

As for ActivityNet, yes, we use 12 frames as for the other datasets. I have tried using more frames, but it did not work.
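The sampling scheme described above can be sketched as follows. This is a minimal illustration, not the repository's actual code; the function name and arguments are hypothetical:

```python
import random

def sparse_sample(num_total_frames, num_segments, train=True):
    """Split the video into num_segments equal segments and pick one
    frame index per segment: a random frame within the segment during
    training, the middle frame during testing (illustrative sketch)."""
    seg_size = num_total_frames / num_segments
    indices = []
    for i in range(num_segments):
        start = int(i * seg_size)
        end = max(start, int((i + 1) * seg_size) - 1)
        if train:
            # random frame inside this segment (training)
            indices.append(random.randint(start, end))
        else:
            # middle frame of this segment (testing)
            indices.append((start + end) // 2)
    return indices

# e.g. a 3-minute video at ~30 fps has ~5400 frames, of which only 12 are used:
frame_ids = sparse_sample(5400, 12, train=False)
```

Randomizing within each segment at training time acts as temporal augmentation, while the deterministic middle-frame choice at test time makes evaluation reproducible.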

yuu2704 commented 8 months ago

Thank you for your response. It is very interesting that a 3-minute video can be retrieved with only 12 frames, and that using more frames does not work well.