YapengTian / AVVP-ECCV20

Unified Multisensory Perception: Weakly-Supervised Audio-Visual Video Parsing, ECCV, 2020. (Spotlight)

question about visual feature dimension #8

Closed catherine-qian closed 3 years ago

catherine-qian commented 3 years ago

Dear author,

Thanks a lot for your contribution!

One question about the code implementation, regarding nets/net_audiovisual -> class MMIL_Net():

Why does it take three inputs, (1) audio, (2) visual, and (3) visual_st, with feature dimensions ([16, 10, 128]), ([16, 80, 2048]), and ([16, 10, 512])? Could you please explain the difference between visual and visual_st, and what their last two dimensions mean?

Thanks in advance for your help!

YapengTian commented 3 years ago

visual: spatial feature extracted by ResNet152 from 80 frames -> 80x2048

visual_st: spatio-temporal feature extracted by R2Plus1D from 10 8-frame clips -> 10x512

The scripts can be found at https://github.com/YapengTian/AVVP-ECCV20/tree/master/scripts
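To make the shapes above concrete, here is a minimal pure-Python sketch of the bookkeeping. It assumes the leading 16 in the question is the batch size and that frames 8*i through 8*i+7 belong to clip i; neither is stated in this thread. The helper names `expected_shapes` and `pool_frames_to_segments` are hypothetical, and averaging each group of 8 frame vectors is only one plausible way to bring the 80-step `visual` feature onto the same 10-step time axis as `audio` and `visual_st`, not necessarily what MMIL_Net does.

```python
# Shape bookkeeping for the three inputs discussed in this thread.
# Assumptions (not stated in the thread): the leading 16 is the batch size,
# and frames 8*i .. 8*i+7 of the 80 frames belong to clip (segment) i.

NUM_SEGMENTS = 10        # 10 clips per video (from the reply above)
FRAMES_PER_SEGMENT = 8   # 8 frames per clip (from the reply above)


def expected_shapes(batch_size=16):
    """Return the (batch, time, feature_dim) shape of each input."""
    return {
        "audio": (batch_size, NUM_SEGMENTS, 128),
        "visual": (batch_size, NUM_SEGMENTS * FRAMES_PER_SEGMENT, 2048),
        "visual_st": (batch_size, NUM_SEGMENTS, 512),
    }


def pool_frames_to_segments(frame_features, frames_per_segment=FRAMES_PER_SEGMENT):
    """Average consecutive frame vectors so that each segment gets one vector.

    frame_features: list of T vectors (lists of floats), with T divisible
    by frames_per_segment. Returns T // frames_per_segment vectors.
    This is a hypothetical alignment step, used here for illustration only.
    """
    if len(frame_features) % frames_per_segment != 0:
        raise ValueError("number of frames must be a multiple of frames_per_segment")
    segments = []
    for start in range(0, len(frame_features), frames_per_segment):
        group = frame_features[start:start + frames_per_segment]
        dim = len(group[0])
        segments.append([sum(vec[d] for vec in group) / len(group) for d in range(dim)])
    return segments


if __name__ == "__main__":
    print(expected_shapes())
    # Toy stand-in for one video's frame features: 80 frames, 4 dims each.
    toy_frames = [[float(t)] * 4 for t in range(80)]
    pooled = pool_frames_to_segments(toy_frames)
    print(len(pooled), len(pooled[0]))
```

With the toy input, the 80 frame vectors collapse to 10 segment vectors of unchanged dimension, which mirrors how an 80x2048 feature could be reduced to 10x2048 before being combined with the 10-step streams.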

catherine-qian commented 3 years ago

Thanks for your explanation!