Closed catherine-qian closed 3 years ago
visual: spatial feature extracted by ResNet152 from 80 frames -> 80x2048
visual_st: spatio-temporal feature extracted by R2Plus1D from 10 8-frame clips -> 10x512
The scripts can be found at https://github.com/YapengTian/AVVP-ECCV20/tree/master/scripts
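To make the shapes concrete, here is a minimal sketch using random placeholder arrays instead of the real features. It assumes the 80 frames are ordered so that each consecutive group of 8 frames corresponds to one of the 10 clips. The helper `pool_frames_to_clips` and the final concatenation are hypothetical illustrations of one way to bring the two visual streams onto the same 10-step time axis, not necessarily what MMIL_Net does internally.

```python
import numpy as np

BATCH = 16            # batch size, taken from the shapes quoted in the question
NUM_STEPS = 10        # time steps shared by audio and visual_st (10 clips)
FRAMES_PER_CLIP = 8   # each clip covers 8 frames
NUM_FRAMES = NUM_STEPS * FRAMES_PER_CLIP  # 80 frames in total

# Random placeholders standing in for the precomputed features.
rng = np.random.default_rng(0)
audio = rng.standard_normal((BATCH, NUM_STEPS, 128)).astype(np.float32)
visual = rng.standard_normal((BATCH, NUM_FRAMES, 2048)).astype(np.float32)  # ResNet152, one vector per frame
visual_st = rng.standard_normal((BATCH, NUM_STEPS, 512)).astype(np.float32)  # R2Plus1D, one vector per clip


def pool_frames_to_clips(frame_feats, frames_per_clip):
    """Average consecutive frame features so there is one vector per clip.

    Hypothetical helper: assumes frames are stored in temporal order.
    """
    batch, num_frames, dim = frame_feats.shape
    if num_frames % frames_per_clip != 0:
        raise ValueError("number of frames must be a multiple of frames_per_clip")
    grouped = frame_feats.reshape(batch, num_frames // frames_per_clip, frames_per_clip, dim)
    return grouped.mean(axis=2)


visual_per_clip = pool_frames_to_clips(visual, FRAMES_PER_CLIP)
print(visual_per_clip.shape)  # (16, 10, 2048)

# With matching time axes, the two visual streams can be joined per clip.
combined = np.concatenate([visual_per_clip, visual_st], axis=-1)
print(combined.shape)  # (16, 10, 2560)
```

In each shape the first dimension is the batch, the second is the number of time steps (frames or clips), and the last is the feature size per step.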
Thanks for your explanation!
Dear author,
Thanks a lot for your contribution!
One question about the code implementation: in nets/net_audiovisual, why does class MMIL_Net() take three inputs, (1) audio, (2) visual, and (3) visual_st, with feature dimensions ([16, 10, 128]), ([16, 80, 2048]), and ([16, 10, 512])?
Could you please explain the difference between visual and visual_st? And what do their last two dimensions mean?
Thanks in advance for your help!