cvondrick / soundnet

SoundNet: Learning Sound Representations from Unlabeled Video. NIPS 2016
http://projects.csail.mit.edu/soundnet/
MIT License

A question about the output of visual CNN. #13

Open pangwenfeng opened 6 years ago

pangwenfeng commented 6 years ago

Hi, thanks for the nice paper. I have a question: in the paper you say that the number of frames varies from video to video. How do you fuse the visual CNN outputs from the different frames so that the final output has a constant length? Do you simply average them, or do you do something else? Thank you very much.
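
To make the averaging option concrete, here is a minimal sketch. It is only an illustration of the guess in the question, not necessarily what the paper does. The function name `average_frame_outputs` and the toy 3-class outputs are hypothetical.

```python
from typing import List, Sequence


def average_frame_outputs(frame_outputs: Sequence[Sequence[float]]) -> List[float]:
    """Average per-frame CNN outputs into one fixed-length vector.

    Assumption: each frame yields a vector of the same length
    (e.g. a class distribution), and only the frame count varies.
    """
    if not frame_outputs:
        raise ValueError("need at least one frame")
    num_classes = len(frame_outputs[0])
    if any(len(frame) != num_classes for frame in frame_outputs):
        raise ValueError("all frames must have the same output length")
    num_frames = len(frame_outputs)
    # Element-wise mean over the frame axis.
    return [
        sum(frame[c] for frame in frame_outputs) / num_frames
        for c in range(num_classes)
    ]


# Two toy "videos" with different frame counts (made-up numbers).
short_video = [[0.7, 0.2, 0.1], [0.5, 0.3, 0.2]]
long_video = [[0.6, 0.3, 0.1]] * 5

# Both results have length 3, regardless of how many frames went in.
print(len(average_frame_outputs(short_video)), len(average_frame_outputs(long_video)))
```

With this kind of fusion the output length depends only on the per-frame output size, not on the number of frames, which is the property being asked about.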