andrewowens / multisensory

Code for the paper: Audio-Visual Scene Analysis with Self-Supervised Multisensory Features
http://andrewowens.com/multisensory/
Apache License 2.0

model architecture #45

Open riyaj8888 opened 3 years ago

riyaj8888 commented 3 years ago

Can anyone briefly explain how the audio and video features are fused together (avts)? Please use the image above, which is from the original paper, as a reference.

riyaj8888 commented 3 years ago

we apply a small number of 3D convolution and pooling operations to the video stream, reducing its temporal sampling rate by a factor of 4. We also apply a series of strided 1D convolutions to the input waveform, until its sampling rate matches that of the video network. We fuse the two subnetworks by concatenating their activations channel-wise, after spatially tiling the audio activations.

I don't understand this part of the paper. Could someone please explain it?
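
One possible reading of the quoted passage, as a minimal NumPy sketch rather than the repository's actual code: once both streams have the same number of time steps, the audio activations have no spatial dimensions, so each audio feature vector is copied to every spatial location of the video feature map ("spatially tiling"), and the two are then stacked along the channel axis. The function name `fuse_audio_video`, the channels-last layout, and all tensor sizes below are assumptions made for illustration; they are not taken from the paper or the repo.

```python
import numpy as np


def fuse_audio_video(video_feat, audio_feat):
    """Tile audio features over space and concatenate them with video features.

    video_feat: array of shape (T, H, W, Cv), video activations (assumed layout).
    audio_feat: array of shape (T, Ca), audio activations at the same T.
    Returns an array of shape (T, H, W, Cv + Ca).
    """
    t, h, w, _ = video_feat.shape
    if audio_feat.shape[0] != t:
        raise ValueError("audio and video must have the same number of time steps")

    # Add singleton height/width axes: (T, Ca) -> (T, 1, 1, Ca).
    audio_expanded = audio_feat[:, None, None, :]
    # Repeat the same audio vector at every spatial location: (T, H, W, Ca).
    audio_tiled = np.tile(audio_expanded, (1, h, w, 1))
    # Channel-wise concatenation: (T, H, W, Cv + Ca).
    return np.concatenate([video_feat, audio_tiled], axis=-1)


# Hypothetical sizes, chosen only to show the shapes.
rng = np.random.default_rng(0)
video_feat = rng.random((16, 28, 28, 64))   # T=16, H=W=28, Cv=64
audio_feat = rng.random((16, 128))          # T=16, Ca=128
fused = fuse_audio_video(video_feat, audio_feat)
print(fused.shape)  # (16, 28, 28, 192)
```

In this sketch, every spatial position at a given time step sees the same audio vector next to its own video features, so the layers after the fusion can relate what is heard to where it appears in the frame.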