riyaj8888 opened this issue 3 years ago
we apply a small number of 3D convolution and pooling operations to the video stream, reducing its temporal sampling rate by a factor of 4. We also apply a series of strided 1D convolutions to the input waveform, until its sampling rate matches that of the video network. We fuse the two subnetworks by concatenating their activations channel-wise, after spatially tiling the audio activations.
I don't understand this part of the paper; please help me with it.
Can anyone briefly explain how the audio and video features are fused together? Please use the image above as a reference; it is from the original paper.
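To make the question concrete, here is how I currently read that paragraph, written as a minimal PyTorch sketch. The channel counts, kernel sizes, and strides below are made up for illustration and are not taken from the paper's code; I only tried to capture the idea of downsampling video time by 4, downsampling the waveform with strided 1D convolutions, spatially tiling the audio activations, and concatenating channel-wise. Is this the right interpretation?

```python
# Hypothetical sketch of the fusion step (not the paper's implementation).
import torch
import torch.nn as nn
import torch.nn.functional as F

class AudioVideoFusion(nn.Module):
    def __init__(self, vid_channels=64, aud_channels=64):
        super().__init__()
        # Video branch: a few 3D convs; the temporal strides (2 * 2) reduce
        # the frame rate by an overall factor of 4.
        self.video_net = nn.Sequential(
            nn.Conv3d(3, vid_channels, kernel_size=3, stride=(2, 2, 2), padding=1),
            nn.ReLU(),
            nn.Conv3d(vid_channels, vid_channels, kernel_size=3, stride=(2, 1, 1), padding=1),
            nn.ReLU(),
        )
        # Audio branch: strided 1D convs applied to the raw waveform so that
        # its temporal sampling rate approaches that of the video features.
        self.audio_net = nn.Sequential(
            nn.Conv1d(1, aud_channels, kernel_size=65, stride=4, padding=32),
            nn.ReLU(),
            nn.Conv1d(aud_channels, aud_channels, kernel_size=15, stride=4, padding=7),
            nn.ReLU(),
        )

    def forward(self, video, audio):
        # video: (B, 3, T_v, H, W); audio: (B, 1, T_a) raw waveform
        v = self.video_net(video)   # (B, Cv, T, H', W')
        a = self.audio_net(audio)   # (B, Ca, T_a')
        # If the strides don't line up exactly, resample the audio features
        # along time to match the video feature rate.
        a = F.interpolate(a, size=v.shape[2], mode="linear", align_corners=False)
        # Spatially tile the audio features: each audio time step is copied
        # to every (H', W') location of the corresponding video frame.
        a = a[:, :, :, None, None].expand(-1, -1, -1, v.shape[3], v.shape[4])
        # Fuse by channel-wise concatenation: (B, Cv + Ca, T, H', W')
        return torch.cat([v, a], dim=1)
```

In particular, my understanding is that each audio feature vector is broadcast to every spatial position of the video feature map at the same time step, which is what makes the channel-wise concatenation possible. Please correct me if that is not what the paper means.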