Self-Supervised Learning Using $L^{3}$-Net for Audio-Visual Correspondence Task (AVC). Audio-visual correspondence task (AVC): By seeing and hearing many unlabelled examples, a network should learn to determine whether a pair of (video frame, short audio clip) correspond to each other or not.
In this story, Look, Listen and Learn ($L^{3}$-Net), by DeepMind and VGG at the University of Oxford, is reviewed. The paper considers the question of what can be learnt by looking at and listening to a large number of unlabelled videos.
The Audio-Visual Correspondence (AVC) learning task is introduced to train visual and audio networks from scratch, without any supervision other than the raw unconstrained videos themselves, resulting in good visual and audio representations.
This is a paper in 2017 ICCV with over 400 citations.
By seeing and hearing many examples of a person playing a violin and examples of a dog barking, and never, or at least very infrequently, seeing a violin being played while hearing a dog bark and vice versa, it should be possible to learn which visual appearances correspond to which sounds.
The AVC task is a simple binary classification task: given an example video frame and a short audio clip — decide whether they correspond to each other or not.
The corresponding (positive) pairs are the ones that are taken at the same time from the same video, while mismatched (negative) pairs are extracted from different videos.
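The positive/negative sampling described above can be sketched as follows. This is a minimal illustration, not the paper's pipeline; the data layout (each video as a dict of time-aligned `frames` and `audio_clips` lists) is a hypothetical stand-in.

```python
import random

# Sketch of AVC pair sampling (hypothetical data layout: each video is a
# dict of time-aligned "frames" and "audio_clips" lists).
def sample_pair(videos, correspond, rng=random):
    if correspond:
        # Positive pair: frame and short audio clip taken at the same
        # time from the same video.
        v = rng.choice(videos)
        t = rng.randrange(len(v["frames"]))
        return v["frames"][t], v["audio_clips"][t], 1
    # Negative pair: frame and audio clip drawn from two different videos.
    v1, v2 = rng.sample(videos, 2)
    return rng.choice(v1["frames"]), rng.choice(v2["audio_clips"]), 0
```

The returned label (1 for corresponding, 0 for mismatched) is exactly the binary target the network is trained on.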
$L^{3}$-Net: Network Architecture.
The network has three distinct parts: the vision subnetwork, the audio subnetwork, and the fusion layers that combine the two representations to produce the correspondence decision.
Maybe we should convert MEG/EEG signals into a log-spectrogram? Would that improve performance?
Audio involves time information, while a frame is just a single image!
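For concreteness, a log-spectrogram can be computed with a Hann-windowed STFT. The parameter values below are illustrative, not the exact $L^{3}$-Net settings (the paper works on a roughly 1-second clip turned into a log-spectrogram image).

```python
import numpy as np

def log_spectrogram(signal, win=512, hop=240, eps=1e-7):
    """Minimal log-spectrogram via a Hann-windowed short-time FFT.

    win/hop are illustrative choices, not the paper's exact settings.
    """
    window = np.hanning(win)
    n_frames = 1 + (len(signal) - win) // hop
    frames = np.stack([signal[i * hop : i * hop + win] * window
                       for i in range(n_frames)])
    spec = np.abs(np.fft.rfft(frames, axis=1))      # magnitude spectrogram
    return np.log(spec + eps).T                     # (freq_bins, time_frames)

# One second of a 440 Hz tone at a 48 kHz sampling rate.
one_second = np.sin(2 * np.pi * 440 * np.arange(48000) / 48000)
S = log_spectrogram(one_second)
```

The resulting 2-D array can then be treated as a single-channel image by the audio subnetwork, which is why a convolutional tower works on audio at all.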
The network was trained on 16 GPUs in parallel with synchronous training implemented in TensorFlow, where each worker processed a 16-element batch, for an effective batch size of 256.
For a training set of 400k 10-second videos, the network was trained for two days, during which it saw 60M frame-audio pairs.
Two video datasets are used for training the networks: Flickr-SoundNet and Kinetics-Sounds.
Flickr-SoundNet: a large unlabelled dataset of completely unconstrained videos from Flickr, used for the transfer learning experiments.
Kinetics-Sounds: a labelled dataset for quantitative evaluation. A subset (much smaller than Flickr-SoundNet) of the Kinetics dataset is used, which contains YouTube videos manually annotated for 10-second cropped human actions.
The subset contains 19k 10-second video clips (15k training, 1.9k validation, 1.9k test) formed by filtering the Kinetics dataset for 34 human action classes, such as:
It still contains considerable noise, e.g. the bowling action is often accompanied by loud music at the bowling alley, human voices (camera operators or video narrations) often mask the sound of interest, and many videos contain a musical soundtrack.
Audio-visual correspondence (AVC) results.
(Tables omitted: AVC results by architecture and model.)
The supervised baselines do not beat the $L^{3}$-Net: “supervised pretraining” performs on par with it, while “supervised direct combination” works significantly worse because, unlike “supervised pretraining”, it has not been trained for the AVC task.
After self-supervised training in the above AVC experiment, the subnetworks should be well pretrained, and can be used as fixed feature extractors for downstream tasks such as sound and visual classification.
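The "frozen encoder plus simple classifier" idea can be sketched numerically. The random projection below is a hypothetical stand-in for a pretrained $L^{3}$-Net tower; only the linear probe on top is trained, which is the essence of the transfer evaluations that follow.

```python
import numpy as np

rng = np.random.default_rng(0)

# Frozen "pretrained" encoder: a fixed random projection standing in for an
# L^3-Net subnetwork (hypothetical placeholder; its weights are never updated).
W_frozen = rng.standard_normal((64, 512)) * 0.1

def frozen_features(x):
    return np.maximum(x @ W_frozen, 0.0)

# Toy 2-class data; a logistic-regression probe is trained on frozen features.
X = rng.standard_normal((200, 64))
y = (X[:, 0] > 0).astype(float)
F = frozen_features(X)

w = np.zeros(512)
b = 0.0
for _ in range(300):                       # plain gradient descent on the probe
    p = 1.0 / (1.0 + np.exp(-(F @ w + b)))
    w -= 0.5 * (F.T @ (p - y)) / len(y)
    b -= 0.5 * np.mean(p - y)

acc = np.mean(((1.0 / (1.0 + np.exp(-(F @ w + b)))) > 0.5) == y)
```

Only `w` and `b` are updated; the encoder stays fixed, mirroring how the pretrained subnetworks are evaluated.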
Sound Classification.
(Tables omitted: evaluation dataset, audio representation, and sound-classification results.)
The proposed $L^{3}$-training (Ours) sets the new state-of-the-art by a large margin on both benchmarks.
Visual classification on ImageNet.
(Tables omitted: visual representation, model, and ImageNet classification results.)
It is impressive that the proposed visual features, $L^{3}$-Net-trained on Flickr videos, perform on par with the self-supervised state-of-the-art trained on ImageNet.
Seems like what we are doing with THINGS-MEG:
Learnt visual concepts.
The above figure shows the images that activate particular units in pool4 the most (i.e. are ranked highest by its magnitude).
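The ranking described above can be sketched as follows: each image is scored by the maximum spatial magnitude of a chosen unit's activation map, and the top-scoring images are retrieved. The activation tensor here is random placeholder data with a hypothetical pool4 shape.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical pool4 activations for 1000 images: (images, units, h, w).
activations = rng.standard_normal((1000, 256, 14, 14))

def top_images_for_unit(acts, unit, k=5):
    """Indices of the k images that activate `unit` the most, scoring each
    image by the maximum spatial magnitude of that unit's activation map."""
    scores = np.abs(acts[:, unit]).max(axis=(1, 2))
    return np.argsort(scores)[::-1][:k]

top5 = top_images_for_unit(activations, unit=42)
```

Displaying the retrieved images (as in the figure) then reveals what visual concept, if any, the unit has learnt to respond to.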
Visual semantic heatmap.
Learnt audio concepts.
The above images display the video frames that correspond to the sounds.
Audio semantic heatmaps.
The above figure shows spectrograms and their semantic heatmaps.
Sik-Ho Tang. Review — Look, Listen and Learn (Self-Supervised Learning).