NorbertZheng / read-papers

My paper reading notes.

Sik-Ho Tang | Review -- Look, Listen and Learn (Self-Supervised Learning). #125

Closed NorbertZheng closed 1 year ago

NorbertZheng commented 1 year ago

Sik-Ho Tang. Review — Look, Listen and Learn (Self-Supervised Learning).

NorbertZheng commented 1 year ago

Overview

Self-Supervised Learning Using $L^{3}$-Net for the Audio-Visual Correspondence Task (AVC).

Figure: Audio-visual correspondence task (AVC). By seeing and hearing many unlabelled examples, a network should learn to determine whether a pair of (video frame, short audio clip) correspond to each other or not.

In this story, Look, Listen and Learn ($L^{3}$-Net), by DeepMind and VGG at the University of Oxford, is reviewed. In this paper:

The Audio-Visual Correspondence (AVC) learning task is introduced to train visual and audio networks from scratch, without any additional supervision other than the raw unconstrained videos themselves, resulting in good visual and audio representations.

This is a 2017 ICCV paper with over 400 citations.

NorbertZheng commented 1 year ago

Core Idea

Binary Classification Task

By seeing and hearing many examples of a person playing a violin and examples of a dog barking, and never, or at least very infrequently, seeing a violin being played while hearing a dog bark and vice versa, it should be possible to conclude what a violin and a dog look and sound like, without ever being explicitly taught.

The AVC task is a simple binary classification task: given an example video frame and a short audio clip, decide whether they correspond to each other or not.

NorbertZheng commented 1 year ago

Difficulties

The corresponding (positive) pairs are the ones that are taken at the same time from the same video, while mismatched (negative) pairs are extracted from different videos.
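As a rough illustration, here is a minimal sampling sketch in Python; the `videos` collection and its `random_time` / `frame_at` / `audio_clip_at` helpers are hypothetical stand-ins for a real video loader, not the paper's code:

```python
import random

def sample_avc_pair(videos, correspond: bool):
    """Sample one (frame, audio, label) example for the AVC task."""
    if correspond:
        # Positive pair: frame and audio taken at the same time
        # from the same video.
        v = random.choice(videos)
        t = v.random_time()
        frame, audio = v.frame_at(t), v.audio_clip_at(t)
        label = 1
    else:
        # Negative pair: frame and audio come from two different videos.
        v1, v2 = random.sample(videos, 2)
        frame = v1.frame_at(v1.random_time())
        audio = v2.audio_clip_at(v2.random_time())
        label = 0
    return frame, audio, label
```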

NorbertZheng commented 1 year ago

$L^{3}$-Net: Network Architecture

Figure: $L^{3}$-Net network architecture.

The network has three distinct parts, described in turn below: the vision subnetwork, the audio subnetwork, and the fusion network.

Vision Subnetwork

NorbertZheng commented 1 year ago

Audio Subnetwork

NorbertZheng commented 1 year ago

Maybe we should convert MEG/EEG signals into a log-spectrogram? Would that improve the performance?
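For reference, a minimal sketch of such a conversion using `scipy.signal.spectrogram`; the window and overlap values here are illustrative assumptions, not settings from the paper:

```python
import numpy as np
from scipy.signal import spectrogram

def log_spectrogram(x, fs, nperseg=256, noverlap=128, eps=1e-10):
    """Turn a 1-D signal (audio, or one MEG/EEG channel) into a log-spectrogram."""
    f, t, Sxx = spectrogram(x, fs=fs, nperseg=nperseg, noverlap=noverlap)
    return np.log(Sxx + eps)  # eps avoids log(0) in silent bins

# e.g. one second of 48 kHz audio -> a 2-D array usable as a greyscale image
spec = log_spectrogram(np.random.randn(48000), fs=48000)
```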

NorbertZheng commented 1 year ago

Fusion Network
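Putting the three parts together, here is a minimal PyTorch-style sketch of the overall design. The VGG-style blocks, channel widths, 128-D fusion layer, and 2-way output follow the paper's description, but the details below are simplifying assumptions, not the exact implementation:

```python
import torch
import torch.nn as nn

def conv_block(c_in, c_out):
    # VGG-style block: two 3x3 convs followed by max-pooling.
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, 3, padding=1), nn.BatchNorm2d(c_out), nn.ReLU(),
        nn.Conv2d(c_out, c_out, 3, padding=1), nn.BatchNorm2d(c_out), nn.ReLU(),
        nn.MaxPool2d(2),
    )

class L3Net(nn.Module):
    def __init__(self):
        super().__init__()
        # Vision subnetwork: RGB frame -> 512-D embedding.
        self.vision = nn.Sequential(
            conv_block(3, 64), conv_block(64, 128),
            conv_block(128, 256), conv_block(256, 512),
            nn.AdaptiveMaxPool2d(1), nn.Flatten(),
        )
        # Audio subnetwork: 1-channel log-spectrogram -> 512-D embedding.
        self.audio = nn.Sequential(
            conv_block(1, 64), conv_block(64, 128),
            conv_block(128, 256), conv_block(256, 512),
            nn.AdaptiveMaxPool2d(1), nn.Flatten(),
        )
        # Fusion network: concatenate the two embeddings, then a small MLP
        # ending in 2-way (correspond / mismatch) logits.
        self.fusion = nn.Sequential(
            nn.Linear(1024, 128), nn.ReLU(), nn.Linear(128, 2),
        )

    def forward(self, frame, spec):
        z = torch.cat([self.vision(frame), self.audio(spec)], dim=1)
        return self.fusion(z)  # train with nn.CrossEntropyLoss
```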

NorbertZheng commented 1 year ago

Training Data Sampling & Datasets

Training Data Sampling & Other Details

Note that audio involves time information, while a frame is just a single static image!

The network was trained on 16 GPUs in parallel with synchronous training implemented in TensorFlow, where each worker processed a 16-element batch, giving an effective batch size of 256.

For a training set of 400k 10-second videos, the network is trained for two days, during which it sees 60M frame-audio pairs.
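As a quick sanity check on those numbers: 60M pairs at an effective batch size of 256 corresponds to roughly $60 \times 10^{6} / 256 \approx 234\mathrm{k}$ synchronous optimization steps over the two days.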

NorbertZheng commented 1 year ago

Datasets

Two video datasets are used for training the networks: Flickr-SoundNet and Kinetics-Sounds.

Flickr-SoundNet

This is a large unlabelled dataset of completely unconstrained videos from Flickr.

This dataset is used for the transfer learning experiments.

Kinetics-Sounds

This is a labelled dataset used for quantitative evaluation. A subset (much smaller than Flickr-SoundNet) of the Kinetics dataset is used, which contains YouTube videos with manually annotated human actions, cropped to 10 seconds.

The subset contains 19k 10-second video clips (15k training, 1.9k validation, 1.9k test) formed by filtering the Kinetics dataset for 34 human action classes, such as:

It still contains considerable noise, e.g.: the bowling action is often accompanied by loud music at the bowling alley, human voices (camera operators or video narrations) often mask the sound of interest, and many videos contain sound tracks.

NorbertZheng commented 1 year ago

Audio-Visual Correspondence (AVC) Results

Figure: Audio-visual correspondence (AVC) results.

Results:

Architecture:

Model:

The supervised baselines do not beat the $L^{3}$-Net: “supervised pretraining” performs on par with it, while “supervised direct combination” works significantly worse because, unlike “supervised pretraining”, it has not been trained for the AVC task.

NorbertZheng commented 1 year ago

Transfer Learning Results

After self-supervised training in the above AVC experiment, the subnetworks should be well pretrained and can be reused as feature extractors for downstream tasks.
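A minimal sketch of this kind of reuse as a linear probe on the audio subnetwork; `L3Net` is the hypothetical module from the architecture sketch above and `loader` a stand-in dataset iterator (the paper itself trains a linear classifier on pooled, normalized activations rather than this exact setup):

```python
import torch
import torch.nn as nn

# Freeze the pretrained audio subnetwork and train only a linear classifier
# on top of its 512-D embeddings (e.g. for ESC-50's 50 classes).
model = L3Net()          # assume weights restored from AVC pretraining
model.eval()             # also freezes batch-norm statistics
for p in model.audio.parameters():
    p.requires_grad = False

probe = nn.Linear(512, 50)
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for spec, label in loader:  # yields (log-spectrogram batch, class labels)
    with torch.no_grad():
        feat = model.audio(spec)
    loss = loss_fn(probe(feat), label)
    opt.zero_grad(); loss.backward(); opt.step()
```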

Audio Features on ESC-50 & DCASE

Figure: Sound classification.

Dataset:

Representation:

Results:

The proposed $L^{3}$-training (Ours) sets a new state of the art by a large margin on both benchmarks.

NorbertZheng commented 1 year ago

Video Features on ImageNet

Figure: Visual classification on ImageNet.

Representation:

Model:

Results:

It is impressive that the proposed visual features, trained with $L^{3}$-Net on Flickr videos, perform on par with the self-supervised state-of-the-art trained on ImageNet.

NorbertZheng commented 1 year ago

Seems like what we are doing with THINGS-MEG:

NorbertZheng commented 1 year ago

Qualitative Results

Visual Features

Figure: Learnt visual concepts.

The above figure shows the images that activate particular units in pool4 the most (i.e. the images ranked highest by that unit's activation magnitude).
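A minimal sketch of how such a ranking could be computed; `vision_to_pool4` is a hypothetical hook exposing the pool4 feature map, not an actual attribute of the model:

```python
import heapq
import torch

def top_activating_images(model, images, unit, k=5):
    """Rank images by the maximum activation of one pool4 unit."""
    scored = []
    with torch.no_grad():
        for i, img in enumerate(images):
            fmap = model.vision_to_pool4(img.unsqueeze(0))  # (1, C, H, W)
            score = fmap[0, unit].max().item()  # this unit's peak activation
            scored.append((score, i))
    return heapq.nlargest(k, scored)  # top-k (score, image index) pairs
```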

Figure: Visual semantic heatmap.

NorbertZheng commented 1 year ago

Audio Features

Figure: Learnt audio concepts.

Since sound cannot be shown directly, the above figure displays the video frames that correspond to the sounds.

Figure: Audio semantic heatmaps.

The above figure shows spectrograms and their semantic heatmaps.

NorbertZheng commented 1 year ago

Reference