NorbertZheng / read-papers

My paper reading notes.

Sik-Ho Tang | Review -- Multi-task Self-Supervised Visual Learning. #128

Closed NorbertZheng closed 11 months ago

NorbertZheng commented 11 months ago

Sik-Ho Tang. Review — Multi-task Self-Supervised Visual Learning.

NorbertZheng commented 11 months ago

Overview

Pretrain Using Multiple Pretext Tasks to Improve Downstream Task Accuracy.

Multi-task Self-Supervised Visual Learning. Doersch ICCV'17, by DeepMind and VGG, University of Oxford. 2017 ICCV, Over 400 Citations. Self-Supervised Learning, Representation Learning, Image Classification, Object Detection, Depth Prediction.

NorbertZheng commented 11 months ago

A joint loss that includes 4 task losses???
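As a minimal sketch (PyTorch assumed; the equal weighting below is an assumption, not the paper's exact scheme), the joint objective can be written as a weighted sum of the four pretext-task losses:

```python
import torch

def joint_loss(task_losses: dict[str, torch.Tensor],
               task_weights: dict[str, float] | None = None) -> torch.Tensor:
    """Combine per-task losses into one scalar objective.

    `task_losses` maps task names (e.g. "relative_position", "colorization",
    "exemplar", "motion_segmentation") to scalar loss tensors; by default every
    task gets weight 1.0 (an assumption, not the paper's tuned weighting).
    """
    task_weights = task_weights or {name: 1.0 for name in task_losses}
    return sum(task_weights[name] * loss for name, loss in task_losses.items())
```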

NorbertZheng commented 11 months ago

Multi-Task Network

Figure: The structure of our multi-task network based on ResNet-101, with block 3 having 23 residual units. a) Naïve shared-trunk approach, where each “head” is attached to the output of block 3. b) The lasso architecture, where each “head” receives a linear combination of unit outputs within block 3, weighted by the matrix $\alpha$, which is trained to be sparse.

Three architectures are described:

NorbertZheng commented 11 months ago

Common Trunk

Model:

One embedding for all tasks!!!

Each task has a separate loss and its own extra layers in a “head,” which may have a complicated structure.

Implementation:

Four self-supervised tasks are used: Relative Position (Context Prediction), Colorization, Exemplar, and Motion Segmentation (Motion Masks).
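A sketch of the shared-trunk setup, assuming PyTorch/torchvision; the head structure here is a hypothetical placeholder, since each pretext task has its own head design in the paper:

```python
import torch
import torch.nn as nn
import torchvision

class CommonTrunkModel(nn.Module):
    """Naive shared-trunk architecture: one ResNet-101 trunk truncated after
    block 3, plus one small head per self-supervised task."""

    def __init__(self, num_tasks: int = 4, feat_dim: int = 1024, head_dim: int = 256):
        super().__init__()
        resnet = torchvision.models.resnet101(weights=None)
        # Keep layers up to and including layer3 ("block 3", 23 residual units).
        self.trunk = nn.Sequential(*list(resnet.children())[:-3])
        # Placeholder heads; the real heads are task-specific and more complex.
        self.heads = nn.ModuleList(
            nn.Sequential(
                nn.Conv2d(feat_dim, head_dim, kernel_size=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                nn.Linear(head_dim, head_dim),
            )
            for _ in range(num_tasks)
        )

    def forward(self, x: torch.Tensor, task_id: int) -> torch.Tensor:
        features = self.trunk(x)       # one shared embedding for all tasks
        return self.heads[task_id](features)
```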

NorbertZheng commented 11 months ago

One batch per task, then average across all tasks to get an averaged gradient with lower variance!!!
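A sketch of that update rule (PyTorch assumed; `loss_fns` and `batches` are hypothetical per-task callables and batches): backpropagating the mean of the per-task losses is equivalent to averaging the per-task gradients.

```python
def multi_task_step(model, optimizer, batches, loss_fns):
    """One optimization step: each task gets its own batch, and the gradients
    are averaged across tasks by backpropagating the mean task loss."""
    optimizer.zero_grad()
    total_loss = 0.0
    for task_id, batch in batches.items():
        total_loss = total_loss + loss_fns[task_id](model, batch)
    mean_loss = total_loss / len(batches)
    mean_loss.backward()   # gradient = average of the per-task gradients
    optimizer.step()
    return mean_loss.item()
```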

NorbertZheng commented 11 months ago

Separating Features via Lasso

Idea:

If the features are factorized into different tasks, then the network can select from the discovered feature groups while training on the evaluation tasks.

Model:

Each task involves as few embedding layers as possible.

The representation passed to the head for task $n$ is then: $$\mathrm{rep}_{n} = \sum_{m=1}^{M} \alpha_{n,m}\, Unit_{m},$$ where $N$ is the number of self-supervised tasks, $M$ is the number of residual units in block 3, $\alpha$ is an $N \times M$ weight matrix, and $Unit_{m}$ is the output of residual unit $m$.

To ensure sparsity, an L1 penalty on the entries of $\alpha$ is added to the objective function. A similar $\alpha$ matrix is created for the set of evaluation tasks.
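A sketch of that lasso factorization (PyTorch assumed; tensor shapes are illustrative): each task's representation is a sparse, learned mixture of the block-3 unit outputs.

```python
import torch
import torch.nn as nn

class LassoCombiner(nn.Module):
    """Per-task sparse linear combination of the residual-unit outputs of block 3."""

    def __init__(self, num_tasks: int, num_units: int):
        super().__init__()
        # alpha[n, m] = weight of residual unit m in the representation of task n.
        self.alpha = nn.Parameter(torch.full((num_tasks, num_units), 1.0 / num_units))

    def forward(self, unit_outputs: list[torch.Tensor], task_id: int) -> torch.Tensor:
        # rep_n = sum_m alpha[n, m] * Unit_m
        stacked = torch.stack(unit_outputs, dim=0)            # (M, B, C, H, W)
        weights = self.alpha[task_id].view(-1, 1, 1, 1, 1)    # broadcast over batch/space
        return (weights * stacked).sum(dim=0)

    def l1_penalty(self) -> torch.Tensor:
        # Added to the training objective to push alpha towards sparsity.
        return self.alpha.abs().sum()
```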

NorbertZheng commented 11 months ago

Each task's representation is a combination of different embedding layers (across different abstraction levels!!!).

NorbertZheng commented 11 months ago

Harmonizing Network Inputs

To “harmonize,” the preprocessing of Relative Position (Context Prediction) is replaced with the same preprocessing used for Colorization: images are converted to Lab, the a and b channels are discarded, and the L channel is replicated 3 times.
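A sketch of that harmonized preprocessing, assuming scikit-image for the colour-space conversion (channel scaling details are left out):

```python
import numpy as np
from skimage import color

def harmonize(image_rgb: np.ndarray) -> np.ndarray:
    """Convert an RGB image (H, W, 3) to Lab, discard the a/b channels, and
    replicate the L channel 3 times so the network still receives 3 channels."""
    lab = color.rgb2lab(image_rgb)            # channels: L, a, b
    luminance = lab[..., 0:1]                 # keep only the L channel
    return np.repeat(luminance, 3, axis=-1)   # L replicated into 3 identical channels
```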

NorbertZheng commented 11 months ago

Distributed Network Training

Figure: Distributed training setup.

Training Setup:

64 GPUs are used in parallel, and checkpoints are saved roughly every 2.4K GPU-hours (NVIDIA K40).

NorbertZheng commented 11 months ago

Synchronizing within the same task, while not synchronizing with other tasks.

NorbertZheng commented 11 months ago

Model Fine-Tuning

ImageNet

After self-supervised training, a single linear classification layer (a softmax) is added to the network at the end of block 3 and trained on the full ImageNet training set.
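A sketch of that linear evaluation head (PyTorch assumed; keeping the trunk frozen during this step is an assumption here):

```python
import torch.nn as nn

def build_linear_probe(trunk: nn.Module, feat_dim: int = 1024,
                       num_classes: int = 1000) -> nn.Module:
    """Attach a single linear classifier to the block-3 output for ImageNet
    evaluation; the softmax is implicit in the cross-entropy loss."""
    for p in trunk.parameters():
        p.requires_grad = False              # evaluate the frozen representation
    return nn.Sequential(
        trunk,
        nn.AdaptiveAvgPool2d(1),             # pool the block-3 feature map
        nn.Flatten(),
        nn.Linear(feat_dim, num_classes),
    )
```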

PASCAL VOC 2007 Detection

Faster R-CNN is used, which trains a single network base with multiple heads (common trunk, more stable???) for object proposals, box classification, and box localization.

NYU V2 Depth Prediction

ResNet-50 is used. The block-3 outputs are directly fed into the up-projection layers, i.e., a decoder is appended after the pre-trained model.
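A toy sketch of such a decoder (PyTorch assumed); the paper's up-projection layers are more elaborate than the plain upsample-and-convolve blocks used here:

```python
import torch
import torch.nn as nn

class DepthDecoder(nn.Module):
    """Simple decoder appended to the pre-trained block-3 features to predict
    a one-channel depth map."""

    def __init__(self, in_channels: int = 1024):
        super().__init__()
        self.decode = nn.Sequential(
            nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
            nn.Conv2d(in_channels, 256, kernel_size=3, padding=1), nn.ReLU(),
            nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
            nn.Conv2d(256, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 1, kernel_size=3, padding=1),   # depth map
        )

    def forward(self, block3_features: torch.Tensor) -> torch.Tensor:
        return self.decode(block3_features)
```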

NorbertZheng commented 11 months ago

Experimental Results

Individual Self-Supervised Training Performance

Figure: Individual Self-Supervised Training Performance.

Results:

Figure: Comparison of performance for different self-supervised methods over time.

NorbertZheng commented 11 months ago

Naïve Multi-Task Combination of Self-Supervision Tasks

Figure: Comparison of various combinations of self-supervised tasks. RP: Relative Position (Context Prediction); Col: Colorization; Ex: Exemplar Nets; MS: Motion Segmentation (Motion Masks). Metrics: ImageNet: Recall@5; PASCAL: mAP; NYU: % pixels below 1.25.

NorbertZheng commented 11 months ago

More tasks, better downstream performance!!!

NorbertZheng commented 11 months ago

Harmonization

Figure: Comparison of methods with and without harmonization. H: harmonization.

NorbertZheng commented 11 months ago

Input data formatting matters (RP vs. RP/H on ImageNet), but only when the dataset is relatively small; when the dataset is large, the gap decreases (RP vs. RP/H on PASCAL/NYU)!!!

NorbertZheng commented 11 months ago

Lasso

Figure: Comparison of performance with and without the lasso technique for factorizing representations, for a network trained on all four self-supervised tasks for 16.8K GPU-hours.

There are four cases: no lasso, lasso only on the evaluation tasks, lasso only at pre-training time, and lasso in both self-supervised training and evaluation.

NorbertZheng commented 11 months ago

The gap between ImageNet pre-trained and self-supervision pre-trained with four tasks is nearly closed for the VOC detection evaluation, and completely closed for NYU depth.

NorbertZheng commented 11 months ago

Reference