NorbertZheng / read-papers

My paper reading notes.
MIT License

Sik-Ho Tang | Review -- Multi-task Self-Supervised Visual Learning. #128

Closed NorbertZheng closed 1 year ago

NorbertZheng commented 1 year ago

Sik-Ho Tang. Review — Multi-task Self-Supervised Visual Learning.

NorbertZheng commented 1 year ago

Overview

Pretrain Using Multiple Pretext Tasks to Improve Downstream Task Accuracy.

Multi-task Self-Supervised Visual Learning, Doersch ICCV'17, by DeepMind and VGG, University of Oxford. 2017 ICCV, over 400 citations. Self-Supervised Learning, Representation Learning, Image Classification, Object Detection, Depth Prediction.

NorbertZheng commented 1 year ago

A joint loss that includes 4 task losses???

NorbertZheng commented 1 year ago

Multi-Task Network

Figure: The structure of the multi-task network based on ResNet-101, with block 3 having 23 residual units. (a) Naïve shared-trunk approach, where each “head” is attached to the output of block 3. (b) The lasso architecture, where each “head” receives a linear combination of unit outputs within block 3, weighted by the matrix $\alpha$, which is trained to be sparse.

Three architectures are described:

NorbertZheng commented 1 year ago

Common Trunk

Model:

One embedding for all tasks!!!

Each task has a separate loss, and has extra layers in a “head,” which may have a complicated structure.

Implementation:

4 Self-supervised tasks are used: Relative Position (Context Prediction), Colorization, Exemplar, and Motion Segmentation (Motion Masks).
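The common-trunk idea can be sketched in a few lines of numpy: one shared embedding feeds a separate head (and loss) per task, and the joint objective sums the task losses. All sizes, maps, and loss choices below are made-up stand-ins, not the paper's actual ResNet-101 trunk or task losses:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes; the paper's trunk is ResNet-101 up to block 3.
EMBED_DIM, HEAD_DIM = 64, 10
TASKS = ["relative_position", "colorization", "exemplar", "motion_segmentation"]

# Shared trunk: one embedding for all tasks (here a single random linear map).
W_trunk = rng.normal(size=(128, EMBED_DIM))

# One separate head per task.
W_heads = {t: rng.normal(size=(EMBED_DIM, HEAD_DIM)) for t in TASKS}

def forward(x):
    """The shared embedding feeds every task-specific head."""
    z = np.tanh(x @ W_trunk)                   # common trunk output
    return {t: z @ W_heads[t] for t in TASKS}  # per-task predictions

def joint_loss(outputs, targets):
    """Joint objective: sum of the four task losses (MSE as a stand-in)."""
    return sum(np.mean((outputs[t] - targets[t]) ** 2) for t in TASKS)

x = rng.normal(size=(8, 128))
targets = {t: rng.normal(size=(8, HEAD_DIM)) for t in TASKS}
outs = forward(x)
loss = joint_loss(outs, targets)
```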

NorbertZheng commented 1 year ago

One batch per task; averaging the per-task gradients across all tasks yields a lower-variance gradient estimate!!!
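A minimal numpy sketch of that batching scheme, using a toy one-parameter loss per task (all losses and numbers here are made up for illustration): each task computes a gradient on its own batch, and the update uses the average.

```python
import numpy as np

# One scalar parameter; task t's loss is f_t(w) = (w - c_t)^2, purely illustrative.
centers = np.array([1.0, 2.0, 3.0, 4.0])  # one "target" per self-supervised task

def task_gradient(w, c):
    """Gradient of a single task's loss, computed on that task's own batch."""
    return 2.0 * (w - c)

w = 0.0
# Each task contributes a gradient from its own batch; averaging the
# per-task gradients gives a lower-variance update direction.
grads = np.array([task_gradient(w, c) for c in centers])
avg_grad = grads.mean()
w -= 0.1 * avg_grad
```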

NorbertZheng commented 1 year ago

Separating Features via Lasso

Idea:

If the features are factorized into different tasks, then the network can select from the discovered feature groups while training on the evaluation tasks.

Model:

Each task involves as few embedding layers as possible.

The representation passed to the head for task $n$ is then $\sum_{m=1}^{M} \alpha_{n,m}\,\mathrm{Unit}_{m}$, where $N$ is the number of self-supervised tasks, $M$ is the number of residual units in block 3, and $\mathrm{Unit}_{m}$ is the output of residual unit $m$.

To ensure sparsity, an L1 penalty on the entries of $\alpha$ is added to the objective function. A similar $\alpha$ matrix is created for the set of evaluation tasks.
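The lasso combination above is straightforward to sketch in numpy: task $n$'s representation is the $\alpha$-weighted sum of the unit outputs, and the L1 penalty on $\alpha$ pushes each task toward selecting few units. The feature dimension and penalty weight below are made up:

```python
import numpy as np

rng = np.random.default_rng(2)

N, M, D = 4, 23, 16   # tasks, residual units in block 3, feature dim (D is made up)
units = rng.normal(size=(M, D))          # Unit_m outputs for one input
alpha = np.abs(rng.normal(size=(N, M)))  # per-task mixing weights (trained in practice)

def task_representation(n):
    """Representation for task n: sum over m of alpha[n, m] * Unit_m."""
    return alpha[n] @ units

def l1_penalty(a, lam=1e-3):
    """L1 penalty on alpha, encouraging each task to select few units."""
    return lam * np.abs(a).sum()

rep = task_representation(0)
penalty = l1_penalty(alpha)
```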

NorbertZheng commented 1 year ago

Each task's representation is a combination of different embedding layers (across different abstraction levels!!!)

NorbertZheng commented 1 year ago

Harmonizing Network Inputs

To “harmonize,” relative position (Context Prediction)’s preprocessing is replaced with the same preprocessing used for Colorization: images are converted to Lab, and the a and b channels are discarded (the L channel is replicated 3 times).
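The harmonization step can be sketched as follows. The paper converts to Lab and keeps only L; as a dependency-free stand-in, this sketch uses the Rec. 601 luma approximation instead of a true Lab L channel:

```python
import numpy as np

def harmonize(rgb):
    """Drop color, keep a single lightness channel, replicate it 3x.

    Stand-in for the paper's Lab conversion: Rec. 601 luma instead of Lab L.
    """
    luma = 0.299 * rgb[..., 0] + 0.587 * rgb[..., 1] + 0.114 * rgb[..., 2]
    return np.repeat(luma[..., None], 3, axis=-1)

img = np.random.default_rng(3).uniform(size=(4, 4, 3))
out = harmonize(img)
```

All tasks then see the same single-channel input, so no task can exploit chromatic cues the others lack.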

NorbertZheng commented 1 year ago

Distributed Network Training

Figure: Distributed training setup.

Training Setup:

Results:

64 GPUs (NVIDIA K40) are used in parallel, and checkpoints are saved roughly every 2.4K GPU-hours.

NorbertZheng commented 1 year ago

Synchronizing within the same task, while not synchronizing across different tasks.

NorbertZheng commented 1 year ago

Model Fine-Tuning

ImageNet

After self-supervised training, a single linear classification layer (a softmax) is added to the network at the end of block 3, and it is trained on the full ImageNet training set.
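This linear-evaluation protocol amounts to fitting a softmax classifier on frozen features. A minimal numpy sketch, with random vectors standing in for the frozen block-3 features (all sizes and the learning rate are made up):

```python
import numpy as np

rng = np.random.default_rng(4)

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

# Stand-ins for frozen block-3 features and labels of a tiny 3-class problem.
feats = rng.normal(size=(60, 8))
labels = rng.integers(0, 3, size=60)
Y = np.eye(3)[labels]

# Only the single linear layer W is trained; the backbone stays fixed.
W = np.zeros((8, 3))
for _ in range(200):
    P = softmax(feats @ W)
    grad = feats.T @ (P - Y) / len(feats)  # cross-entropy gradient w.r.t. W
    W -= 0.5 * grad

acc = (softmax(feats @ W).argmax(axis=1) == labels).mean()
```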

PASCAL VOC 2007 Detection

Fast R-CNN is used, which trains a single network base with multiple heads (common trunk, more stable???) for object proposals, box classification, and box localization.

NYU V2 Depth Prediction

ResNet-50 is used. The block 3 outputs are fed directly into the up-projection layers, i.e., decoders are appended after the pre-trained model.

NorbertZheng commented 1 year ago

Experimental Results

Individual Self-Supervised Training Performance

Figure: Individual self-supervised training performance.

Results:

Figure: Comparison of performance for different self-supervised methods over time.

NorbertZheng commented 1 year ago

Naïve Multi-Task Combination of Self-Supervision Tasks

Figure: Comparison of various combinations of self-supervised tasks. RP: Relative Position (Context Prediction); Col: Colorization; Ex: Exemplar Nets; MS: Motion Segmentation (Motion Masks). Metrics: ImageNet: Recall@5; PASCAL: mAP; NYU: % pixels below 1.25.

NorbertZheng commented 1 year ago

More tasks!!!

NorbertZheng commented 1 year ago

Harmonization

Figure: Comparison of methods with and without harmonization. H: harmonization.

NorbertZheng commented 1 year ago

Data format matters (RP vs. RP/H on ImageNet), but only when the dataset is relatively small; when the dataset is large, the gap shrinks (RP vs. RP/H on PASCAL/NYU)!!!

NorbertZheng commented 1 year ago

Lasso

Figure: Comparison of performance with and without the lasso technique for factorizing representations, for a network trained on all four self-supervised tasks for 16.8K GPU-hours.

There are four cases: no lasso, lasso only on the evaluation tasks, lasso only at pre-training time, and lasso in both self-supervised training and evaluation.

NorbertZheng commented 1 year ago

The gap between ImageNet pre-trained and self-supervision pre-trained with four tasks is nearly closed for the VOC detection evaluation, and completely closed for NYU depth.

NorbertZheng commented 1 year ago

Reference