Pretrain Using Multiple Pretext Tasks to Improve Downstream Task Accuracy.
Multi-task Self-Supervised Visual Learning (Doersch, ICCV'17), by DeepMind and VGG, University of Oxford. 2017 ICCV, over 400 citations. Self-Supervised Learning, Representation Learning, Image Classification, Object Detection, Depth Prediction.
A joint loss that includes 4 task losses???
The structure of the multi-task network, based on ResNet-101, with block 3 having 23 residual units. a) The naïve shared-trunk approach, where each “head” is attached to the output of block 3. b) The lasso architecture, where each “head” receives a linear combination of unit outputs within block 3, weighted by the matrix $\alpha$, which is trained to be sparse.
Three architectures are described:
Model:
One embedding for all tasks!!!
Each task has a separate loss, and has extra layers in a “head,” which may have a complicated structure.
Implementation:
4 self-supervised tasks are used: Relative Position (Context Prediction), Colorization, Exemplar, and Motion Segmentation (Motion Masks).
One task per batch; gradients are averaged across all tasks to obtain a lower-variance joint gradient (see the sketch below)!!!
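Below is a minimal PyTorch sketch of the naïve shared-trunk setup, under assumed head output sizes and with the heads simplified to single linear layers; `SharedTrunkMultiTask`, `joint_step`, and the task keys are illustrative names, not the paper's code.

```python
import torch
import torch.nn as nn
import torchvision


class SharedTrunkMultiTask(nn.Module):
    def __init__(self, head_out=None):
        super().__init__()
        # Assumed output sizes per pretext task; the real heads are more complex.
        head_out = head_out or {"relative_position": 8, "colorization": 313,
                                "exemplar": 100, "motion_segmentation": 1}
        r = torchvision.models.resnet101(weights=None)
        # "Block 3" corresponds to layer3 in torchvision's ResNet-101 (23 residual units).
        self.trunk = nn.Sequential(r.conv1, r.bn1, r.relu, r.maxpool,
                                   r.layer1, r.layer2, r.layer3)
        feat_dim = 1024  # channel count at the output of layer3
        self.heads = nn.ModuleDict({t: nn.Linear(feat_dim, n) for t, n in head_out.items()})

    def forward(self, x, task):
        feat = self.trunk(x).mean(dim=(2, 3))  # one shared embedding for every task
        return self.heads[task](feat)          # task-specific "head" on top


def joint_step(model, optimizer, batches, loss_fns):
    """batches: {task: (inputs, targets)}; each batch contains a single task."""
    optimizer.zero_grad()
    for task, (x, y) in batches.items():
        # Scale each task loss so backward() accumulates the task-averaged gradient.
        loss = loss_fns[task](model(x, task), y) / len(batches)
        loss.backward()
    optimizer.step()
```

The joint objective is then simply the average of the four task losses, and `joint_step` realizes the "one task per batch, average the gradients" idea.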
Idea:
If the features are factorized into different tasks, then the network can select from the discovered feature groups while training on the evaluation tasks.
Model:
Each task involves as few embedding layers as possible.
The representation passed to the head for task $n$ is then $\sum_{m=1}^{M} \alpha_{n,m} \, Unit_{m}$, where $\alpha$ is an $N \times M$ matrix of learned weights, $N$ is the number of self-supervised tasks, $M$ is the number of residual units in block 3, and $Unit_{m}$ is the output of residual unit $m$.
To ensure sparsity, an L1 penalty on the entries of $\alpha$ is added to the objective function. A similar $\alpha$ matrix is created for the set of evaluation tasks.
Each task's representation is a combination of different embedding layers (across different abstraction levels!!!); see the sketch below.
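A minimal sketch of the lasso combination, assuming all residual units within block 3 produce same-shaped outputs (true for ResNet-101's layer3); `LassoCombiner` and its hyperparameters are illustrative, not the paper's implementation.

```python
import torch
import torch.nn as nn


class LassoCombiner(nn.Module):
    """Give task n the representation sum_m alpha[n, m] * Unit_m."""

    def __init__(self, units, num_tasks):
        super().__init__()
        self.units = nn.ModuleList(units)  # the M residual units of block 3
        # alpha is an N x M matrix of mixing weights, one row per task.
        self.alpha = nn.Parameter(torch.full((num_tasks, len(units)), 1.0 / len(units)))

    def forward(self, x, task_idx):
        outputs = []
        for unit in self.units:
            x = unit(x)              # run block 3 sequentially, keeping every unit output
            outputs.append(x)
        stacked = torch.stack(outputs)                       # (M, B, C, H, W)
        weights = self.alpha[task_idx].view(-1, 1, 1, 1, 1)  # alpha[n, :]
        return (weights * stacked).sum(dim=0)                # sum_m alpha[n, m] * Unit_m

    def l1_penalty(self, coeff=1e-4):
        # Added to the training objective to push alpha toward sparsity.
        return coeff * self.alpha.abs().sum()
```

For example, `LassoCombiner(list(torchvision.models.resnet101(weights=None).layer3), num_tasks=4)` would mix the 23 unit outputs of block 3; the evaluation tasks get their own $\alpha$ rows in the same way.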
To “harmonize,” Relative Position (Context Prediction)’s preprocessing is replaced with the same preprocessing used for Colorization: images are converted to Lab, and the a and b channels are discarded (the L channel is replicated three times).
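A short sketch of this harmonized preprocessing, assuming inputs are RGB floats in [0, 1] and that the L channel is rescaled to [0, 1] (the exact scaling is an assumption):

```python
import numpy as np
from skimage import color


def harmonize(rgb_image):
    """rgb_image: float array in [0, 1] with shape (H, W, 3)."""
    lab = color.rgb2lab(rgb_image)           # convert to Lab; L is in [0, 100]
    luminance = lab[..., 0:1] / 100.0        # keep only L, discard a and b
    return np.repeat(luminance, 3, axis=-1)  # replicate L into 3 channels
```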
Training Setup:
Distributed training setup.
64 GPUs are used in parallel, and checkpoints are saved roughly every 2.4K GPU-hours (NVIDIA K40).
Synchronization happens within the same task, but not across tasks.
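As a hypothetical illustration only (not the paper's infrastructure), per-task synchronization could be expressed with `torch.distributed` by all-reducing gradients inside a process group that contains just one task's workers:

```python
import torch.distributed as dist


def sync_gradients_within_task(model, task_group):
    """All-reduce gradients across the workers of a single task only.

    `task_group` is a process group (from dist.new_group) holding that task's
    workers; workers of other tasks belong to other groups and are untouched.
    """
    world = dist.get_world_size(group=task_group)
    for p in model.parameters():
        if p.grad is not None:
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM, group=task_group)
            p.grad /= world  # average instead of sum
```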
After self-supervised training, a single linear classification layer (a softmax) is added to the network at the end of block 3 and trained on the full ImageNet training set (see the sketch after the three evaluation setups).
Faster R-CNN is used, which trains a single network base with multiple heads (common trunk, more stable???) for object proposals, box classification, and box localization.
For NYU depth prediction, the ResNet-50-based up-projection architecture is used: the block 3 outputs are fed directly into the up-projection layers, i.e., other decoders are appended after the pre-trained model.
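A minimal sketch of the ImageNet linear evaluation, assuming the pre-trained trunk is kept frozen and its block-3 feature maps are globally average-pooled before the classifier (both details are assumptions); `build_linear_probe` is an illustrative name:

```python
import torch.nn as nn


def build_linear_probe(pretrained_trunk, num_classes=1000, feat_dim=1024):
    # Freeze the self-supervised trunk; only the linear layer below is trained.
    for p in pretrained_trunk.parameters():
        p.requires_grad = False
    return nn.Sequential(
        pretrained_trunk,                  # outputs block-3 feature maps (B, 1024, H, W)
        nn.AdaptiveAvgPool2d(1),           # pool each feature map to a single value
        nn.Flatten(),
        nn.Linear(feat_dim, num_classes),  # the single classification layer
        # softmax is applied implicitly by the cross-entropy loss during training
    )
```

The detection and depth evaluations follow the same pattern of appending task-specific heads or decoders to the pre-trained trunk.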
Results:
Individual Self-Supervised Training Performance.
Comparison of performance for different self-supervised methods over time.
Comparison of various combinations of self-supervised tasks. RP: Relative Position (Context Prediction); Col: Colorization; Ex: Exemplar Nets; MS: Motion Segmentation (Motion Masks). Metrics: ImageNet: Recall@5; PASCAL: mAP; NYU: % pixels below 1.25.
More tasks lead to better performance!!!
Comparison of methods with and without harmonization, H: harmonization.
Data formation matters (RP compared to RP/H on ImageNet), but only when the dataset is relatively small; when the dataset is large, the gap decreases (RP compared to RP/H on PASCAL/NYU)!!!
Comparison of performance with and without the lasso technique for factorizing representations, for a network trained on all four self-supervised tasks for 16.8K GPU-hours.
There are four cases: no lasso, lasso only on the evaluation tasks, lasso only at pre-training time, and lasso in both self-supervised training and evaluation.
The gap between ImageNet pre-training and self-supervised pre-training with four tasks is nearly closed for the VOC detection evaluation, and completely closed for NYU depth.
Sik-Ho Tsang. Review — Multi-task Self-Supervised Visual Learning.