Using MoCo v2 as Teacher, Knowledge Distillation for Student, in VIPriors Challenge.
VIPriors Challenge (Image from 2020 ECCV Workshop VIPriors Challenge).
Distilling Visual Priors from Self-Supervised Learning (MoCo v2 + Distillation), by Tongji University and Megvii Research Nanjing, 2020 ECCV Workshop VIPriors Challenge
Proposed Framework.
There are two phases: Phase-1 trains the teacher with self-supervised learning (MoCo v2), and Phase-2 trains the student by knowledge distillation from that teacher.
In a data-deficient dataset, the maximum size of the negative queue is limited, so the authors propose to replace the standard contrastive loss with a margin loss.
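The paper's exact margin formulation is not reproduced in this summary; the following is a minimal sketch of a hinge-style margin loss over MoCo-style query/key similarities, where the names `q`, `k_pos`, `queue` and the margin value are illustrative assumptions rather than the authors' code.

```python
import torch
import torch.nn.functional as F

def margin_contrastive_loss(q, k_pos, queue, margin=0.6):
    """Hinge-style margin loss over query/key similarities (illustrative sketch).

    q:      (N, D) query features from the online encoder
    k_pos:  (N, D) positive key features from the momentum encoder
    queue:  (K, D) negative key features stored in the MoCo queue
    """
    q = F.normalize(q, dim=1)
    k_pos = F.normalize(k_pos, dim=1)
    queue = F.normalize(queue, dim=1)

    pos_sim = (q * k_pos).sum(dim=1, keepdim=True)   # (N, 1) similarity to the positive
    neg_sim = q @ queue.t()                          # (N, K) similarities to negatives

    # Penalize every negative that comes within `margin` of the positive.
    # Unlike InfoNCE, each negative contributes independently, which is one way
    # a margin objective can be less sensitive to the queue size K.
    return F.relu(neg_sim - pos_sim + margin).mean()
```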
The distillation process can be seen as a regularization that keeps the student from overfitting the data-deficient dataset while transferring the visual priors learned by the self-supervised teacher.
Following OFD, the distillation loss is:

$$\mathcal{L}_{distill} = d_{p}\big(F_{t},\, r(F_{s})\big),$$

where $F_{t}$ and $F_{s}$ are the teacher and student feature maps, $r(\cdot)$ is a connector that maps the student features into the teacher's feature space, and the distance metric $d_{p}$ is the $l_{2}$-distance in this paper.
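As a sketch of this term, assuming (as is common in OFD-style distillation) that the connector $r(\cdot)$ is a 1×1 convolution plus batch norm and that $d_{p}$ is the plain $l_{2}$-distance stated above; the class name and layer choices are illustrative, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureDistillLoss(nn.Module):
    """OFD-style feature distillation: d_p(F_t, r(F_s)) with d_p = l2-distance."""

    def __init__(self, student_channels, teacher_channels):
        super().__init__()
        # Connector r(.): maps student feature maps to the teacher's channel dimension.
        self.connector = nn.Sequential(
            nn.Conv2d(student_channels, teacher_channels, kernel_size=1, bias=False),
            nn.BatchNorm2d(teacher_channels),
        )

    def forward(self, f_student, f_teacher):
        # Teacher features come from the frozen, self-supervised (MoCo v2) model.
        f_teacher = f_teacher.detach()
        f_student = self.connector(f_student)
        # Plain l2-distance between the aligned feature maps.
        return F.mse_loss(f_student, f_teacher)
```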
Along with a cross-entropy loss $\mathcal{L}_{cls}$ for classification, the final loss function for the student model is:

$$\mathcal{L} = \mathcal{L}_{cls} + \lambda\,\mathcal{L}_{distill},$$

where $\lambda=10^{-4}$. 100 epochs are used for fine-tuning.
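A hedged sketch of how this Phase-2 objective could be assembled, reusing the `FeatureDistillLoss` module from the previous sketch. The model API (each network returning logits plus an intermediate feature map) is a hypothetical convention, not the paper's code.

```python
import torch
import torch.nn.functional as F

LAMBDA = 1e-4  # weight of the distillation term, as stated in the article

def student_training_step(student, teacher, distill_loss, images, labels):
    """One fine-tuning step: cross-entropy + lambda * feature distillation."""
    # Hypothetical model API: each model returns (logits, intermediate feature map).
    logits, f_student = student(images)
    with torch.no_grad():
        _, f_teacher = teacher(images)   # frozen MoCo v2 teacher

    loss_cls = F.cross_entropy(logits, labels)
    loss_distill = distill_loss(f_student, f_teacher)
    return loss_cls + LAMBDA * loss_distill
```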
There are still 1,000 classes, but only 50 images per class in each of the train/val/test splits, resulting in 150,000 images in total.
Finally, by combining Phase-1 and Phase-2, the proposed pipeline achieves a 16.7% gain in top-1 accuracy over the supervised baseline.
Linear Classifier.
The proposed margin loss is less sensitive to the number of negatives and can be used in a data-deficient setting.
Bag of Tricks.
Several other tricks and stronger backbone models are used for better performance: larger input resolution, AutoAugment, ResNeXt-101, label smoothing (as in Inception-v3), 10-crop testing, and a 2-model ensemble.
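For the test-time part of these tricks, here is a hedged sketch of 10-crop evaluation combined with a 2-model ensemble, using torchvision's `TenCrop`; the resolutions (480/448) and the averaging scheme are placeholder assumptions, not values reported in the paper.

```python
import torch
from torchvision import transforms
from torchvision.transforms import functional as TF

ten_crop = transforms.Compose([
    transforms.Resize(480),            # larger test resolution (placeholder value)
    transforms.TenCrop(448),           # 4 corners + center, plus horizontal flips
    transforms.Lambda(lambda crops: torch.stack([TF.to_tensor(c) for c in crops])),
])

@torch.no_grad()
def ensemble_predict(models, image):
    """Average softmax scores over 10 crops and over the model ensemble."""
    crops = ten_crop(image)                      # (10, C, H, W)
    probs = []
    for model in models:                         # e.g. the 2-model ensemble
        logits = model(crops)                    # (10, num_classes)
        probs.append(logits.softmax(dim=1).mean(dim=0))
    return torch.stack(probs).mean(dim=0)        # (num_classes,)
```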
Sik-Ho Tsang. Brief Review — Distilling Visual Priors from Self-Supervised Learning.