NorbertZheng / read-papers

My paper reading notes.

Sik-Ho Tang | Review -- Unsupervised Learning of Visual Representations using Videos. #121

Closed NorbertZheng closed 1 year ago

NorbertZheng commented 1 year ago

Sik-Ho Tang. Review — Unsupervised Learning of Visual Representations using Videos.

NorbertZheng commented 1 year ago

Overview

image Unsupervised Tracking in Videos

Unsupervised Learning of Visual Representations using Videos. Wang & Gupta, 2015 ICCV (over 900 citations), Robotics Institute, Carnegie Mellon University. Contrastive Learning, Unsupervised Learning, Object Detection.

The authors start the paper by asking whether strong visual representations really require millions of semantically labeled images.

In this paper, visual tracking in unlabeled videos provides the supervision instead: two patches connected by a track likely show the same object, so they should lie close together in feature space.

NorbertZheng commented 1 year ago

Approach Overview

image Approach Overview

NorbertZheng commented 1 year ago

Triplet Contrastive Learning!!!

NorbertZheng commented 1 year ago

Patch Mining in Videos

image Patch Mining in Videos

Before training, we need to mine the patches. A two-step approach is used.

In the first step, SURF [1] interest points are detected, and Improved Dense Trajectories (IDT) [50] are used to obtain the motion of each SURF point. Since IDT applies a homography-estimation (video stabilization) step, it reduces the problems caused by camera motion.

Given the trajectories of the SURF interest points, a point is classified as moving if its flow magnitude is more than 0.5 pixels.
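The moving-point test above can be sketched as a one-line threshold on the flow magnitude (the function name and flow representation are illustrative, not from the paper's code):

```python
import math

def is_moving(flow_x, flow_y, threshold=0.5):
    """A SURF point counts as moving if its flow magnitude exceeds 0.5 px."""
    return math.hypot(flow_x, flow_y) > threshold
```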

Frames are rejected if they contain too few moving points, or if too many points are moving (which suggests camera motion or a scene-wide change rather than a single moving object).

In the second step, a sliding window of size h×w (227×227 within a 448×600 frame) is moved over the frame, and the bounding box containing the largest number of moving SURF interest points is selected as the best bounding box.
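The sliding-window search can be sketched as follows; the stride and the point representation are assumptions for illustration, not details from the paper:

```python
def best_window(points, frame_w=600, frame_h=448, win_w=227, win_h=227, stride=8):
    """Slide a win_w x win_h window over the frame and return the (x, y)
    position covering the most moving SURF points, with its count."""
    best_xy, best_count = (0, 0), -1
    for y in range(0, frame_h - win_h + 1, stride):
        for x in range(0, frame_w - win_w + 1, stride):
            count = sum(1 for (px, py) in points
                        if x <= px < x + win_w and y <= py < y + win_h)
            if count > best_count:
                best_xy, best_count = (x, y), count
    return best_xy, best_count
```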

The patch is then tracked forward through the video, and the tracked patch acts as the similar patch to the query patch in the triplet.

image Examples of patch pairs obtained via patch mining in the videos

Finally, millions of pairs are generated, as shown above.

Three different networks are trained separately using 1.5M, 5M and 8M training samples.

NorbertZheng commented 1 year ago

Siamese Triplet Network

image Siamese Triplet Network

Network Architecture

A Siamese-triplet network consists of three base networks that share the same parameters.

Thus the final output of each base network is a 1024-dimensional feature $f(\cdot)$.

NorbertZheng commented 1 year ago

Ranking Loss Function

image Top response regions for the pool5 neurons of our unsupervised-CNN.

Given an image $X$ as input to the network, we obtain its feature in the final layer as $f(X)$. The distance between two image patches $X_{1}$, $X_{2}$ is then defined via the cosine similarity in feature space:

$$D(X_{1}, X_{2}) = 1 - \frac{f(X_{1}) \cdot f(X_{2})}{\|f(X_{1})\| \, \|f(X_{2})\|}$$

Formally, $X_{i}$ is the original query patch (the first patch in the tracked frames), $X_{i}^{+}$ is the tracked patch, and $X_{i}^{-}$ is a random patch from a different video. The network is trained to enforce:

$$D(X_{i}, X_{i}^{+}) + M < D(X_{i}, X_{i}^{-})$$

where $M=0.5$ is the gap parameter between the two distances.

The objective function is:

$$\min_{W} \; \frac{\lambda}{2} \|W\|^{2} + \sum_{i=1}^{N} \max\left\{0, \; D(X_{i}, X_{i}^{+}) - D(X_{i}, X_{i}^{-}) + M\right\}$$

where $N$ is the number of triplets and $\lambda$ is a constant for weight decay, set to $\lambda=0.0005$.
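The cosine distance and the per-triplet hinge term above can be sketched in plain Python, with lists standing in for the 1024-dimensional features $f(X)$:

```python
import math

def cosine_distance(f1, f2):
    """D(X1, X2) = 1 - cosine similarity of the two feature vectors."""
    dot = sum(a * b for a, b in zip(f1, f2))
    norm = math.sqrt(sum(a * a for a in f1)) * math.sqrt(sum(b * b for b in f2))
    return 1.0 - dot / norm

def ranking_loss(f_query, f_pos, f_neg, margin=0.5):
    """Hinge loss: penalize when the negative is not at least `margin`
    farther from the query than the tracked (positive) patch."""
    return max(0.0, cosine_distance(f_query, f_pos)
                    - cosine_distance(f_query, f_neg) + margin)
```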

NorbertZheng commented 1 year ago

Hard Negative Mining for Triplet Sampling

After 10 epochs of training with randomly selected negative data, negative patches are instead selected such that the loss is maximized.

Specifically, for each pair $\{X_{i}, X_{i}^{+}\}$, the loss is calculated for every other patch in the batch ($B=100$) as a candidate negative, and the top $K=4$ negative patches with the highest losses are selected.

NorbertZheng commented 1 year ago

Model Fine-Tuning

image After fine-tuning on PASCAL VOC 2012, these filters become quite strong

Straightforward Way

One straightforward approach is to directly apply the ranking model as a pre-trained network for the target task. The fully connected layers are initialized randomly.

Iterative Fine-Tuning Scheme

Here, the convolutional parameters of the fine-tuned network are transferred back to the unsupervised ranking task for re-adaptation, after which the network is fine-tuned on the target task again.

After two iterations of this approach, the network converges.

Model Ensemble

YouTube hosts billions of videos, which opens up the possibility of training multiple CNNs on different sets of data.

Once these CNNs are trained, the fc7 features from each of them are concatenated to train the final SVM.
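The feature-concatenation step can be sketched as below; `nets` is a list of hypothetical fc7 feature extractors, each returning a feature vector as a list:

```python
def ensemble_features(nets, image):
    """Concatenate the fc7 features of several independently trained
    networks into one vector for the downstream per-class SVM."""
    feats = []
    for net in nets:
        feats.extend(net(image))  # append this network's fc7 feature
    return feats
```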

NorbertZheng commented 1 year ago

Experimental Results

image mean Average Precision (mAP) on VOC 2012. “external” column shows the number of patches used to pre-train unsupervised-CNN.

Single Model

The detection pipeline introduced in R-CNN is followed, where the CNN is fine-tuned on the target detection data.

The fine-tuned CNN was then used to extract features, followed by training an SVM for each object class.

As a baseline, a network trained from scratch on the VOC 2012 dataset obtains 44% mAP.

Using the proposed unsupervised network pre-trained with 1.5M pairs of patches and then fine-tuned on VOC 2012, a mAP of 46.2% is obtained (unsup+ft, external data = 1.5M).

Using more data, 5M and 8M patches, for pre-training followed by fine-tuning, 47% and 47.5% mAP are achieved, respectively.

NorbertZheng commented 1 year ago

Model Ensemble

By ensembling the two fine-tuned networks pre-trained with 1.5M and 5M patches, a mAP of 50.5% is obtained (unsup+ft (2 ensemble)), a boost of 3.5% compared to the single model.

Finally, all three networks, pre-trained with 1.5M, 5M and 8M patches respectively, are ensembled, giving a further boost to 52% mAP (unsup+ft (3 ensemble)).

NorbertZheng commented 1 year ago

ImageNet Pretrained Model

Note that ImageNet is a labelled dataset. When the ImageNet pre-trained model is used, 50.1% mAP is obtained (RCNN 70K). Ensembling two such networks gives 53.6% mAP (RCNN 70K (2 ensemble)); ensembling three gives 54.4% mAP.

NorbertZheng commented 1 year ago

Pre-training needs multi-task!!!

NorbertZheng commented 1 year ago

Iterative Fine-Tuning Scheme

Given the fine-tuned model that used 5M patches for pre-training (unsup+ft, external = 5M), it is re-adapted to the unsupervised triplet task and then fine-tuned on VOC 2012 again. The final result for this single model is 48% mAP (unsup + iterative ft), 1% better than the initial fine-tuned network.

NorbertZheng commented 1 year ago

Unsupervised Network Without Fine-Tuning

A mAP of 26.1% is obtained using the proposed unsupervised network without fine-tuning (trained with 8M data).

The ensemble of two unsupervised networks (trained with 5M and 8M data) gets a mAP of 28.2%.

As a comparison, the ImageNet pre-trained network without fine-tuning gets a mAP of 40.4%.

NorbertZheng commented 1 year ago

We need fine-tuning to obtain better representations!!!

NorbertZheng commented 1 year ago

The successful implementation opens up a new space for designing unsupervised learning algorithms for CNN training. (There are also results for surface normal estimation; please feel free to read the paper if interested.)

NorbertZheng commented 1 year ago

Reference