Unsupervised Tracking in Videos
Unsupervised Learning of Visual Representations using Videos. Wang ICCV’15, by Robotics Institute, Carnegie Mellon University. 2015 ICCV, Over 900 Citations. Contrastive Learning, Unsupervised Learning, Object Detection.
The authors motivate the paper by asking how visual representations can be learned without semantic labels. In this paper, visual tracking in unlabeled videos supplies the supervisory signal: two patches connected by a track should have similar representations in deep feature space.
Approach Overview
Triplet Contrastive Learning!!!
Patch Mining in Videos
Before training, we need to mine the patches. A two-step approach is used.
In the first step, SURF [1] interest points are detected, and Improved Dense Trajectories (IDT) [50] is used to obtain the motion of each SURF point. Since IDT applies a homography-estimation (video stabilization) step, it reduces the problems caused by camera motion.
Given the trajectories of SURF interest points, these points are classified as moving if the flow magnitude is more than 0.5 pixels.
Frames are rejected if too few of the SURF points are moving (likely noise) or if most of them are moving (likely dominated by residual camera motion).
In the second step, a sliding window of size h×w (227×227 within a 448×600 frame) is applied to find the bounding box that contains the largest number of moving SURF interest points. This box gives the query patch; it is then tracked forward through the video, and the tracked patch acts as the similar patch to the query patch in the triplet.
Examples of patch pairs we obtain via patch mining in the videos
Finally, millions of pairs are generated, as shown above.
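The two-step mining above can be sketched with NumPy alone. This is a minimal illustration, not the paper's implementation: `find_best_box` and its stride are hypothetical, and real SURF points and IDT flow magnitudes are replaced by toy arrays. Only the two stated rules are kept: a point is "moving" if its flow magnitude exceeds 0.5 pixels, and the best 227×227 window is the one covering the most moving points.

```python
import numpy as np

def find_best_box(points, moving_mask, frame_h=448, frame_w=600,
                  box_h=227, box_w=227, stride=16):
    """Slide a box_h x box_w window over the frame and return the
    window containing the most moving interest points."""
    best_count, best_box = -1, None
    for y in range(0, frame_h - box_h + 1, stride):
        for x in range(0, frame_w - box_w + 1, stride):
            inside = ((points[:, 0] >= x) & (points[:, 0] < x + box_w) &
                      (points[:, 1] >= y) & (points[:, 1] < y + box_h))
            count = int(np.sum(inside & moving_mask))
            if count > best_count:
                best_count, best_box = count, (x, y, box_w, box_h)
    return best_box, best_count

# Toy stand-ins for SURF points and their IDT flow magnitudes.
flow_mag = np.array([0.1, 1.2, 0.8, 0.05, 2.0])
moving = flow_mag > 0.5          # moving if flow magnitude > 0.5 px
points = np.array([[300, 200], [310, 210], [320, 220], [10, 10], [305, 215]])
box, count = find_best_box(points, moving)
```

Here the three moving points cluster around (300–320, 200–220), so the selected window covers all of them.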
Three different networks are trained separately using 1.5M, 5M and 8M training samples.
Siamese Triplet Network
A Siamese-triplet network consists of three base networks that share the same parameters.
The final output of each base network is a 1024-dimensional feature f(·).
Top response regions for the pool5 neurons of our unsupervised-CNN.
Given an image patch $X$ as input, the network produces its final-layer feature $f(X)$. The distance between two image patches $X_{1}$, $X_{2}$ is then defined via the cosine distance in the feature space:

$$D(X_{1}, X_{2}) = 1 - \frac{f(X_{1}) \cdot f(X_{2})}{\lVert f(X_{1}) \rVert \, \lVert f(X_{2}) \rVert}$$
Formally, $X_{i}$ is the original query patch (the first patch in the tracked frames), $X_{i}^{+}$ is the tracked patch, and $X_{i}^{-}$ is a random patch from a different video. To enforce $D(X_{i}, X_{i}^{-}) > D(X_{i}, X_{i}^{+})$, a hinge-based ranking loss is used:

$$L(X_{i}, X_{i}^{+}, X_{i}^{-}) = \max\{0,\, D(X_{i}, X_{i}^{+}) - D(X_{i}, X_{i}^{-}) + M\}$$

where $M = 0.5$ is the gap parameter between the two distances.
The objective function is:

$$\min_{W}\; \frac{\lambda}{2} \lVert W \rVert_{2}^{2} + \sum_{i=1}^{N} \max\{0,\, D(X_{i}, X_{i}^{+}) - D(X_{i}, X_{i}^{-}) + M\}$$

where $N$ is the number of triplets of samples and $\lambda$ is a constant representing weight decay, set to $\lambda = 0.0005$.
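The distance and the per-triplet hinge loss can be written down directly. A minimal NumPy sketch (toy 2-d features standing in for the 1024-dimensional $f(\cdot)$; function names are mine, not the paper's):

```python
import numpy as np

def cosine_distance(f1, f2):
    # D(X1, X2) = 1 - cosine similarity of the features f(X1), f(X2)
    return 1.0 - np.dot(f1, f2) / (np.linalg.norm(f1) * np.linalg.norm(f2))

def triplet_loss(f_q, f_pos, f_neg, M=0.5):
    # max{0, D(query, tracked) - D(query, random) + M}
    return max(0.0, cosine_distance(f_q, f_pos) - cosine_distance(f_q, f_neg) + M)

f_q = np.array([1.0, 0.0])  # query patch feature (toy 2-d vector)
# Positive identical to the query, negative orthogonal: constraint satisfied.
loss_easy = triplet_loss(f_q, np.array([1.0, 0.0]), np.array([0.0, 1.0]))
# Positive orthogonal, negative identical: constraint badly violated.
loss_hard = triplet_loss(f_q, np.array([0.0, 1.0]), np.array([1.0, 0.0]))
```

`loss_easy` is 0 (the margin is already met), while `loss_hard` is 1.5 and would drive gradient updates.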
After 10 epochs of training with randomly selected negatives, hard negative mining is used: the negative patch is selected such that the loss is maximum.
Specifically, for each pair $\{X_{i}, X_{i}^{+}\}$, the loss over all other negative patches in a batch of $B = 100$ is calculated, and the top $K = 4$ negative patches with the highest losses are selected.
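The hard-negative selection amounts to scoring every candidate negative in the batch by its triplet loss and keeping the top K. A small sketch under the same toy-feature assumption as above (helper names are mine):

```python
import numpy as np

def cosine_distance(f1, f2):
    return 1.0 - np.dot(f1, f2) / (np.linalg.norm(f1) * np.linalg.norm(f2))

def hard_negative_indices(f_q, f_pos, batch_feats, K=4, M=0.5):
    """Score each candidate negative by its triplet loss and return the
    indices of the K highest-loss (hardest) negatives."""
    d_pos = cosine_distance(f_q, f_pos)
    losses = np.array([max(0.0, d_pos - cosine_distance(f_q, f_n) + M)
                       for f_n in batch_feats])
    return np.argsort(-losses)[:K]

f_q = np.array([1.0, 0.0])
f_pos = np.array([1.0, 0.0])
# Candidate negatives: the ones closest to the query are the hardest.
batch = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([0.7, 0.7])]
hardest = hard_negative_indices(f_q, f_pos, batch, K=2)
```

The identical patch (index 0) and the 45° patch (index 2) come back as the two hardest negatives; the orthogonal patch already satisfies the margin.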
After fine-tuning on PASCAL VOC 2012, these filters become quite strong
One straightforward approach is to directly apply the ranking model as a pre-trained network for the target task, with the fully connected layers initialized randomly.
Here, the convolutional parameters are transferred and then re-adapted: after fine-tuning on the target task, the network is re-learnt on the unsupervised triplet task and fine-tuned again.
After two iterations of this approach, the network converges.
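The alternation can be made explicit with a schematic loop. The two helpers below are stand-ins (hypothetical names) for full SGD training runs; here they only record which stage ran, to show the ordering of the scheme:

```python
# Schematic of the iterative fine-tune / re-adapt transfer scheme.
def finetune_on_target(stages):
    return stages + ["finetune"]       # stand-in for supervised fine-tuning

def readapt_to_triplet(stages):
    return stages + ["readapt"]        # stand-in for unsupervised re-adaptation

stages = ["unsup_pretrain"]            # start from the ranking model
for _ in range(2):                     # two iterations, after which it converges
    stages = finetune_on_target(stages)
    stages = readapt_to_triplet(stages)
stages = finetune_on_target(stages)    # final fine-tune on the target task
```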
There are billions of videos on YouTube, which opens up the possibility of training multiple CNNs using different sets of data.
Once these CNNs are trained, the fc7 features from each of them are concatenated to train the final SVM.
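The ensembling step is just a per-image concatenation of features before the SVM. A shape-only sketch (random arrays stand in for real fc7 activations; 4096-d assumes an AlexNet-style fc7):

```python
import numpy as np

rng = np.random.default_rng(0)
n_images = 10

# Stand-in fc7 features for the same n_images from three independently
# pre-trained CNNs (values are random; only the shapes matter here).
fc7_feats = [rng.standard_normal((n_images, 4096)) for _ in range(3)]

# Concatenate per image; this joint vector is what the final SVM is trained on.
ensemble_feats = np.concatenate(fc7_feats, axis=1)
```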
mean Average Precision (mAP) on VOC 2012. “external” column shows the number of patches used to pre-train unsupervised-CNN.
The detection pipeline introduced in R-CNN is followed: the CNN is fine-tuned on the detection data, and the fine-tuned CNN is then used to extract features, followed by training an SVM for each object class.
As a baseline, a network trained from scratch on the VOC 2012 dataset obtains 44% mAP.
Using the proposed unsupervised network pre-trained with 1.5M pairs of patches and then fine-tuned on VOC 2012, an mAP of 46.2% is obtained (unsup+ft, external data = 1.5M).
Using more data, 5M and 8M patches, for pre-training followed by fine-tuning, 47% and 47.5% mAP are achieved.
By ensembling two fine-tuned networks pre-trained with 1.5M and 5M patches, a boost of 3.5% over the single model is obtained, reaching 50.5% mAP (unsup+ft (2 ensemble)).
Finally, all three networks, pre-trained with different sets of data of sizes 1.5M, 5M and 8M respectively, are ensembled, giving another boost to 52% mAP (unsup+ft (3 ensemble)).
It is noted that ImageNet is a labelled dataset. When the ImageNet pre-trained model is used, 50.1% mAP (RCNN 70K) is obtained. Ensembling two of these networks gives 53.6% mAP (RCNN 70K (2 ensemble)); ensembling three gives 54.4% mAP.
Pre-train needs multi-task!!!
Starting from the fine-tuned model that used 5M patches in pre-training (unsup+ft, external = 5M), the network is re-learnt and re-adapted on the unsupervised triplet task, and then fine-tuned on VOC 2012 again. The final result for this single model is 48% mAP (unsup + iterative ft), which is 1% better than the initial fine-tuned network.
An mAP of 26.1% is obtained using the proposed unsupervised network (trained with 8M patches).
The ensemble of two unsupervised networks (trained with 5M and 8M patches) gets an mAP of 28.2%.
As a comparison, the ImageNet-pretrained network without fine-tuning gets an mAP of 40.4%.
We need fine-tune to obtain better representation!!!
The successful implementation opens up a new space for designing unsupervised learning algorithms for CNN training. (There are also results for surface normal estimation, please feel free to read the paper if interested.)
Sik-Ho Tang. Review — Unsupervised Learning of Visual Representations using Videos.