RoshanRane / PredNet-and-Predictive-Coding-A-Critical-Review

Performing video classification using the predictive processing architecture. The model is trained in a self-supervised manner to predict the next frames of videos, alongside the supervised video action classification task.

[Unsupervised Learning of Video Representations using LSTMs](https://arxiv.org/pdf/1502.04681.pdf) #10

Closed vageeshSaxena closed 6 years ago

vageeshSaxena commented 6 years ago

Agenda - 1) Use deep Long Short-Term Memory (LSTM) Encoder-Decoder architecture to learn representations of video sequences. 2) Get a qualitative understanding of what the LSTM learns to do. 3) Measure the benefit of initializing networks for supervised learning tasks with the weights found by unsupervised learning, especially with very few training examples. 4) Compare the different proposed models - Autoencoder, Future Predictor and Composite models and their conditional variants. 5) Compare with state-of-the-art action recognition benchmarks.

Dataset - 1) UCF-101, HMDB-51, moving MNIST, and a 300-hour YouTube dataset for the supervised task. 2) Sports-1M dataset for the unsupervised task.

Highlights - 1) Input sequence ---> LSTM encoder ---> fixed-length representation ---> LSTM decoder that decodes the input sequence + LSTM decoder that makes the final predictions (target sequence). 2) Image patches - for this, natural image patches, as well as a dataset of moving MNIST digits, are used. 3) High-level representations (“percepts”) - features are extracted with a convolutional net pre-trained on ImageNet. 4) The target sequence is the same as the input sequence, but in reverse order. Reversing the target sequence makes the optimization easier because the model can get off the ground by looking at low-range correlations. 5) The number of hidden units is fixed so that the model cannot learn trivial mappings for arbitrary-length input sequences.
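The encoder-decoder pipeline in point 1 can be sketched roughly as follows (a minimal PyTorch sketch; the class and layer names are illustrative, and the paper's 2048-unit setting is shrunk here for readability). One encoder LSTM compresses the input into a fixed-length state, which initializes two decoder LSTMs: one reconstructs the (reversed) input, one predicts future frames.

```python
import torch
import torch.nn as nn

class CompositeLSTM(nn.Module):
    """Encoder LSTM -> fixed-length state -> two decoder LSTMs:
    one reconstructs the input sequence, one predicts future frames."""
    def __init__(self, frame_dim, hidden=256):
        super().__init__()
        self.encoder = nn.LSTM(frame_dim, hidden, batch_first=True)
        self.dec_recon = nn.LSTM(frame_dim, hidden, batch_first=True)
        self.dec_future = nn.LSTM(frame_dim, hidden, batch_first=True)
        self.out_recon = nn.Linear(hidden, frame_dim)
        self.out_future = nn.Linear(hidden, frame_dim)

    def forward(self, x, n_future):
        # x: (batch, time, frame_dim); the encoder's final state is the code
        _, state = self.encoder(x)
        b, t, d = x.shape
        # unconditional variant: decoders receive zero inputs at every step
        h_r, _ = self.dec_recon(torch.zeros(b, t, d), state)
        h_f, _ = self.dec_future(torch.zeros(b, n_future, d), state)
        return self.out_recon(h_r), self.out_future(h_f)
```

In the conditional variants the decoders would instead be fed their own previous output frame rather than zeros.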

Experiments performed -
A) Experiments on MNIST - (moving digits bouncing off walls) - Success 1) Each digit was assigned a velocity whose direction was chosen randomly on a unit circle and whose magnitude was chosen uniformly at random over a fixed range. 2) LSTM with 2048 units, 10 frames for the encoder, 10 frames for the decoder and 10 frames for prediction. 3) Experimented with both single-layer and 2-layer Composite Models.
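The velocity initialization and wall bounce in point 1 can be sketched like this (NumPy sketch; the canvas/digit sizes and the speed range are assumptions, since neither the issue nor the summary states them):

```python
import numpy as np

def random_velocity(rng, speed_range=(2.0, 5.0)):
    # direction uniform on the unit circle, magnitude uniform over a fixed range
    theta = rng.uniform(0.0, 2.0 * np.pi)
    speed = rng.uniform(*speed_range)
    return speed * np.array([np.cos(theta), np.sin(theta)])

def step(pos, vel, lo=0.0, hi=64.0 - 28.0):
    # advance one frame; reflect off the walls of a 64x64 canvas
    # holding a 28x28 digit (positions are the digit's top-left corner)
    pos = pos + vel
    for i in range(2):
        if pos[i] < lo:
            pos[i] = 2 * lo - pos[i]
            vel[i] = -vel[i]
        elif pos[i] > hi:
            pos[i] = 2 * hi - pos[i]
            vel[i] = -vel[i]
    return pos, vel
```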

B) Experiment on Natural Image Patches - Failed 1) Training on 32x32 patches from the UCF-101 dataset. 2) The reconstructions and the predictions were both found to be very blurry. 3) Experimented with both single-layer and 2-layer Composite Models.

C) Out-of-domain Inputs - (multiple moving digits - 1 and 3 digits) - Failure 1) It was observed that for one moving digit the model does a good job, but it tries to hallucinate a second digit overlapping with the first one. 2) The second digit shows up towards the end of the future reconstruction. 3) For three digits, the model merges digits into blobs. 4) However, it does well at getting the overall motion right. This highlights a key drawback of modeling entire frames of input in a single pass. 5) Conclusion - the model needs to know about motion (which direction and how fast things are moving) from the input. This requires precise information about location (thin strips) and velocity (high-frequency strips). But when generating the output, the model wants to hedge its bets so that it does not suffer a huge loss for predicting things sharply at the wrong place.
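The hedging argument in point 5 can be illustrated numerically: under squared loss, a blurry average of two equally likely sharp futures scores better in expectation than committing to either one. A toy 1-D example (the pixel vectors are made up purely for illustration):

```python
import numpy as np

# two equally likely sharp futures: the digit ends up at pixel 3 or pixel 5
a = np.array([0., 0., 0., 1., 0., 0.])
b = np.array([0., 0., 0., 0., 0., 1.])

def expected_sq_loss(pred):
    # expected squared loss when each future occurs with probability 1/2
    return 0.5 * np.sum((pred - a) ** 2) + 0.5 * np.sum((pred - b) ** 2)

blurry = (a + b) / 2.0   # hedge: spread mass over both locations
sharp = a                # commit to one location
```

The hedged prediction halves the expected loss, which is exactly why the model prefers blur over a sharp guess at the wrong place.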

D) Action Recognition - (one action per video) 1) A 2-layer Composite Model with 2048 hidden units and no conditioning on either decoder was trained. 2) The model was trained on the 300-hour YouTube dataset. 3) The model was trained to autoencode 16 frames and predict the next 13 frames. 4) The LSTM classifier is initialized with the weights learned by the encoder LSTM of this model. 5) To get a prediction for the entire video, the predictions from all 16-frame blocks in the video are averaged, using a stride of 8 frames. Using a smaller stride did not improve results. 6) The baseline for comparing these models is an identical LSTM classifier but with randomly initialized weights. 7) Dropout regularization was applied by dropping activations as they were communicated across layers, but not through time within the same LSTM. 8) For the case of very few training examples, unsupervised learning gives a substantial improvement. 9) Test datasets - UCF-101 and HMDB-51.
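The block-averaging scheme in point 5 can be sketched as follows (hypothetical helper names; `block_predict` stands in for the trained LSTM classifier applied to one 16-frame block):

```python
import numpy as np

def video_prediction(frame_feats, block_predict, block_len=16, stride=8):
    """Average per-block class predictions over the whole video.
    frame_feats: (T, D) array of per-frame features;
    block_predict: maps a (block_len, D) block to a class-probability vector."""
    preds = []
    for start in range(0, len(frame_feats) - block_len + 1, stride):
        preds.append(block_predict(frame_feats[start:start + block_len]))
    return np.mean(preds, axis=0)
```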

Results - 1) For MNIST, the cross-entropy of the predictions was computed with respect to the ground truth, both of which are 64x64 patches. For natural image patches, squared loss was computed. The Composite Model always does a better job of predicting the future than the Future Predictor. This indicates that having the autoencoder along with the future predictor, which forces the model to remember more about the inputs, actually helps predict the future better. 2) All unsupervised models improve over the baseline LSTM, which is itself well-regularized by dropout. The Autoencoder model performs consistently better than the Future Predictor. The Composite Model, which combines the two, does better than either one alone. Conditioning on the generated inputs does not give a clear advantage over not doing so. The Composite Model with a conditional future predictor works best, although its performance is almost the same as that of the plain Composite Model.
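The two evaluation losses in point 1 can be written out as follows (NumPy sketch; the function names are my own, and the MNIST frames are assumed to be binarized so a per-pixel Bernoulli cross-entropy applies):

```python
import numpy as np

def pixel_cross_entropy(pred, target, eps=1e-8):
    # per-pixel Bernoulli cross-entropy for binarized 64x64 MNIST frames
    pred = np.clip(pred, eps, 1.0 - eps)
    return -np.mean(target * np.log(pred) + (1 - target) * np.log(1 - pred))

def squared_loss(pred, target):
    # mean squared error, used for real-valued natural image patches
    return np.mean((pred - target) ** 2)
```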

Future work - To get further improvements on supervised tasks, it is believed that the model can be extended by applying it convolutionally across patches of the video and stacking multiple layers of such models. Applying this model in the lower layers of a convolutional net could help extract motion information that would otherwise be lost across max-pooling layers. Also, models can be built on top of these autoencoders from the bottom up, instead of applying them only to percepts.