RoshanRane / PredNet-and-Predictive-Coding-A-Critical-Review

Performing video classification using the predictive processing architecture. The model is trained in a self-supervised manner to predict the next frames in videos, alongside the supervised video action classification task.

PredNet 2017 #1

Closed vageeshSaxena closed 6 years ago

vageeshSaxena commented 6 years ago

Deep Predictive Coding Networks for Video Prediction and Unsupervised Learning (https://arxiv.org/abs/1605.08104)

Problem statement - prediction of future frames in a video sequence as an unsupervised learning rule for learning about the structure of the visual world, using the PredNet architecture.

Network Architecture - a recurrent convolutional network (CNN-LSTM) with both bottom-up and top-down connections. Briefly, each module of the network consists of four basic parts: an input convolutional layer (A_l), a recurrent representation layer (R_l), a prediction layer (Â_l), and an error representation (E_l).
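
A minimal PyTorch sketch of one such module is given below. This is not the authors' implementation; the class names and channel arguments are illustrative, and the top-down input to R_l is omitted to keep the code short.

```python
# Sketch of a single PredNet module (layer l): representation R_l (ConvLSTM),
# prediction A_hat_l (convolution of R_l), and error E_l (rectified +/- split).
import torch
import torch.nn as nn
import torch.nn.functional as F


class ConvLSTMCell(nn.Module):
    """Convolutional LSTM used for the representation units R_l."""

    def __init__(self, in_channels, hidden_channels, kernel_size=3):
        super().__init__()
        padding = kernel_size // 2
        # A single convolution produces the input, forget, output, and cell gates.
        self.gates = nn.Conv2d(in_channels + hidden_channels,
                               4 * hidden_channels, kernel_size, padding=padding)

    def forward(self, x, state):
        h, c = state
        i, f, o, g = torch.chunk(self.gates(torch.cat([x, h], dim=1)), 4, dim=1)
        c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
        h = torch.sigmoid(o) * torch.tanh(c)
        return h, c


class PredNetLayer(nn.Module):
    """The four parts of a PredNet module: target A_l, representation R_l,
    prediction A_hat_l, and error E_l."""

    def __init__(self, a_channels, r_channels):
        super().__init__()
        # R_l is driven here only by last step's error E_l (2 * a_channels after
        # the +/- split); the top-down input from R_{l+1} is omitted in this sketch.
        self.r_cell = ConvLSTMCell(2 * a_channels, r_channels)
        # A_hat_l: prediction computed from the representation.
        self.pred_conv = nn.Conv2d(r_channels, a_channels, 3, padding=1)

    def forward(self, a, prev_error, r_state):
        # 1. Update the representation R_l from the previous step's error.
        h, c = self.r_cell(prev_error, r_state)
        # 2. Form the prediction A_hat_l.
        a_hat = F.relu(self.pred_conv(h))
        # 3. Error E_l: rectified positive and negative differences, concatenated.
        error = torch.cat([F.relu(a - a_hat), F.relu(a_hat - a)], dim=1)
        return error, (h, c)
```

In the full model, A_0 is the input frame itself, and each higher-layer target A_{l+1} is computed from the lower layer's error E_l via a convolution followed by max-pooling.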

Key Concept - Top-down connections convey predictions that are compared against actual observations to generate an error signal. The error signal is then propagated back up the hierarchy, eventually leading to an update of the predictions. Each layer consists of representation neurons (R_l), which output a layer-specific prediction at each time step (Â_l^t) that is compared against a target (A_l^t) to produce an error term (E_l^t), which is then propagated laterally and vertically through the network.
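
The training signal is a weighted sum of the error unit activations across layers and time steps; the layer weights λ_l give the L_0 and L_all variants discussed in the experiments below. A minimal sketch, assuming the errors are available as tensors (the function name and the exact normalization are illustrative):

```python
def prednet_loss(errors, layer_weights, time_weights):
    """errors[t][l]: rectified error tensor E_l^t of shape [batch, channels, H, W]."""
    loss = 0.0
    for t, per_layer in enumerate(errors):
        for l, e in enumerate(per_layer):
            # lambda_t down-weights the (unpredictable) first frame;
            # lambda_l selects the L_0 vs L_all training variant.
            loss = loss + time_weights[t] * layer_weights[l] * e.mean()
    return loss


# Illustrative weighting in the spirit of the paper's L_all setting on a
# 4-layer model over 10 frames: full weight on the lowest layer, 0.1 on the
# upper layers, and zero weight on the first frame.
layer_weights = [1.0, 0.1, 0.1, 0.1]
time_weights = [0.0] + [1.0] * 9
```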

Benchmark - the same model architecture, but with the layer predictions (rather than the errors) passed up the hierarchy.

Experiments Performed -

A) Rendered Image Sequences: 1) Synthetic sequences were chosen as the initial training set in order to better understand what is learned in different layers of the model, specifically with respect to the underlying generative model. 2) First, the trained models learn a representation that generally permits a better linear decoding of the underlying latent factors than a randomly initialized model, with the most striking difference for the pan rotation speed (α_pan). Second, the most notable difference between the L_all and L_0 versions occurs on the first principal component, where the model trained with a loss on all layers has a higher decoding accuracy than the model trained with the loss only on the lowest layer. 3) The latent variable decoding analysis suggests that the model learns a representation that may generalize well to other tasks for which it was not explicitly trained (a hedged sketch of this decoding analysis is given after experiment C below).

B) Static Face Classification task: 1) Benchmarks - a standard autoencoder and a variant of the Ladder network, with the same configuration as the PredNet. 2) The Ladder network variant has lateral and top-down streams that are combined with a convolutional combinator function. 3) Altogether, these results suggest that predictive training with the PredNet can be a viable alternative to other models trained with a more traditional reconstructive or denoising loss, and that the relative layer loss weightings (λ_l's) may be important for the particular task at hand.

C) Natural Image Sequences: 1) As a testbed, car-mounted camera videos were chosen, since these videos span a wide range of settings and are characterized by rich temporal dynamics, including both self-motion of the vehicle and the motion of other objects in the scene. 2) Training dataset - KITTI dataset. 3) A random hyperparameter search, with model selection based on the validation set, resulted in a 4-layer model with 3x3 convolutions and layer channel sizes of (3, 48, 96, 192). 4) Testing dataset - CalTech Pedestrian dataset. 5) Benchmark - a CNN-LSTM encoder-decoder model. 6) The elementwise subtraction operation in the PredNet seems to be beneficial, and the nonlinearity of the positive/negative splitting also adds modest improvements. 7) To check the implicit encoding of latent parameters, the internal representation of the PredNet was used to estimate the steering angle. 8) Dataset - Comma.ai. 9) The network was first trained for next-frame prediction, and a linear fully-connected layer was then fit on the learned representation to estimate the steering angle, using an MSE loss (see the readout sketch below).
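
For reference, the linear decoding analysis from experiment A can be sketched roughly as follows, assuming the PredNet representations and the ground-truth latent factors have been exported to arrays. Ridge regression is used here only as an illustrative linear decoder, and the file names are placeholders.

```python
# Fit a simple linear decoder on learned representations and report how well
# the underlying latent factors (e.g. alpha_pan) can be read out.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Hypothetical arrays: one row per sequence; features are flattened R_l units,
# targets are the generative latent factors of the rendered sequences.
features = np.load("prednet_representations.npy")   # shape: (n_sequences, n_units)
latents = np.load("latent_factors.npy")             # shape: (n_sequences, n_factors)

X_train, X_test, y_train, y_test = train_test_split(
    features, latents, test_size=0.2, random_state=0)

decoder = Ridge(alpha=1.0).fit(X_train, y_train)
print("Decoding R^2 per latent factor:",
      r2_score(y_test, decoder.predict(X_test), multioutput="raw_values"))
```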
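The steering-angle readout from point 9 can likewise be sketched as a single linear layer trained with an MSE loss on top of frozen representations; the function below is a hypothetical illustration, not the authors' code.

```python
# Fit a linear readout from frozen PredNet representations to steering angles.
import torch
import torch.nn as nn


def fit_steering_readout(representations, angles, epochs=50, lr=1e-3):
    """representations: [N, D] float tensor of flattened PredNet R-layer states.
    angles: [N] float tensor of ground-truth steering angles (Comma.ai)."""
    readout = nn.Linear(representations.shape[1], 1)
    optimizer = torch.optim.Adam(readout.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        optimizer.zero_grad()
        pred = readout(representations).squeeze(1)
        loss = loss_fn(pred, angles)
        loss.backward()
        optimizer.step()
    return readout
```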

Results - 1) The PredNet model outperforms the model by Brabandere et al. (2016) by 29%. 2) Without re-optimizing hyperparameters, the model underperforms the concurrently developed DNA model by Finn et al. (2016), but outperforms the model by Mathieu et al. (2016). 3) Learning to predict how an object or scene will move in a future frame confers advantages in decoding latent parameters (such as viewing angle) that give rise to an object's appearance, and can improve recognition performance. 4) Prediction can serve as a powerful unsupervised learning signal, since accurately predicting future frames requires at least an implicit model of the objects that make up the scene and how they are allowed to move.

RoshanRane commented 6 years ago

Critiques:

  1. The experiment on the synthetic dataset to test whether the model learns the latent features seems shallow. The SVM classifier they use on top of the PredNet representations might be powerful enough to have learnt the latent features itself. Also, the latent features were not very difficult ones. They could have compared against their baseline model's representations.
  2. The datasets used are not very challenging, because car-mounted camera data does not have much variation in objects and backgrounds.
  3. The slight improvements that they report over the baseline could also be due to the extra ReLU units in the error propagation path. The argument is not convincing.
  4. The comparison of the L_0 and L_all models is not fair because it is done on the reconstruction loss. A model trained on L_0 will obviously reconstruct better, but that does not necessarily mean that the L_all model has not learned a better latent-space representation.