Closed by sarimmehdi 4 years ago
Hi,
> ... where you train a predictor to give input to the decoder is very useful

Sorry, I don't understand this: we don't have a decoder in the model, since we are not reconstructing. Could you clarify?

The predictor function takes the hidden vector from time step t, computes the future representation from it, and uses that as input to a ConvGRU cell at the next time step t+1. From t+1 onwards is also when the classification task is done: the representation c_{t+1} is passed through pooling and an fc layer for action classification. I just thought that is kind of like a decoder?
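To make sure we're talking about the same mechanism, here is a minimal numpy sketch of the prediction loop as I understand it: aggregate observed block features into a context, then autoregressively predict future representations and feed them back into the aggregator. The functions `aggregate` and `predict` are toy stand-ins with random weights (the real model uses a ConvGRU over feature maps and a learned predictor network), so this is illustration only, not the repo's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8  # toy feature dimension (the real model uses ConvGRU feature maps)

# Random-weight stand-ins for the learned modules (illustration only):
W_agg = rng.standard_normal((D, 2 * D)) * 0.1   # aggregator: c_t = f(c_{t-1}, z_t)
W_pred = rng.standard_normal((D, D)) * 0.1      # predictor: z_hat_{t+1} = phi(c_t)

def aggregate(c_prev, z_t):
    """Simplified stand-in for the ConvGRU aggregation step."""
    return np.tanh(W_agg @ np.concatenate([c_prev, z_t]))

def predict(c_t):
    """Stand-in predictor: maps context c_t to the next block's feature."""
    return np.tanh(W_pred @ c_t)

# Observed phase: aggregate block features z_1..z_5 into context c_t.
z_blocks = [rng.standard_normal(D) for _ in range(5)]
c = np.zeros(D)
for z in z_blocks:
    c = aggregate(c, z)

# Prediction phase: from c_t, predict future representations autoregressively;
# each predicted z_hat is fed back into the aggregator for the next step.
future = []
for _ in range(3):  # predict 3 future steps
    z_hat = predict(c)
    future.append(z_hat)
    c = aggregate(c, z_hat)

print(len(future), future[0].shape)  # 3 predicted feature vectors of size D
```

In the classification setup discussed above, it would be the aggregated context (e.g. c_{t+1}) that goes through pooling and an fc layer.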
Also, do you think 5 frames per time step is the optimal choice, or could I get better results (for my task at least) with fewer, say 3 or even 2 frames?
If you only use the representation c_t, what's the point of training for representations after that? Also, how long did training take on your dataset? I am working on the KITTI dataset with a sequence length of 5 images (a block of 5 images spans 30 seconds due to the low fps), and after 10 epochs the validation loss doesn't go below 4.01 (though there is a very slow improvement in top-1 accuracy).
I am using Google Colab to train on 8008 KITTI images, and I artificially increase the dataset length with a sliding-window approach: the first sequence contains images 0 to 49, the next contains images 1 to 50, and the window advances forward in time by 1 image.
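The sliding-window indexing I'm using can be sketched like this (the function name `sliding_windows` and the window size of 50 are just my setup, not anything from the repo):

```python
def sliding_windows(num_images, window, stride=1):
    """Return (start, end) index pairs for overlapping sequences.

    With window=50 and stride=1: sequence 0 covers images 0..49,
    sequence 1 covers images 1..50, and so on.
    """
    return [(s, s + window - 1) for s in range(0, num_images - window + 1, stride)]

windows = sliding_windows(8008, 50)
print(len(windows))            # 7959 sequences from 8008 images
print(windows[0], windows[1])  # (0, 49) (1, 50)
```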
The intuition is: if c_t is capable of predicting the long-term future, then c_t must have encoded a strong representation of the video. On either UCF101 or Kinetics400, the training loss initially decreases and then saturates for a while; at some point, the model automatically learns to distinguish harder distractors. Here is an example curve; note that there was NO learning-rate change and NO hyperparameter tuning:

As for time, in the paper's setting (128x128 or 224x224 RGB video input), with one modern GPU you will normally see a sudden decrease in the loss within half an hour, indicating that the model has discovered something. Since your setting is different, be patient and give it a try.

I'm not sure what you mean by sequence sampling. As long as the multiple video blocks within each video don't share repeated frames with each other, it should be fine. Otherwise, there is a risk of cheating on the prediction task.
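The "no repeated frames" condition above can be made concrete with a small sketch: split a video into consecutive, non-overlapping blocks, and refuse videos too short to supply enough distinct frames. The function `sample_blocks` and its signature are my own illustration, assuming contiguous blocks with no gaps (real pipelines may also add spacing or random offsets between blocks).

```python
def sample_blocks(num_frames, num_blocks, block_len):
    """Split a video into consecutive, non-overlapping blocks of frame indices.

    Returns None if the video is too short to supply num_blocks * block_len
    distinct frames (i.e. blocks would have to share frames and risk cheating).
    """
    needed = num_blocks * block_len
    if num_frames < needed:
        return None
    blocks = [list(range(i * block_len, (i + 1) * block_len))
              for i in range(num_blocks)]
    # Sanity check: no frame index appears in two blocks.
    flat = [f for b in blocks for f in b]
    assert len(flat) == len(set(flat))
    return blocks

blocks = sample_blocks(num_frames=40, num_blocks=8, block_len=5)
print(blocks[0], blocks[-1])  # first block 0..4, last block 35..39
```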
Thank you very much. That cleared up a lot of confusion!
Hello. I wanted to know whether you ran any experiments with just one image at every time step (so seq_len is 1 instead of 5). In my application (predicting the trajectory of cars, in terms of bounding-box coordinates, from a number of input frames), consecutive frames are not very different, so I wanted to know whether extracting only spatial features at every time step (instead of also taking temporal features into account) would make a drastic difference when using your network.
I understand that you used your network for action classification, but I think the part where you train a predictor to give input to the decoder is very useful. Please let me know what you think about such a change to your network.