Closed by sarimmehdi 4 years ago
Hi,
> ... where you train a predictor to give input to the decoder is very useful

Sorry, I don't understand this: we don't have a decoder in the model, since we are not reconstructing. Could you clarify?

The predictor function takes the hidden vector from time step t, computes the future representation from it, and uses that as input to a ConvGRU cell at the next time step t+1. From t+1 onwards is also when the classification task is done: the representation c_{t+1} is passed through pooling and an fc layer for action classification. I just thought that is kind of like a decoder?
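To make sure we're talking about the same mechanism, here is a minimal numpy sketch of the prediction loop as I understand it: aggregate observed block features into a context, then autoregressively predict future representations and feed them back into the aggregator. The functions `aggregate` and `predict` are toy stand-ins with random weights (the real model uses a ConvGRU over feature maps and a learned predictor network), so this is illustration only, not the repo's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8  # toy feature dimension (the real model uses ConvGRU feature maps)

# Random-weight stand-ins for the learned modules (illustration only):
W_agg = rng.standard_normal((D, 2 * D)) * 0.1   # aggregator: c_t = f(c_{t-1}, z_t)
W_pred = rng.standard_normal((D, D)) * 0.1      # predictor: z_hat_{t+1} = phi(c_t)

def aggregate(c_prev, z_t):
    """Simplified stand-in for the ConvGRU aggregation step."""
    return np.tanh(W_agg @ np.concatenate([c_prev, z_t]))

def predict(c_t):
    """Stand-in predictor: maps context c_t to the next block's feature."""
    return np.tanh(W_pred @ c_t)

# Observed phase: aggregate block features z_1..z_5 into context c_t.
z_blocks = [rng.standard_normal(D) for _ in range(5)]
c = np.zeros(D)
for z in z_blocks:
    c = aggregate(c, z)

# Prediction phase: from c_t, predict future representations autoregressively;
# each predicted z_hat is fed back into the aggregator for the next step.
future = []
for _ in range(3):  # predict 3 future steps
    z_hat = predict(c)
    future.append(z_hat)
    c = aggregate(c, z_hat)

print(len(future), future[0].shape)  # 3 predicted feature vectors of size D
```

In the classification setup discussed above, it would be the aggregated context (e.g. c_{t+1}) that goes through pooling and an fc layer.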
Also, do you think 5 frames per time step is the optimal choice, or could I get better results (for my task at least) with fewer, say 3 or even 2 frames?
If you only use the representation c_t, what's the point of training for representations after that? Also, how long did training take on your dataset? I am working on the KITTI dataset with a sequence length of 5 images (a block of 5 images spans 30 seconds due to the low fps), and after 10 epochs the validation loss doesn't go below 4.01 (though there is a very slow improvement in top-1 accuracy).
I am using Google Colab to train on 8008 KITTI images, and I artificially increase the dataset length with a sliding-window approach: the first sequence contains images 0 to 49, the next contains images 1 to 50, and the window advances forward in time by 1 image.
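The sliding-window indexing I'm using can be sketched like this (the function name `sliding_windows` and the window size of 50 are just my setup, not anything from the repo):

```python
def sliding_windows(num_images, window, stride=1):
    """Return (start, end) index pairs for overlapping sequences.

    With window=50 and stride=1: sequence 0 covers images 0..49,
    sequence 1 covers images 1..50, and so on.
    """
    return [(s, s + window - 1) for s in range(0, num_images - window + 1, stride)]

windows = sliding_windows(8008, 50)
print(len(windows))            # 7959 sequences from 8008 images
print(windows[0], windows[1])  # (0, 49) (1, 50)
```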
The intuition is: if c_t is capable of predicting the long-term future, then c_t must have encoded a strong representation of the video. On either UCF101 or Kinetics400, the training loss initially decreases and then saturates for a while; at some point, the model automatically learns to distinguish harder distractors. Here is an example curve; note that there was NO learning-rate change and NO hyperparameter tuning:

As for time, in the paper's setting (128x128 or 224x224 RGB video input), with one modern GPU you will normally see a sudden decrease in the loss within half an hour, indicating that the model has discovered something. Since your setting is different, be patient and give it a try.

I'm not sure what you mean by sequence sampling. As long as the multiple video blocks within each video don't share repeated frames with each other, it should be fine. Otherwise, there is a risk of cheating on the prediction task.
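The "no repeated frames" condition above can be made concrete with a small sketch: split a video into consecutive, non-overlapping blocks, and refuse videos too short to supply enough distinct frames. The function `sample_blocks` and its signature are my own illustration, assuming contiguous blocks with no gaps (real pipelines may also add spacing or random offsets between blocks).

```python
def sample_blocks(num_frames, num_blocks, block_len):
    """Split a video into consecutive, non-overlapping blocks of frame indices.

    Returns None if the video is too short to supply num_blocks * block_len
    distinct frames (i.e. blocks would have to share frames and risk cheating).
    """
    needed = num_blocks * block_len
    if num_frames < needed:
        return None
    blocks = [list(range(i * block_len, (i + 1) * block_len))
              for i in range(num_blocks)]
    # Sanity check: no frame index appears in two blocks.
    flat = [f for b in blocks for f in b]
    assert len(flat) == len(set(flat))
    return blocks

blocks = sample_blocks(num_frames=40, num_blocks=8, block_len=5)
print(blocks[0], blocks[-1])  # first block 0..4, last block 35..39
```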
Thank you very much. That cleared up a lot of confusion!
Hello. I wanted to know whether you ran any experiments with just one image at every time step (so seq_len is 1 instead of 5). In my application (predicting the trajectory of cars, in terms of bounding-box coordinates, from a number of input frames), consecutive frames are not very different, so I wanted to know whether extracting only spatial features at every time step (instead of also taking temporal features into account) would make a drastic difference when using your network.
I understand that you used your network for action classification, but I think the part where you train a predictor to give input to the decoder is very useful. Please let me know what you think about such a change to your network.