TengdaHan / DPC

Video Representation Learning by Dense Predictive Coding. Tengda Han, Weidi Xie, Andrew Zisserman.
MIT License

Possible reasons for loss function not going down #7

Closed sarimmehdi closed 4 years ago

sarimmehdi commented 4 years ago

Hello. I am using your network for self-supervised representation learning on the KITTI dataset. Even after 22 epochs, the loss and top-1 accuracy barely change. What could explain this?

TengdaHan commented 4 years ago

What downstream task are you trying to do on KITTI?

sarimmehdi commented 4 years ago

My idea is to extract features from the first N frames and, after pooling and a fully connected layer, concatenate them with the corresponding encoded bounding-box features. (The bounding-box features come from a traditional RNN encoder, not from your network.) The concatenated features are then fed to a decoder at each time step to predict bounding box coordinates.

sarimmehdi commented 4 years ago

I tried changing many hyperparameters, such as the learning rate and weight decay (both increasing and decreasing them). My training set has 5400 images and my validation set has 2160. With a batch size of 16, that is not many image sequences to train on, especially with num_seq set to 8 and seq_len set to 5 (I also tried 3).
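To put rough numbers on "not many image sequences", here is a minimal sketch of the arithmetic. It assumes the 5400 and 2160 figures count individual frames, that each training sample consumes num_seq * seq_len consecutive frames, that samples do not overlap, and that there is no temporal downsampling. The helper names `count_windows` and `batches_per_epoch` are made up for illustration and are not part of the DPC repository. KITTI is split into separate drives, so the real count depends on how the custom dataset class slices them.

```python
from typing import Optional


def count_windows(num_frames: int, num_seq: int, seq_len: int,
                  stride: Optional[int] = None) -> int:
    """Number of samples of num_seq * seq_len consecutive frames.

    stride is the step between window starts; the default (None) means
    non-overlapping windows. This is an assumed sampling scheme.
    """
    window = num_seq * seq_len
    if stride is None:
        stride = window
    if num_frames < window:
        return 0
    return (num_frames - window) // stride + 1


def batches_per_epoch(num_samples: int, batch_size: int,
                      drop_last: bool = True) -> int:
    """Iterations per epoch, with or without the final partial batch."""
    if drop_last:
        return num_samples // batch_size
    return -(-num_samples // batch_size)  # ceiling division


train_samples = count_windows(5400, num_seq=8, seq_len=5)
val_samples = count_windows(2160, num_seq=8, seq_len=5)
print(train_samples, batches_per_epoch(train_samples, batch_size=16))
print(val_samples, batches_per_epoch(val_samples, batch_size=16))
```

Under these assumptions an epoch is only a handful of gradient steps, so epoch counts alone say little about how much training has actually happened. Passing a smaller `stride` shows how much overlapping windows would increase the sample count.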

No matter how I change the hyperparameters, the loss reaches about 3.8 and the top-1 accuracy reaches 0.220 within the first 10 epochs, and after that there is little to no change. I even let training run for 300 epochs, but the loss and top-1 accuracy stayed the same until the end.
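One way to interpret a plateau like this is to compare it with the chance level of the contrastive classification task. The sketch below assumes the loss is a standard cross-entropy over K candidates, so a model guessing uniformly would score a loss of ln(K) and a top-1 accuracy of 1/K. The formula used for K (batch size × predicted steps × spatial positions) and the values `pred_step=3` and `spatial_size=4` are assumptions for illustration; the actual K should be read off the shape of the score tensor in the training code.

```python
import math


def num_candidates(batch_size: int, pred_step: int, spatial_size: int) -> int:
    """Assumed size of the contrastive candidate pool per prediction."""
    return batch_size * pred_step * spatial_size ** 2


def chance_level(k: int) -> tuple:
    """Cross-entropy loss and top-1 accuracy of a uniform guess over k classes."""
    return math.log(k), 1.0 / k


# Hypothetical settings: batch size 16 as in the post, the rest assumed.
k = num_candidates(batch_size=16, pred_step=3, spatial_size=4)
chance_loss, chance_acc = chance_level(k)
print(f"K={k}, chance loss={chance_loss:.3f}, chance top-1={chance_acc:.4f}")

# A cross-entropy of L is what a uniform guess over exp(L) candidates scores.
print(f"loss 3.8 corresponds to about {math.exp(3.8):.1f} equally likely candidates")
```

If the observed loss and accuracy are clearly better than chance for the real K, the model is picking up some signal before it stalls, which points towards the task or data rather than a broken training loop.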

I haven't changed your architecture at all: I just cloned your repository and ran the code, apart from writing a different CustomDataset class to load the KITTI images. Is it possible that your self-supervision approach doesn't work on datasets like KITTI? I am quite new to all this (I started with neural networks two months ago), so I can't work out the real reason. Maybe you have a better idea of what could be going wrong?

Thanks

TengdaHan commented 4 years ago

Hi. Several reasons.

  1. The motion you want to encode (a change of a few pixels) may be too small for DPC. DPC aims to learn high-level representations, e.g. at the action-class level, and we deliberately avoid learning low-level features such as appearance and texture. But self-supervised tracking probably relies more on these lower-level features.
  2. For video object tracking, I recommend checking out CorrFlow and its paper. That could be more relevant to your task.

sarimmehdi commented 4 years ago

Thank you very much for your help. I will definitely take a look.