HumamAlwassel / TSP

TSP: Temporally-Sensitive Pretraining of Video Encoders for Localization Tasks (ICCVW 2021)
http://humamalwassel.com/publication/tsp/
MIT License

LOSS does not decrease during training #15

ZChengLong578 closed this issue 2 years ago

ZChengLong578 commented 2 years ago

My dataset is small: 1500 videos, all under 10 seconds in length. The current training results of this model are as follows: [screenshot: training metrics]

The experimental settings adopted are Batch_size=32 and FACTOR=2. Is this situation normal? If it is abnormal, what should be done?

HumamAlwassel commented 2 years ago

Hi @ZChengLong578,

Thanks for your interest in our work.

Can you give me more info about your setup?

1) I see that your temporal-region-label accuracy is 100% from the first epoch (which is strange). Are you training on trimmed or untrimmed videos? It seems that you are training on trimmed videos. TSP is designed for untrimmed videos, where you have segments of both action and no action (background).

2) What is your train/validation split? How many clips are you sampling from each video in one epoch?

3) How many action classes do you have?

4) What learning rate are you using? If you have a small training dataset, I'd recommend using a smaller LR and fine-tuning directly from the released pretrained TSP-on-ActivityNet model instead of the TAC-on-Kinetics model.

Cheers, Humam

ZChengLong578 commented 2 years ago

Hi @HumamAlwassel, thanks for your reply!

The train/val split of my dataset is 8:3. Since there are only a few no-action (background) segments, the temporal-region-label accuracy is 100%. About 3-5 clips are sampled from each video, and there are 44 action categories in total. The current learning rates are BACKBONE_LR=0.0001 and FC_LR=0.002, with BACKBONE=r2plus1d_34.

While debugging the model, I wrote out the log files. The train log: [screenshot: train log] The val log: [screenshot: val log]

From these, we judged that the problem is overfitting caused by the small amount of training data, and we chose to add a Dropout layer to address it. Setting p=0.2 and p=0.5 both failed to solve the problem. Next, we plan to simplify the network structure. Do you have any suggestions?
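For reference, a minimal sketch of the Dropout experiment described above, using torchvision's R(2+1)D-18 as a stand-in for the TSP model (the `fc` attribute name and the placement of the Dropout layer are assumptions here; the TSP repo wraps its own model class with a different head structure):

```python
import torch.nn as nn
from torchvision.models.video import r2plus1d_18

# Build a video backbone (illustrative stand-in, not the TSP code).
model = r2plus1d_18(pretrained=True)

# Replace the classifier head with Dropout + a new FC layer.
# p=0.5 and p=0.2 are the two values tried above; 44 is the number
# of action categories mentioned above.
num_classes = 44
model.fc = nn.Sequential(
    nn.Dropout(p=0.5),
    nn.Linear(model.fc.in_features, num_classes),
)
```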

HumamAlwassel commented 2 years ago

I see! Yes, there is definitely an overfitting issue due to the small size of the training subset.

Here are a few things you can try to counter this overfitting:

1) Download the R(2+1)D-34 TSP-on-ActivityNet-pretrained model from here and use it as initialization for your training. Pass the checkpoint to the --resume argument. You might need to overwrite the epoch number to reset it to 0 after reading the checkpoint (or you can simply read the checkpoint file in a Jupyter notebook, edit the epoch number, and overwrite the file; see the sketch after this list). You should drop BACKBONE_LR significantly since this model is pretrained on the same task; I recommend starting with BACKBONE_LR=0.00001. FC_LR should stay around the same magnitude (e.g. FC_LR=0.002) because the FC layers will be randomly initialized to accommodate your dataset.

2) You can try training with the smaller R(2+1)D-18 encoder instead of the default R(2+1)D-34. You can start from the TAC-on-Kinetics-pretrained weights (by simply setting BACKBONE=r2plus1d_18), or, similarly to (1), start your training from the R(2+1)D-18 TSP-on-ActivityNet-pretrained model, which you can download from here. If you try the latter, you should also drop BACKBONE_LR and reset the epoch number.

3) I also recommend increasing the number of clips per video as much as you can to allow for a larger training set.
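A minimal sketch of the epoch-number edit mentioned in (1), assuming the checkpoint is a dict that stores the epoch under an 'epoch' key (the filename here is hypothetical; inspect your checkpoint's actual keys before overwriting the file):

```python
import torch

# Path to the downloaded TSP-on-ActivityNet checkpoint (hypothetical name).
ckpt_path = 'r2plus1d_34-tsp_on_activitynet.pth'

checkpoint = torch.load(ckpt_path, map_location='cpu')

# Reset the epoch counter so training starts from epoch 0 instead of
# resuming from the pretrained model's final epoch.
checkpoint['epoch'] = 0

# Overwrite the file so it can be passed to --resume unchanged.
torch.save(checkpoint, ckpt_path)
```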

Hope these help!

ZChengLong578 commented 2 years ago

Hi @HumamAlwassel, thank you for your reply! I'm sorry to trouble you again.

I tried R(2+1)D-18, and the running results are as follows: [screenshot: train log] [screenshot: val log] As you can see, the overfitting problem is still unsolved, and the accuracy has decreased by 20%.

I then tried the R(2+1)D-18 TSP-on-ActivityNet-pretrained model, but after adjusting for the FC layer I got the following error: ValueError: loaded state dict has a different number of parameter groups

I also found that BatchNorm3d() can help prevent overfitting. It is already used in the code, but the overfitting problem still occurs. What are appropriate values for the BatchNorm3d() parameters eps and momentum?

HumamAlwassel commented 2 years ago

Hi @ZChengLong578

Happy new year, and sorry for the delayed response.

> ValueError: loaded state dict has a different number of parameter groups

I think you might have messed up the structure of the state_dict in the checkpoint. I suggest following the same strategy we use here to remove the pretrained weights of the FC layer and only load the pretrained backbone weights. Ensure that the model is created with the correct num_classes for your dataset.
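A minimal sketch of that strategy, under the assumptions that the checkpoint stores the weights under a 'model' key and that the classifier parameters have an 'fc' name prefix (both are assumptions; inspect your checkpoint for its actual layout):

```python
import torch
from torchvision.models.video import r2plus1d_18

# Create the model with the correct num_classes for the new dataset.
model = r2plus1d_18(num_classes=44)

checkpoint = torch.load('r2plus1d_18-tsp_on_activitynet.pth', map_location='cpu')
state_dict = checkpoint.get('model', checkpoint)  # fall back if weights are stored flat

# Drop the pretrained FC weights so only the backbone is loaded and the
# new FC layers keep their random initialization.
backbone_only = {k: v for k, v in state_dict.items() if not k.startswith('fc')}

# strict=False tolerates the FC entries that are intentionally missing.
model.load_state_dict(backbone_only, strict=False)
```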

> What are appropriate values for the BatchNorm3d() parameters eps and momentum?

Here are the values we use in the code.
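For reference, PyTorch's nn.BatchNorm3d defaults are eps=1e-5 and momentum=0.1; whether the TSP code overrides them should be checked against the linked file. A quick sketch:

```python
import torch.nn as nn

# eps guards against division by zero in the variance normalization;
# momentum controls how fast the running mean/variance are updated.
bn = nn.BatchNorm3d(num_features=64, eps=1e-5, momentum=0.1)
```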

I'm going to close the issue, but feel free to continue posting any further questions here, and I'll get back to you as soon as I can.

Cheers!