zixuwang1996 closed this issue 3 years ago
Hey! So this code was not meant to fine-tune any pretrained model but just to train a model from scratch, so it's possible the optimization hyperparameters are not the best (especially the learning rate). I suggest you try a much smaller learning rate for fine-tuning the provided checkpoint.
Thank you for your prompt reply!
Sorry, one separate question: is the size of the S3D Howto100m features [seconds, 1024]? This is different from how you preprocess the video in MIL-NCE training, right? In the model, the video is preprocessed according to num_frames and fps to get the number of seconds in a single clip (seconds = num_frames / fps), if I am not misunderstanding.
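For reference, the clip-length arithmetic described in this question can be written out as a tiny sketch. The function name is hypothetical, the frame rate is called `fps` here, and the values 32 and 10 are illustrative only, not necessarily the repository's defaults. It does not settle whether the released features are sampled once per second.

```python
def clip_duration_seconds(num_frames: int, fps: float) -> float:
    """Length in seconds of a clip of `num_frames` frames sampled at `fps`."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return num_frames / fps


# Illustrative values only: 32 frames sampled at 10 frames per second.
print(clip_duration_seconds(32, 10))  # 3.2
```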
Hi, thanks for this very useful code.
When I tried to train the model starting from the pre-trained S3D_Howto100m weights, the model quickly started outputting all NaN for the video and text embeddings, after 132 steps with batch size 1024, which is very strange (the same happens with the different learning rates I tested). I also found that the provided checkpoint contains only the network weights, with no other hyperparameters such as the learning rate.
Could you please share the hyperparams after pretraining (maybe this could be the issue)? Also it would be much appreciated if you could shed some light on the bugs I got.
PS. I use the provided S3D video features and only keep the very last linear layer for training the video encoder.
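For anyone hitting the same NaN problem, here is a minimal PyTorch sketch of the advice given in this thread (a much smaller learning rate when fine-tuning), combined with two common safeguards that are my own additions and not something the author suggested: skipping updates on a non-finite loss and clipping the gradient norm. Everything named here is an assumption for illustration: `video_head` stands in for the last linear layer of the video encoder, the 512-d output size and `lr=1e-5` are arbitrary, and the loss is a plain cross-entropy placeholder, not the MIL-NCE loss from the repository.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-in for the trainable last linear layer of the video encoder:
# 1024-d S3D features in, 512-d joint embedding out (512 is an assumption).
video_head = nn.Linear(1024, 512)

# A much smaller learning rate than one would use from scratch (illustrative value).
optimizer = torch.optim.Adam(video_head.parameters(), lr=1e-5)


def train_step(video_feats, text_emb, max_grad_norm=1.0):
    """Run one update; return the loss value, or None if the step was skipped."""
    optimizer.zero_grad()
    video_emb = video_head(video_feats)
    # Placeholder loss, NOT MIL-NCE: cross-entropy over video-text similarities,
    # treating the i-th text as the positive for the i-th video.
    logits = video_emb @ text_emb.t()
    targets = torch.arange(logits.size(0))
    loss = nn.functional.cross_entropy(logits, targets)
    if not torch.isfinite(loss):
        # Skip the update so NaN/Inf never reaches the weights.
        return None
    loss.backward()
    # Cap the gradient norm to limit the size of any single update.
    torch.nn.utils.clip_grad_norm_(video_head.parameters(), max_grad_norm)
    optimizer.step()
    return loss.item()


# Random tensors in place of real video features and text embeddings.
loss_value = train_step(torch.randn(8, 1024), torch.randn(8, 512))
```

Logging the step at which `train_step` first returns `None` can help tell apart a bad batch (NaN already in the inputs) from diverging weights.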