dingkai163 closed this issue 1 year ago
Hi, please try to finetune the best model obtained at the first stage.
Much appreciated! But the highest validation accuracy I got in the first stage is 39.47 (at the 6th epoch), and all of the second-stage fine-tuning results are lower than the first-stage accuracy, which differs from what you described. (Experiment on MSVD-QA.)
I'm not sure what the problem is here; please try tuning the learning rate, e.g., 1e-5 or 5e-5.
Well, many thanks! I have achieved the desired result on the MSVD dataset.
Now I have some problems experimenting on NExT-QA. According to the NExT-QA video features you provided (region_feature shape: (3870, 16, 4, 20, 2048)), each video has K=16 clips. Do I have to make the following change in the videoqa.py file?
#num_clip, num_frame, num_bbox = 8, 8*4, 10
num_clip, num_frame, num_bbox = 16, 16*4, 20
At the same time, I set batch_size = 32, max_qa_length = 37, and bbox_num = 20, but with these settings the performance of the trained model dropped by 1 percentage point. Am I missing something?
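The constants discussed above have to stay mutually consistent with the region-feature shape. A small sanity-check sketch (constant names mirror the videoqa.py snippet quoted above; the array here is a tiny dummy standing in for the loaded HDF5 data, with 2 videos instead of 3870):

```python
import numpy as np

# NExT-QA setting from this thread: 16 clips, 4 frames per clip, 20 boxes.
num_clip, num_frame, num_bbox = 16, 16 * 4, 20
feat_dim = 2048

# Dummy region features: (videos, clips, frames_per_clip, bboxes, dim).
region_feat = np.zeros((2, num_clip, num_frame // num_clip, num_bbox, feat_dim))

n_videos, clips, frames_per_clip, bboxes, dim = region_feat.shape
assert clips == num_clip                    # 16 clips per video
assert clips * frames_per_clip == num_frame  # 16 * 4 = 64 frames
assert bboxes == num_bbox                   # 20 boxes per frame
```

If any of these assertions fails after swapping in a new feature file, the constants in videoqa.py are the first thing to revisit.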
Hi, your settings seem correct. How about a batch size of 64?
Thank you very much for your help. After multiple experiments, I have achieved the expected results on the NExT-QA dataset. Now I am facing many problems and difficulties with video feature extraction on the MSRVTT dataset. Could you provide me with the modified code for feature extraction (region, appearance, and motion features)? Thank you again for your selfless sharing!
Thank you for your prompt response! In the MSRVTT features you provided, the "clip" dimension of the region features is 8, while for the appearance and motion features it is 16. This inconsistency may cause errors during training.
Load ../data/msrvtt//region_feat_n/acregion_8c10b_train.h5...
(6513, 8, 4, 10, 2048)
Load ../data/msrvtt//frame_feat/app_feat_train.h5...
(6513, 16, 4, 2048)
Load ../data/msrvtt//mot_feat/mot_feat_train.h5...
(6513, 16, 2048)
Additionally, the height and width of the videos in the MSRVTT dataset are not defined in the CSV file. What should they be set to?
Hi, all videos in MSRVTT have the same frame size (width x height: 320 x 240). Please keep all clip dimensions at 8 (subsample with [::2]) to run the code. I can only find this version of the region features at the moment (borrowed from VGT). The difference should be small, as this dataset relies little on temporal information for answering.
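The [::2] suggestion above amounts to taking every second clip from the appearance and motion features so they align with the 8-clip region features. A sketch (the real shapes are (6513, 16, 4, 2048) and (6513, 16, 2048) per the load log above; tiny dummy arrays are used here):

```python
import numpy as np

# Dummy MSRVTT features: appearance (videos, clips, frames, dim)
# and motion (videos, clips, dim), both with 16 clips.
app_feat = np.zeros((4, 16, 4, 8))
mot_feat = np.zeros((4, 16, 8))

# Keep every second clip (indices 0, 2, ..., 14) -> 8 clips,
# matching the region features' clip dimension.
app_feat = app_feat[:, ::2]
mot_feat = mot_feat[:, ::2]

print(app_feat.shape)  # (4, 8, 4, 8)
print(mot_feat.shape)  # (4, 8, 8)
```

Note the slice is applied on axis 1 (the clip axis), not axis 0, so the number of videos is unchanged.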
Sincere thanks for your help and sharing! I have achieved the expected results.
Hi! The model I trained using your code reaches 40.95, but the model you provided reaches 41.15. I would like to know why. In addition, your model seems to have been obtained at the second epoch, but your code only saves models from the third epoch onward. I am also very confused about this. I look forward to your reply, thank you very much!