doc-doc / HQGA

Video as Conditional Graph Hierarchy for Multi-Granular Question Answering (AAAI'22, Oral)
MIT License

Model Performance Related Issues #14

Closed dingkai163 closed 1 year ago

dingkai163 commented 1 year ago

Hi! The model I trained using your code reaches 40.95, but the model you provide reaches 41.15. What could explain the gap? In addition, your released model seems to come from the second epoch, yet your code only saves models from the third epoch onward, which also confuses me. I look forward to your reply, thank you very much!

doc-doc commented 1 year ago

Hi, please try to fine-tune the best model obtained in the first stage.

dingkai163 commented 1 year ago

Much appreciated! However, the highest validation-set accuracy I got in the first stage is 39.47 (at the 6th epoch), and after second-stage fine-tuning the accuracy is consistently lower than in the first stage, which is not what you described. (This experiment is on MSVD-QA.)

doc-doc commented 1 year ago

I'm not sure what the problem is. Please try tuning the learning rate, e.g., 1e-5 or 5e-5.

dingkai163 commented 1 year ago

Well, many thanks! I have achieved the desired result on the MSVD dataset.

dingkai163 commented 1 year ago

Now I have some problems experimenting on NExT-QA. Given the NExT-QA video features you provided (region_feature shape: (3870, 16, 4, 20, 2048)), each video has K=16 clips. Do I need to make the following change in the videoqa.py file?

#num_clip, num_frame, num_bbox = 8, 8*4, 10
num_clip, num_frame, num_bbox = 16, 16*4, 20

At the same time, I set batch_size = 32, max_qa_length = 37, and bbox_num = 20, but with these settings the performance of the trained model dropped by 1 percentage point. Is there anything I'm missing?
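
For reference, the correspondence between the loaded feature shape and those constants can be checked with a small sketch (using NumPy and a tiny stand-in array; the variable names here are illustrative, not the repo's actual loader code):

```python
import numpy as np

# Tiny stand-in for the NExT-QA region features; the real file has shape
# (3870, 16, 4, 20, 2048) = (videos, clips, frames_per_clip, bboxes, dim).
region = np.zeros((2, 16, 4, 20, 2048), dtype=np.float32)

# Constants as changed above (the MSVD defaults were 8, 8*4, 10).
num_clip, num_frame, num_bbox = 16, 16 * 4, 20

assert region.shape[1] == num_clip                     # 16 clips per video
assert region.shape[1] * region.shape[2] == num_frame  # 64 frames in total
assert region.shape[3] == num_bbox                     # 20 boxes per frame
```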

doc-doc commented 1 year ago

Hi, your settings seem correct. How about a batch size of 64?

dingkai163 commented 1 year ago

Thank you very much for your help. After multiple experiments, I have achieved the expected results on the NExT-QA dataset. Now I am facing many problems and difficulties with video feature extraction on the MSRVTT dataset, so I would like to ask whether you could provide the modified code for feature extraction (region, appearance, and motion features)? Thank you again for your selfless sharing!

doc-doc commented 1 year ago

Hi, you can find the feature for MSRVTT-QA here.

dingkai163 commented 1 year ago

Thank you for your prompt response! In the MSRVTT features you provided, the clip dimension of the region features is 8, while for the appearance and motion features it is 16. This inconsistency may cause some errors during training.

Load ../data/msrvtt//region_feat_n/acregion_8c10b_train.h5...
(6513, 8, 4, 10, 2048)
Load ../data/msrvtt//frame_feat/app_feat_train.h5...
(6513, 16, 4, 2048)
Load ../data/msrvtt//mot_feat/mot_feat_train.h5...
(6513, 16, 2048)

Additionally, the height and width of the videos in the MSRVTT dataset are not defined in the CSV file. What should they be set to?

doc-doc commented 1 year ago

Hi, all videos in MSRVTT have the same frame size (width x height: 320 x 240). Please keep the clip dimension at 8 (e.g., by subsampling with [::2]) to run the code. I can only find this version of the region features at the moment (borrowed from VGT). The difference should be small, as this dataset relies little on temporal information for answering.
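
To illustrate, aligning the 16-clip appearance and motion features with the 8-clip region features by taking every other clip could look like the following (a minimal NumPy sketch with small stand-in arrays, not the repo's actual loader code; the real first dimension is 6513 videos):

```python
import numpy as np

# Small stand-ins for the three MSRVTT feature tensors reported above.
region = np.zeros((4, 8, 4, 10, 2048), dtype=np.float32)  # 8 clips
app = np.zeros((4, 16, 4, 2048), dtype=np.float32)        # 16 clips
mot = np.zeros((4, 16, 2048), dtype=np.float32)           # 16 clips

# Keep every other clip so all features share a clip dimension of 8.
app = app[:, ::2]
mot = mot[:, ::2]

assert app.shape[1] == mot.shape[1] == region.shape[1] == 8
```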

dingkai163 commented 1 year ago

Sincere thanks for your help and sharing! I have achieved the expected results.