boheumd / MA-LMM

(2024CVPR) MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding
https://boheumd.github.io/MA-LMM/
MIT License
214 stars 26 forks

Has anyone been able to reproduce the MSVD-QA result reported in the paper (top-1 accuracy of 60%)? #27

Open lxrrrrrr opened 2 months ago

lxrrrrrr commented 2 months ago

I used an A800 and changed the batch size to 32. The other parameters are consistent with the appendix of the paper. Why can I only reach 53%?

lxrrrrrr commented 2 months ago

I feel like I might have missed something somewhere. Let me take a closer look.

hulianyuyy commented 2 months ago

I can reach ~60% accuracy on MSVD.

hulianyuyy commented 2 months ago

But I can only get ~42% on MSRVTT.

lxrrrrrr commented 2 months ago

I think part of the reason is the way the dataset is processed. Are you using the annotations provided by the author?

hulianyuyy commented 2 months ago

Yes, I use the annotations provided by the author. Maybe the problem is related to this.

lxrrrrrr commented 1 month ago

Many thanks for your reply. I processed the data according to the code you provided and re-downloaded the MSVD dataset using the download_scripts in the code, but I can't use the annotations provided by the author: many entries have a data length mismatch, which raises an error. May I ask how you dealt with this? Looking forward to your reply.

hulianyuyy commented 1 month ago

You can simply reduce the total num_frames by 1 or 2 in dataset.py for each dataset.
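
To illustrate why trimming num_frames by 1 or 2 avoids the mismatch, here is a minimal sketch of a related alternative: clamping the requested frame count to what was actually extracted instead of hard-coding a smaller value. It does not reproduce the repository's dataset.py; the function name `sample_frame_indices` and its parameters are hypothetical.

```python
def sample_frame_indices(available_frames: int, num_frames: int) -> list:
    """Pick up to num_frames evenly spaced indices in [0, available_frames).

    Hypothetical helper, not taken from MA-LMM. It never returns an index
    beyond the frames that exist on disk, so an annotation that overstates
    the video length by a frame or two cannot trigger an out-of-range read.
    """
    if available_frames <= 0 or num_frames <= 0:
        return []
    n = min(num_frames, available_frames)  # clamp to what was extracted
    step = available_frames / n
    return [int(i * step) for i in range(n)]
```

For example, if the annotation says 10 frames but only 8 were extracted, calling `sample_frame_indices(8, 10)` returns the 8 valid indices rather than failing on the missing ones.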

boheumd commented 1 month ago

Please follow the approach in https://github.com/boheumd/MA-LMM/issues/3#issuecomment-2053855973 and update the "frame_length" field in the annotation file to the number of frames you actually extracted for each video.
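
A minimal sketch of that update, under several assumptions that should be checked against your own data: the annotation file is a JSON list of dicts, each entry has a "video" key naming its frame folder, and extracted frames are stored as .jpg files under `frames_root/<video>/`. The helper names below are hypothetical and are not part of the MA-LMM code.

```python
import json
from pathlib import Path


def count_frames(frame_dir: Path, suffix: str = ".jpg") -> int:
    """Count extracted frame files in one video's frame directory."""
    return sum(1 for p in frame_dir.iterdir() if p.suffix.lower() == suffix)


def sync_frame_length(annotations: list, frames_root: Path) -> int:
    """Set each entry's "frame_length" to the number of frames on disk.

    Returns how many entries were changed. The "video" key and the
    one-folder-per-video layout are assumptions, not confirmed by the repo.
    """
    changed = 0
    for entry in annotations:
        frame_dir = frames_root / str(entry["video"])
        if not frame_dir.is_dir():
            continue  # no extracted frames found; leave the entry untouched
        actual = count_frames(frame_dir)
        if entry.get("frame_length") != actual:
            entry["frame_length"] = actual
            changed += 1
    return changed


def update_annotation_file(ann_path: str, frames_root: str) -> int:
    """Load an annotation JSON, fix "frame_length" in place, and save it."""
    path = Path(ann_path)
    annotations = json.loads(path.read_text())
    changed = sync_frame_length(annotations, Path(frames_root))
    path.write_text(json.dumps(annotations))
    return changed
```

Back up the original annotation file first, then call `update_annotation_file` once per split with the paths used in your setup.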