boheumd / MA-LMM

(2024CVPR) MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding
https://boheumd.github.io/MA-LMM/
MIT License

Inference result on the breakfast dataset #4

Closed: joslefaure closed this issue 2 months ago

joslefaure commented 2 months ago

Dear authors, thank you very much for open-sourcing the code for the paper; it will really benefit the community. I downloaded the Breakfast dataset from the official website, extracted the frames with ffmpeg at 10 fps, downloaded the model you provided, and then ran the test file. I consistently get 81 for top-1 accuracy and 95 for top-5, which is way off from the reported result. Is there anything you think might cause this discrepancy? Thank you.

boheumd commented 2 months ago

Hi. It may help to first check the number of frames extracted from each video and make sure it matches the count provided in the Breakfast annotation file; a significant difference could be the source of the problem. Also consider which version of FFmpeg you used to extract the frames. In some cases the extracted frames are noticeably lower in quality than the original videos, which can hurt accuracy.
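
The frame-count check suggested above could be scripted roughly as follows. This is only a sketch under assumptions: it expects one sub-directory of image files per video, and it takes the expected counts as a plain dictionary because the layout of the Breakfast annotation file is not described in this thread. The names `count_frames`, `find_mismatches`, and `tolerance` are hypothetical and are not part of the MA-LMM code.

```python
from pathlib import Path


def count_frames(frame_dir, suffixes=(".jpg", ".jpeg", ".png")):
    """Count the image files that sit directly inside frame_dir."""
    return sum(
        1
        for p in Path(frame_dir).iterdir()
        if p.is_file() and p.suffix.lower() in suffixes
    )


def find_mismatches(expected_counts, frames_root, tolerance=0):
    """Return (video_id, expected, found) for every video whose extracted
    frame count differs from the expected count by more than `tolerance`.

    expected_counts: dict mapping a video id to its expected frame count
                     (assumed to be loaded from the annotation file by the caller).
    frames_root:     directory with one sub-directory of frames per video id
                     (an assumed layout, adjust to your own).
    """
    mismatches = []
    for video_id, expected in sorted(expected_counts.items()):
        frame_dir = Path(frames_root) / video_id
        # A missing directory is reported as zero extracted frames.
        found = count_frames(frame_dir) if frame_dir.is_dir() else 0
        if abs(found - expected) > tolerance:
            mismatches.append((video_id, expected, found))
    return mismatches
```

Calling `find_mismatches(expected_counts, "frames", tolerance=2)` would list only the videos that are off by more than two frames, which separates small off-by-a-few differences from videos where extraction clearly went wrong.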

joslefaure commented 2 months ago

Thanks for your reply. The issue was with image quality, as you suspected. After saving the frames at the highest quality, the results improved considerably (top-1 accuracy = 91%). The FFmpeg-extracted frames are still short by 2 for most videos, but that's fine.
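
For readers who hit the same problem, here is a minimal sketch of how frames might be extracted at 10 fps with a high JPEG quality setting. The exact command used in this thread was not posted, so the flags are an assumption: `-vf fps=10` resamples to 10 frames per second, and `-q:v` sets the JPEG quantizer scale, where lower values mean higher quality. The helper names and the `frame%06d.jpg` naming pattern are hypothetical; check what the repository's data loader expects.

```python
import subprocess
from pathlib import Path


def build_ffmpeg_cmd(video_path, out_dir, fps=10, jpeg_quality=2):
    """Build an ffmpeg command that writes JPEG frames at `fps`.

    jpeg_quality is passed to -q:v; lower values mean higher quality.
    The output file name pattern is an assumption, not the repo's convention.
    """
    out_pattern = str(Path(out_dir) / "frame%06d.jpg")
    return [
        "ffmpeg",
        "-i", str(video_path),
        "-vf", f"fps={fps}",
        "-q:v", str(jpeg_quality),
        out_pattern,
    ]


def extract_frames(video_path, out_dir, fps=10, jpeg_quality=2):
    """Create out_dir if needed and run ffmpeg; raises if ffmpeg fails."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    cmd = build_ffmpeg_cmd(video_path, out_dir, fps, jpeg_quality)
    subprocess.run(cmd, check=True)
```

When `-q:v` is omitted, ffmpeg falls back to its default JPEG quality, which is visibly more compressed; that would be consistent with the accuracy gap reported above, although the thread does not confirm which setting was originally used.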