boheumd / MA-LMM

(2024CVPR) MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding
https://boheumd.github.io/MA-LMM/
MIT License

How to generate prediction result of a whole video in 'lvu_cls'? #24

Open pilibb0712 opened 2 months ago

pilibb0712 commented 2 months ago

Hi, thank you for your awesome work! I have a question about the final prediction results for lvu_cls. In your code, the evaluation appears to be based on per-image predictions, each identified by the 'image_id' key in the result file. When there are multiple image predictions for the same video, how can I aggregate them to obtain a single prediction for the whole video?

boheumd commented 2 months ago

Hi. The image_id is actually the clip_id used during evaluation. At test time, we report the accuracy averaged over video clips, which are extracted from the original videos with a fixed sampling stride (= 20), rather than the accuracy at the whole-video level.
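For anyone who still wants a video-level number, one possible approach is a majority vote over the clip predictions that belong to the same video. The snippet below is only a minimal sketch and not part of the MA-LMM evaluation code: the 'pred' key, the `clip_to_video` mapping, and the function name `aggregate_by_video` are all assumptions made for illustration, and only 'image_id' comes from the result file discussed above.

```python
from collections import Counter, defaultdict


def aggregate_by_video(clip_results, clip_to_video):
    """Majority-vote clip-level predictions into one label per video.

    clip_results: list of dicts with 'image_id' (the clip id) and a
        hypothetical 'pred' key holding the predicted label.
    clip_to_video: hypothetical dict mapping each clip id to its video id.
    """
    votes = defaultdict(Counter)
    for result in clip_results:
        video_id = clip_to_video[result["image_id"]]
        votes[video_id][result["pred"]] += 1
    # most_common(1) returns the label with the highest count; ties go to
    # the label that was seen first for that video.
    return {vid: counts.most_common(1)[0][0] for vid, counts in votes.items()}


# Toy data for illustration only (not real LVU outputs).
clip_results = [
    {"image_id": "c0", "pred": "comedy"},
    {"image_id": "c1", "pred": "drama"},
    {"image_id": "c2", "pred": "comedy"},
    {"image_id": "c3", "pred": "action"},
]
clip_to_video = {"c0": "v0", "c1": "v0", "c2": "v0", "c3": "v1"}

video_preds = aggregate_by_video(clip_results, clip_to_video)
print(video_preds)
```

Note that accuracy computed this way is not comparable to the clip-level average described above, since each video counts once regardless of how many clips it contributes.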