boheumd / A2Summ

The official implementation of 'Align and Attend: Multimodal Summarization with Dual Contrastive Losses' (CVPR 2023)
https://boheumd.github.io/A2Summ/

F-score results #7

Closed zhanzhuxi closed 1 year ago

zhanzhuxi commented 1 year ago

Thanks for sharing your open-source code! I ran the project and obtained F-scores on the SumMe and TVSum datasets, but the results differ from those in the paper. Here is what I got.

For SumMe:

```
F1_results: {'split0': 0.4879799819599918, 'split1': 0.5798566120133917, 'split2': 0.5529651338830718, 'split3': 0.5264276927288803, 'split4': 0.4501540520029801}
F1-score: 0.5195
```

For TVSum:

```
F1_results: {'split0': 0.6302811067187952, 'split1': 0.5925235378291761, 'split2': 0.6476801856701864, 'split3': 0.6284219094681587, 'split4': 0.6346191839437079}
F1-score: 0.6267
```
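For what it's worth, the averaged F1-scores follow directly from the per-split numbers; a quick sanity check in plain Python:

```python
# Per-split F1 scores copied from the runs above.
summe = [0.4879799819599918, 0.5798566120133917, 0.5529651338830718,
         0.5264276927288803, 0.4501540520029801]
tvsum = [0.6302811067187952, 0.5925235378291761, 0.6476801856701864,
         0.6284219094681587, 0.6346191839437079]

# The reported F1-score is the mean over the five splits.
print(f"SumMe: {sum(summe) / len(summe):.4f}")  # 0.5195
print(f"TVSum: {sum(tvsum) / len(tvsum):.4f}")  # 0.6267
```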

The results on the SumMe dataset are quite different from those in the paper. I did not modify any parameters in the code except setting num_workers to 0 to work around a runtime error, so the settings I ran with are inconsistent with those in the readme file.
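For context, num_workers is the standard PyTorch DataLoader argument; a minimal sketch of the change, using a dummy dataset as a stand-in for the repo's actual dataset class:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Dummy dataset; a stand-in for the repo's actual dataset class.
dataset = TensorDataset(torch.randn(8, 4))

# num_workers=0 loads batches in the main process, which avoids the
# multiprocessing-related errors that worker subprocesses can trigger
# on some platforms.
loader = DataLoader(dataset, batch_size=2, shuffle=True, num_workers=0)

for (batch,) in loader:
    print(batch.shape)  # torch.Size([2, 4])
```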

boheumd commented 1 year ago

We appreciate your interest in our work. We have noticed that there can be significant variations in F1 scores on SumMe/TVSum datasets when different versions of PyTorch/CUDA/GPU are used.
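When comparing runs across machines, it helps to record the exact environment; these are standard PyTorch introspection calls:

```python
import torch

# Versions that most often account for run-to-run F1 differences.
print("PyTorch:", torch.__version__)
print("CUDA:   ", torch.version.cuda)
if torch.cuda.is_available():
    print("GPU:   ", torch.cuda.get_device_name(0))
    print("cuDNN: ", torch.backends.cudnn.version())
```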

Our hypothesis is that these datasets are very sensitive to model initialization due to their small size, with only 25 and 50 videos in SumMe and TVSum, respectively. Tuning the model for optimal performance under a different environment may take some time. However, we provide saved checkpoints that you can load and evaluate directly.
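A minimal sketch of the checkpoint-based evaluation workflow; the model, path, and input here are placeholders, not the repo's actual API:

```python
import torch
import torch.nn as nn

# Tiny stand-in model; replace with the repo's actual model class.
model = nn.Sequential(nn.Linear(16, 1))
torch.save(model.state_dict(), "checkpoint.pt")  # placeholder path

# Evaluating from a fixed checkpoint removes initialization variance
# from the comparison entirely.
state = torch.load("checkpoint.pt", map_location="cpu")
model.load_state_dict(state)
model.eval()  # disable dropout / freeze batch-norm statistics

with torch.no_grad():
    scores = model(torch.randn(4, 16))  # dummy features as a stand-in
print(scores.shape)  # torch.Size([4, 1])
```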

If possible, we suggest trying our model on our BLiSS or Daily_Mail datasets, which are much larger and therefore give more consistent and stable results.