Closed by zhanzhuxi 1 year ago
We appreciate your interest in our work. We have noticed that F1 scores on the SumMe/TVSum datasets can vary significantly across different versions of PyTorch/CUDA/GPU.
Our hypothesis is that these datasets are very sensitive to model initialization due to their small size: SumMe and TVSum contain only 25 and 50 videos, respectively. Re-tuning the model for optimal performance in a different environment may take some time; however, we provide saved checkpoints that you can use to evaluate and reproduce the reported results.
If possible, we suggest trying our model on our BLiSS or Daily_Mail datasets, which are much larger and therefore give more consistent and stable results.
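Since the reply attributes the variance to initialization sensitivity, fixing every random seed before training is the usual first mitigation. Below is a minimal sketch of such a helper; the function name `set_seed` is my own, and the torch-specific calls are standard PyTorch reproducibility settings (guarded so the snippet also runs where torch is not installed):

```python
import random

import numpy as np


def set_seed(seed: int) -> None:
    """Seed every RNG the training pipeline may touch."""
    random.seed(seed)
    np.random.seed(seed)
    try:
        import torch
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)
        # Trade some speed for deterministic cuDNN kernels.
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False
    except ImportError:
        pass  # torch not available in this environment


set_seed(42)
```

Note that even with all seeds fixed, some CUDA kernels are nondeterministic across GPU models and driver versions, so results may still differ slightly between machines.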
Thanks for sharing your open-source code! I have run the project and obtained F-scores on the SumMe and TVSum datasets, but the results seem to differ from those in the paper. The following are the results I got.
For SumMe: F1_results: {'split0': 0.4879799819599918, 'split1': 0.5798566120133917, 'split2': 0.5529651338830718, 'split3': 0.5264276927288803, 'split4': 0.4501540520029801} F1-score: 0.5195
For TVSum: F1_results: {'split0': 0.6302811067187952, 'split1': 0.5925235378291761, 'split2': 0.6476801856701864, 'split3': 0.6284219094681587, 'split4': 0.6346191839437079} F1-score: 0.6267
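The reported F1-scores appear to be the plain mean over the five cross-validation splits; a quick sanity check (variable and function names are my own):

```python
# Per-split F1 scores copied from the runs above.
summe = [0.4879799819599918, 0.5798566120133917, 0.5529651338830718,
         0.5264276927288803, 0.4501540520029801]
tvsum = [0.6302811067187952, 0.5925235378291761, 0.6476801856701864,
         0.6284219094681587, 0.6346191839437079]


def mean_f1(scores):
    """Average F1 over the splits, rounded to 4 decimal places."""
    return round(sum(scores) / len(scores), 4)


print(mean_f1(summe))  # 0.5195
print(mean_f1(tvsum))  # 0.6267
```

Both averages match the aggregate F1-scores quoted above, so the discrepancy is in the per-split results themselves, not in the aggregation.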
The results on the SumMe dataset are quite different from those in the paper. I did not modify any parameters in the code except setting num_workers to 0, which was necessary to avoid a runtime error. I ran the experiment with the following settings, which differ from those in the README file.