boheumd / A2Summ

The official implementation of 'Align and Attend: Multimodal Summarization with Dual Contrastive Losses' (CVPR 2023)
https://boheumd.github.io/A2Summ/

Doubts about data sources for text modal #18

Open XYI-xue opened 5 months ago

XYI-xue commented 5 months ago

Hi author, I would like to ask how you obtained the transcribed text corresponding to the videos in the SumMe and TVSum datasets. Was it created manually, or did you use an existing model? I am very much looking forward to your answer.

boheumd commented 5 months ago

Hi, you can refer to the implementation details section in the main paper. For the SumMe and TVSum datasets, we adopt the pre-trained image captioning model GPT-2 to generate a caption for each frame.
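
For readers looking to reproduce this preprocessing step, below is a minimal sketch of generating per-frame captions with a publicly available ViT encoder + GPT-2 decoder captioning checkpoint from Hugging Face. The specific checkpoint (`nlpconnect/vit-gpt2-image-captioning`), the frame-sampling stride, and the example video path are assumptions for illustration only; the exact captioning setup used in the paper is not specified in this thread.

```python
# Illustrative sketch: caption sampled video frames with a ViT+GPT-2 model.
# The checkpoint below is an assumption, not necessarily the one used in A2Summ.
import cv2
from PIL import Image
from transformers import pipeline

captioner = pipeline("image-to-text", model="nlpconnect/vit-gpt2-image-captioning")

def caption_video_frames(video_path, stride=15):
    """Sample every `stride`-th frame and return one caption per sampled frame."""
    cap = cv2.VideoCapture(video_path)
    captions = []
    idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % stride == 0:
            # OpenCV decodes frames as BGR; convert to RGB before captioning.
            image = Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
            result = captioner(image)
            captions.append(result[0]["generated_text"].strip())
        idx += 1
    cap.release()
    return captions

# Hypothetical usage:
# print(caption_video_frames("Air_Force_One.mp4", stride=30))
```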

XYI-xue commented 5 months ago

> Hi, you can refer to the implementation details section in the main paper. For the SumMe and TVSum datasets, we adopt the pre-trained image captioning model GPT-2 to generate a caption for each frame.

Thank you very much for your reply! I would also like to ask how you obtained the original video data in the dataset. I looked at the dataset files you provided and found that they only contain the extracted feature values for each video. Could you please provide a link or file from which I can obtain the original, playable videos? Thanks for your help, and looking forward to your reply. (☆▽☆)