boheumd / A2Summ

The official implementation of 'Align and Attend: Multimodal Summarization with Dual Contrastive Losses' (CVPR 2023)
https://boheumd.github.io/A2Summ/

Inference Code #8

Open purbayankar opened 1 year ago

purbayankar commented 1 year ago

Thanks for this great work, and congratulations on the paper being accepted at CVPR 2023. Could you please provide inference code for a single video? It would be extremely helpful.

Pwoer-zy commented 1 year ago

Hello, I would like this as well. Could you share it? Thank you so much!

boheumd commented 1 year ago

Hello. Thanks for your interest in our work. At the moment, our codebase only supports offline feature extraction, and the model runs on pre-extracted video and text features. To enable inference on a single video, online feature extraction would need to be implemented. However, I am currently busy with other projects, so I won't be able to implement this immediately. I will let you know if I have any updates. Thank you!
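For anyone who needs a stopgap in the meantime, below is a minimal, hypothetical sketch of online feature extraction for a single video. It assumes a GoogLeNet backbone (a common choice for TVSum/SumMe-style frame features); the extractor actually used to produce the released A2Summ features may differ, and the function name and sampling rate here are illustrative, so treat this as a starting point rather than the official pipeline.

```python
# Hypothetical sketch: extract per-frame features from one video so they can
# be fed to a model that expects pre-extracted features.
# Assumption: GoogLeNet pool5 (1024-d) features; A2Summ's actual extractor may differ.
import cv2
import torch
import torchvision.models as models
import torchvision.transforms as T

def extract_video_features(video_path, fps_sample=2, device="cpu"):
    # Load an ImageNet-pretrained GoogLeNet and drop the classifier head,
    # leaving the 1024-d pooled feature as the output.
    backbone = models.googlenet(weights=models.GoogLeNet_Weights.IMAGENET1K_V1)
    backbone.fc = torch.nn.Identity()
    backbone.eval().to(device)

    preprocess = T.Compose([
        T.ToPILImage(),
        T.Resize(256),
        T.CenterCrop(224),
        T.ToTensor(),
        T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    ])

    cap = cv2.VideoCapture(video_path)
    video_fps = cap.get(cv2.CAP_PROP_FPS) or 30  # fall back if FPS is unavailable
    step = max(int(round(video_fps / fps_sample)), 1)

    feats, idx = [], 0
    with torch.no_grad():
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            if idx % step == 0:
                rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
                x = preprocess(rgb).unsqueeze(0).to(device)
                feats.append(backbone(x).squeeze(0).cpu())
            idx += 1
    cap.release()
    return torch.stack(feats)  # (num_sampled_frames, 1024)
```

The resulting tensor would then take the place of the pre-extracted video features the model normally loads from disk; text features for the multimodal input would need an analogous online extraction step.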

Pwoer-zy commented 1 year ago

Hello, I would also like to ask about your paper: how is the summary result chart for the TVSum dataset drawn? There are also some video visualization results for the BLISS dataset. Looking forward to your reply!

boheumd commented 1 year ago

Hello. For the visualization on TVSum, the GT row shows the ground-truth annotated importance score for each frame. The baseline and ours rows show the predicted video summaries, which you can obtain from the evaluation stage (https://github.com/boheumd/A2Summ/blob/main/train_videosumm.py#L230). For the visualization on the BLISS dataset, following a similar procedure, the baseline and ours rows show the predicted text summaries, and the temporal order follows the timestamp of each text sentence.
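As a rough illustration of how such a chart could be produced, here is a hypothetical matplotlib sketch: ground-truth importance scores drawn as bars, with the frames selected by the predicted 0/1 summary mask (obtained from the evaluation stage linked above) highlighted. The function and variable names are illustrative, not from the A2Summ codebase.

```python
# Hypothetical sketch of a TVSum-style visualization: ground-truth frame
# importance as gray bars, with frames in the predicted summary highlighted.
import numpy as np
import matplotlib.pyplot as plt

def plot_summary(gt_scores, pred_summary, title="TVSum video"):
    # gt_scores:    (num_frames,) float ground-truth importance per frame
    # pred_summary: (num_frames,) 0/1 mask of frames selected by the model
    frames = np.arange(len(gt_scores))
    selected = pred_summary > 0

    fig, ax = plt.subplots(figsize=(10, 2.5))
    ax.bar(frames, gt_scores, width=1.0, color="lightgray",
           label="GT importance")
    ax.bar(frames[selected], gt_scores[selected], width=1.0,
           color="tab:orange", label="predicted summary")
    ax.set_xlabel("frame index")
    ax.set_ylabel("importance")
    ax.set_title(title)
    ax.legend(loc="upper right")
    plt.tight_layout()
    plt.show()
```

Plotting the baseline's mask and the model's mask with the same function, one above the other, reproduces the side-by-side comparison style shown in the paper's figures.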
