antoyang / VidChapters

[NeurIPS 2023 D&B] VidChapters-7M: Video Chapters at Scale
http://arxiv.org/abs/2309.13952
MIT License

Dense video captioning on ActivityNet Captions dataset? #5

Closed · tian1327 closed this issue 1 year ago

tian1327 commented 1 year ago

Great work, Antoine! In your previous paper, Vid2Seq, you also tested the pre-trained model on the ActivityNet Captions dataset, but in VidChapters you only report results on ViTT and YouCook2. I am wondering whether there is any particular reason for picking ViTT and YouCook2. Is it because ActivityNet Captions is larger than these two (i.e., longer training time), or because it contains more diverse activities, which makes it a harder dataset?

Thank you!

tian1327 commented 1 year ago

If you could share any preliminary results on ActivityNet, should you have them, that would be very helpful! Thank you!

antoyang commented 1 year ago

I did not try much on this project, but I expect a larger domain gap between ActivityNet and the pretraining dataset used here. However, it should be pretty straightforward to test on ActivityNet with this codebase.
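
For reference, here is a minimal sketch of how the ActivityNet Captions annotations could be flattened before adapting them to this codebase. The input follows the official `val_1.json` format released with the dataset; the output schema and file names are illustrative only and would need to match whatever the dense video captioning dataloaders here actually expect.

```python
import json

# Official ActivityNet Captions annotation file (e.g. val_1.json): maps each
# video ID to {"duration": float, "timestamps": [[start, end], ...],
# "sentences": [str, ...]}.
with open("val_1.json") as f:
    anet = json.load(f)

# Flatten into (video, segment, caption) records. This output schema is
# illustrative only; adapt it to the annotation format expected by the
# dataloaders in this repo.
records = []
for video_id, ann in anet.items():
    for (start, end), sentence in zip(ann["timestamps"], ann["sentences"]):
        records.append({
            "video_id": video_id,
            "duration": ann["duration"],
            "start": start,
            "end": end,
            "caption": sentence.strip(),
        })

with open("activitynet_dvc_annotations.json", "w") as f:
    json.dump(records, f)
```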

tian1327 commented 1 year ago

Thank you! Yes, it is easy to try. But the results in the paper show that finetuning on the downstream tasks significantly improves the dense video captioning scores over the zero-shot setting (Tables 7 and 8). Given that only the checkpoints finetuned on ViTT and YouCook2 are provided, I guess the best I can do is to take the checkpoint finetuned on ViTT and test it on ActivityNet, since ViTT is more diverse than YouCook2 (a rough sketch of that evaluation is below).
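
As a simplified illustration of that plan, here is a localization-recall check one could run on the ViTT-checkpoint predictions against the ActivityNet Captions ground truth. The file names and prediction schema are hypothetical, and this covers only the temporal-localization part of the dense video captioning metrics (the full evaluation also reports SODA and captioning scores such as CIDEr and METEOR).

```python
import json
import numpy as np

def iou_1d(pred, gt):
    """Temporal IoU between two [start, end] segments (in seconds)."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0

def recall_at_iou(predictions, references, thresholds=(0.3, 0.5, 0.7, 0.9)):
    """Fraction of ground-truth segments matched by at least one predicted
    segment, averaged over IoU thresholds (localization part of DVC metrics)."""
    scores = []
    for t in thresholds:
        hits = sum(
            any(iou_1d(p, gt) >= t for p in predictions) for gt in references
        )
        scores.append(hits / max(len(references), 1))
    return float(np.mean(scores))

# Hypothetical file layouts: predictions as {video_id: [[start, end], ...]}
# produced by the ViTT-finetuned model, references from the official
# ActivityNet Captions val_1.json ("timestamps" field).
preds = json.load(open("vitt_ckpt_activitynet_preds.json"))
refs = json.load(open("val_1.json"))
per_video = [
    recall_at_iou(preds[v], refs[v]["timestamps"]) for v in preds if v in refs
]
print("Average localization recall:", sum(per_video) / max(len(per_video), 1))
```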

Again, thank you for the great work!