Closed: tian1327 closed this issue 1 year ago.
Great work, Antoine! In your last paper, Vid2Seq, you also tested the pretrained model on the ActivityNet Captions dataset, but in VidChapters you only report results on ViTT and YouCook2. I am wondering whether there is a particular reason to pick ViTT and YouCook2. Is it because ActivityNet Captions is larger than these two (i.e., longer training time), or because it contains more diverse activities, which makes it a harder dataset?

Thank you!

If you could share some preliminary results on ActivityNet, should you have any, it would be very helpful! Thank you!
I did not try this in this project, but I would expect a higher domain gap between ActivityNet and the pretraining dataset used here. However, it should be pretty straightforward to test on ActivityNet from this codebase.
Thank you! Yes, it is easy to try. But the results in the paper show that finetuning on the downstream tasks significantly improves the dense video captioning scores over zero-shot (Tables 7 and 8). Given that only the checkpoints finetuned on ViTT and YouCook2 are provided, I guess the best I can do is to take the checkpoint finetuned on ViTT and test it on ActivityNet, since ViTT is more diverse than YouCook2.
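Concretely, I imagine the main data-side step would be converting the ActivityNet Captions annotations into whatever per-video event format the ViTT/YouCook2 loaders in this codebase expect. Here is a minimal sketch, assuming the standard ActivityNet Captions JSON files (e.g. `val_1.json`); the target schema below is my own guess, not the codebase's actual one, so the field names would need to be adapted to its loaders:

```python
import json

# ActivityNet Captions ships annotations as JSON of the form
#   {video_id: {"duration": float,
#               "timestamps": [[start, end], ...],
#               "sentences": [str, ...]}}
# The per-video record built below ("events" with start/end/sentence) is an
# assumed target schema mirroring typical dense-video-captioning loaders.

def load_activitynet_captions(path: str) -> dict:
    """Flatten ActivityNet Captions annotations into per-video event lists."""
    with open(path) as f:
        raw = json.load(f)
    data = {}
    for video_id, ann in raw.items():
        events = [
            {"start": float(s), "end": float(e), "sentence": sent.strip()}
            for (s, e), sent in zip(ann["timestamps"], ann["sentences"])
        ]
        data[video_id] = {"duration": float(ann["duration"]), "events": events}
    return data

if __name__ == "__main__":
    # val_1.json is one of the standard ActivityNet Captions validation files.
    anns = load_activitynet_captions("val_1.json")
    print(f"{len(anns)} videos loaded")
```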
Again, thank you for the great work!