antoyang / just-ask

[ICCV 2021 Oral + TPAMI] Just Ask: Learning to Answer Questions from Millions of Narrated Videos
https://arxiv.org/abs/2012.00451
Apache License 2.0

Pre-training / fine tuning #1

Closed Tortoise17 closed 2 years ago

Tortoise17 commented 3 years ago

Is it possible to use this tool on our own videos and dataset? If yes, in addition to the videos, what features are required for pre-training or fine-tuning? I assume from your readme that the HowTo100M feature extractor with mixture of experts, which this repository builds on, is used to extract/export the features in addition to the speech-to-text transcripts? Or correct me if I am wrong.

I want to test this system on my own videos to see how well it can handle explaining them, and to learn how I can train it on my own videos.

Please guide.

antoyang commented 3 years ago

From your message, it is unclear whether 1) you want to generate VideoQA annotations for your own narrated videos as in our method, or 2) you want to train a VideoQA model on your own videos.

If 1), you can follow the "HowToVQA69M generation" steps explained in the readme, and replace the HowTo100M annotations with the speech annotations corresponding to your own videos. In detail, you need a pickle file similar to the HowTo100M one: a dictionary mapping each video id to a dictionary mapping 'start' to the list of start times of the speech segments, 'end' to the list of end times of the speech segments, and 'text' to the list of speech segments as strings. After generating VideoQA annotations, you can jump to 2).
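To make that format concrete, here is a minimal sketch of how such a pickle file could be built for your own videos. The video ids, timestamps, transcripts, and output filename below are placeholders for illustration, not names taken from the repository:

```python
import pickle

# Hypothetical speech annotations for two of your own videos.
# Each video id maps to a dictionary with parallel lists:
#   'start' - start time (in seconds) of each speech segment
#   'end'   - end time (in seconds) of each speech segment
#   'text'  - transcribed speech of each segment, as strings
caption = {
    "my_video_001": {
        "start": [0.0, 5.2, 12.8],
        "end":   [5.2, 12.8, 20.1],
        "text": [
            "first we gather all the parts",
            "now attach the second panel",
            "finally tighten every screw",
        ],
    },
    "my_video_002": {
        "start": [1.0, 7.5],
        "end":   [7.5, 15.0],
        "text": [
            "welcome back to the channel",
            "today we prepare a simple pasta dish",
        ],
    },
}

# Save in the same pickle format as the HowTo100M speech annotations,
# so it can be plugged into the "HowToVQA69M generation" steps.
with open("my_videos_caption.pickle", "wb") as f:
    pickle.dump(caption, f)
```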

If 2), to train our VideoQA model, you need to extract S3D features for your own videos as explained in the "Extract video features" section of the readme. The code in our repository should be sufficient for the extraction if you download the S3D model weights as explained in the readme. Then adapting the remainder of the training code to your own dataset should be straightforward: you can simply prepare csv files for your VideoQA dataset, similar to those of the VideoQA datasets used in this repository.
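As a rough sketch of that last step, the snippet below builds such a csv with pandas. The column names (video_id, question, answer) and the filename are assumptions for illustration; adjust them to match the columns of the VideoQA csv files actually shipped with the repository:

```python
import pandas as pd

# Hypothetical question-answer annotations for your own videos.
# Column names are placeholders: mirror the csv files of the VideoQA
# datasets already supported by the repository.
rows = [
    {"video_id": "my_video_001",
     "question": "what is the person assembling?",
     "answer": "a shelf"},
    {"video_id": "my_video_002",
     "question": "what dish is being prepared?",
     "answer": "pasta"},
]

df = pd.DataFrame(rows)

# One csv per split (train/val/test), following the organization of the
# existing VideoQA datasets in the repository.
df.to_csv("my_videoqa_train.csv", index=False)
```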

Hope it helps,
Antoine Yang

Tortoise17 commented 3 years ago

Thank you so much. Your work is really interesting, and thank you for answering. I am trying to pursue both possibilities 1 and 2, so thank you for explaining both. I will try them and will reach out to you again if I get stuck.