OpenGVLab / unmasked_teacher

[ICCV2023 Oral] Unmasked Teacher: Towards Training-Efficient Video Foundation Models
https://arxiv.org/abs/2303.16058
MIT License

Question about training MSVD. #14

Closed ikodoh closed 8 months ago

ikodoh commented 9 months ago

Thank you for sharing this great work.

I'm trying to reproduce the MSVD retrieval results, but I have a few minor questions. How did you handle multiple sentences per video during training and testing, respectively? I think each sentence can be treated as an individual sample during training, but one sentence has to be selected per video during testing. Can you share your protocol for handling this?

Thanks again for your great work.

Andy1621 commented 9 months ago

Please check here https://github.com/OpenGVLab/unmasked_teacher/issues/12#issuecomment-1723248511. I have uploaded the test files we used.

ikodoh commented 9 months ago

Thanks for uploading the test file.

However, in the test file, one video has multiple sentences, and I still don't know how to handle them during training and testing. Can you provide further implementation details?

Andy1621 commented 9 months ago

Please check the warning first. For multiple sentences, you can refer to the code here. It simply uses all of the texts as ground truths.
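To make "multiple texts as ground truths" concrete, here is a minimal sketch of how an annotation list could be expanded into index maps. It is not taken from the repository: the function name `build_retrieval_index` and the field names `"video"` and `"caption"` are assumptions made for illustration, and the real annotation format may differ.

```python
from typing import Dict, List, Tuple


def build_retrieval_index(
    annotations: List[dict],
) -> Tuple[List[str], List[str], Dict[int, int], Dict[int, List[int]]]:
    """Flatten per-video captions and record which text belongs to which video.

    Assumed (hypothetical) annotation format:
        {"video": "<path>", "caption": "<str>" or ["<str>", ...]}
    """
    videos: List[str] = []
    texts: List[str] = []
    txt2vid: Dict[int, int] = {}        # text index  -> its single ground-truth video
    vid2txt: Dict[int, List[int]] = {}  # video index -> all of its ground-truth texts

    for vid_id, ann in enumerate(annotations):
        videos.append(ann["video"])
        captions = ann["caption"]
        if isinstance(captions, str):
            # A single caption is treated as a one-element list.
            captions = [captions]
        vid2txt[vid_id] = []
        for cap in captions:
            txt_id = len(texts)
            texts.append(cap)
            txt2vid[txt_id] = vid_id
            vid2txt[vid_id].append(txt_id)

    return videos, texts, txt2vid, vid2txt
```

With maps like these, every caption is its own text query, while each video keeps a list of all the captions that count as correct matches for it.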

ikodoh commented 9 months ago

Thanks for the response.

As I understand it, each of the multiple sentences for a video is treated as an individual sample. This causes no problem in training, but I think one video has to match one sentence during testing, since one sentence has to be retrieved by one video. How did you deal with this?

Andy1621 commented 9 months ago

In my opinion, during testing, it is like a question with multiple answers.

Andy1621 commented 9 months ago

You can check the code in https://github.com/OpenGVLab/unmasked_teacher/blob/8e06491336f8b1692b4ffafd99d23d0fadf362d3/multi_modality/tasks/retrieval_utils.py#L400-L420
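As a rough illustration of the "question with multiple answers" idea, the sketch below computes recall when a video may have several ground-truth texts. It is a simplified, assumption-laden example rather than a copy of the linked file: the function names, the `sim_v2t` score matrix layout, and the `txt2vid` / `vid2txt` maps are all hypothetical. For video-to-text, a video counts as a hit at k if its best-ranked ground-truth text is within the top k; for text-to-video, each text has exactly one ground-truth video.

```python
from typing import Dict, List, Sequence


def rank_of(scores: Sequence[float], target: int) -> int:
    """Return the 0-based position of `target` when candidates are sorted by descending score."""
    order = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return order.index(target)


def recall_at_k(ranks: Sequence[int], k: int) -> float:
    """Percentage of queries whose ground truth is ranked within the top k."""
    return 100.0 * sum(1 for r in ranks if r < k) / len(ranks)


def retrieval_recall(
    sim_v2t: List[List[float]],    # assumed shape: [num_videos][num_texts]
    txt2vid: Dict[int, int],       # text index  -> ground-truth video index
    vid2txt: Dict[int, List[int]], # video index -> list of ground-truth text indices
    ks: Sequence[int] = (1, 5, 10),
) -> Dict[str, float]:
    num_videos = len(sim_v2t)
    num_texts = len(sim_v2t[0])

    # Video-to-text: several texts may be correct, so keep the best (lowest) rank.
    v2t_ranks = [
        min(rank_of(sim_v2t[v], t) for t in vid2txt[v]) for v in range(num_videos)
    ]

    # Text-to-video: each text query has a single correct video.
    t2v_ranks = []
    for t in range(num_texts):
        column = [sim_v2t[v][t] for v in range(num_videos)]
        t2v_ranks.append(rank_of(column, txt2vid[t]))

    metrics: Dict[str, float] = {}
    for k in ks:
        metrics[f"v2t_r{k}"] = recall_at_k(v2t_ranks, k)
        metrics[f"t2v_r{k}"] = recall_at_k(t2v_ranks, k)
    return metrics
```

Under this scheme no single sentence has to be selected per video at test time: all captions stay in the text pool, and only the scoring rule for the video-to-text direction changes.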

ikodoh commented 8 months ago

Hi, I'd like to cite your work as a baseline in our project. Can you provide the updated text-to-video and video-to-text retrieval results on MSVD? The current results are extremely high.

Andy1621 commented 8 months ago

Hi! I have updated some results in MODEL_ZOO. b_17M is still running.

Andy1621 commented 8 months ago

All the results have been updated.

ikodoh commented 8 months ago

Thank you for sharing the results.