PKU-YuanGroup / LanguageBind

【ICLR 2024🔥】 Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment
https://arxiv.org/abs/2310.01852
MIT License

Research on using the model for video captioning #8

Closed pphuc25 closed 6 months ago

pphuc25 commented 6 months ago

Hi, I find your project intriguing and believe it could greatly assist in working with multiple data sources. However, I noticed that you haven't mentioned how the embeddings your model produces can be used for downstream tasks such as video captioning. Do you have any plans to address this? I'd be interested to hear your ideas on how one could leverage your model for such tasks.
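
For concreteness, here is a rough sketch of the kind of thing I have in mind: freeze the video encoder, project its embedding, and use it as memory for a small caption decoder trained with cross-entropy. Everything here (dimensions, `CaptionHead`, the dummy tensors) is hypothetical plumbing, not your repo's API.

```python
import torch
import torch.nn as nn

class CaptionHead(nn.Module):
    """Toy caption decoder on top of a frozen video embedding.

    `embed_dim` stands in for the dimensionality of the (frozen) video
    embedding; `vocab_size`, `hidden`, and `max_len` are free choices here.
    """
    def __init__(self, embed_dim=768, vocab_size=32000, hidden=512, max_len=64):
        super().__init__()
        self.proj = nn.Linear(embed_dim, hidden)      # video embedding -> decoder memory
        self.tok = nn.Embedding(vocab_size, hidden)   # caption token embeddings
        self.pos = nn.Embedding(max_len, hidden)      # learned positional embeddings
        layer = nn.TransformerDecoderLayer(d_model=hidden, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.lm_head = nn.Linear(hidden, vocab_size)

    def forward(self, video_emb, caption_ids):
        # video_emb: (B, embed_dim) frozen embedding from the video tower
        # caption_ids: (B, T) gold caption tokens (teacher forcing)
        B, T = caption_ids.shape
        memory = self.proj(video_emb).unsqueeze(1)    # (B, 1, hidden)
        pos = torch.arange(T, device=caption_ids.device)
        x = self.tok(caption_ids) + self.pos(pos)     # (B, T, hidden)
        causal = nn.Transformer.generate_square_subsequent_mask(T).to(caption_ids.device)
        h = self.decoder(x, memory, tgt_mask=causal)
        return self.lm_head(h)                        # (B, T, vocab_size)

# Dummy usage: random tensors stand in for real encoder outputs and tokenized captions.
head = CaptionHead()
video_emb = torch.randn(4, 768)
caption_ids = torch.randint(0, 32000, (4, 16))
logits = head(video_emb, caption_ids)
loss = nn.functional.cross_entropy(logits[:, :-1].reshape(-1, 32000),
                                   caption_ids[:, 1:].reshape(-1))
```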

LinB203 commented 6 months ago

Yes, generative tasks make a lot of sense! At the moment, though, we are focusing only on discriminative tasks such as classification and retrieval, and we have no plans for video caption generation, because the key to that task is the data and we have already demonstrated the validity of ours. Our dataset draws on many sources, and as stated in the paper, different data sources can be used for different downstream tasks.

The generated data could perhaps be used to produce more semantically accurate captions; some recent work, such as BLIP, has observed that model-generated captions benefit the model. Incidentally, in our internal experiments, mixing multiple sources of generated data gave better results. Also, since data from every modality are aligned to language, exploring interactions across multiple modalities could be a promising direction!
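
To illustrate what aligning every modality to language buys you, here is a minimal sketch of zero-shot cross-modal retrieval with cosine similarity over the shared embedding space. The random tensors (and the `encode_video` / `encode_text` names in the comments) are hypothetical stand-ins, not our actual API.

```python
import torch
import torch.nn.functional as F

def retrieve(video_embs: torch.Tensor, text_embs: torch.Tensor) -> torch.Tensor:
    """Rank candidate texts for each video by cosine similarity.

    video_embs: (N_videos, D), text_embs: (N_texts, D) -- both assumed to live
    in the same language-aligned embedding space.
    Returns an (N_videos, N_texts) similarity matrix.
    """
    v = F.normalize(video_embs, dim=-1)
    t = F.normalize(text_embs, dim=-1)
    return v @ t.T

# Dummy tensors standing in for real encoder outputs:
video_embs = torch.randn(3, 768)   # e.g. encode_video(["a.mp4", "b.mp4", "c.mp4"])
text_embs = torch.randn(5, 768)    # e.g. encode_text(["a dog barking", ...])

sim = retrieve(video_embs, text_embs)
best_text_per_video = sim.argmax(dim=-1)   # text index retrieved for each video
print(best_text_per_video)

# Because every modality is aligned to language, the same code applies to
# audio->text or depth->text retrieval, and even cross-modal pairs like
# audio->video via the shared space.
```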