PKU-YuanGroup / LanguageBind

【ICLR 2024🔥】 Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment
https://arxiv.org/abs/2310.01852
MIT License

Research on using the model for video captioning #8

Closed pphuc25 closed 6 months ago

pphuc25 commented 6 months ago

Hi, I find your project intriguing and believe it could greatly assist in working with multiple data sources. However, I noticed that you haven't mentioned how the embeddings your model produces can be used for downstream tasks such as video captioning. Do you have any plans to address this? I'd be interested to hear your ideas on how one could leverage your model for such tasks.
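
For concreteness, here is a rough sketch of the kind of thing I have in mind: freeze the video encoder, project its embedding, and use it as memory for a small caption decoder trained with cross-entropy. Everything here (dimensions, `CaptionHead`, the dummy tensors) is hypothetical plumbing, not your repo's API.

```python
import torch
import torch.nn as nn

class CaptionHead(nn.Module):
    """Toy caption decoder on top of a frozen video embedding.

    `embed_dim` stands in for the dimensionality of the (frozen) video
    embedding; `vocab_size`, `hidden`, and `max_len` are free choices here.
    """
    def __init__(self, embed_dim=768, vocab_size=32000, hidden=512, max_len=64):
        super().__init__()
        self.proj = nn.Linear(embed_dim, hidden)      # video embedding -> decoder memory
        self.tok = nn.Embedding(vocab_size, hidden)   # caption token embeddings
        self.pos = nn.Embedding(max_len, hidden)      # learned positional embeddings
        layer = nn.TransformerDecoderLayer(d_model=hidden, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.lm_head = nn.Linear(hidden, vocab_size)

    def forward(self, video_emb, caption_ids):
        # video_emb: (B, embed_dim) frozen embedding from the video tower
        # caption_ids: (B, T) gold caption tokens (teacher forcing)
        B, T = caption_ids.shape
        memory = self.proj(video_emb).unsqueeze(1)    # (B, 1, hidden)
        pos = torch.arange(T, device=caption_ids.device)
        x = self.tok(caption_ids) + self.pos(pos)     # (B, T, hidden)
        causal = nn.Transformer.generate_square_subsequent_mask(T).to(caption_ids.device)
        h = self.decoder(x, memory, tgt_mask=causal)
        return self.lm_head(h)                        # (B, T, vocab_size)

# Dummy usage: random tensors stand in for real encoder outputs and tokenized captions.
head = CaptionHead()
video_emb = torch.randn(4, 768)
caption_ids = torch.randint(0, 32000, (4, 16))
logits = head(video_emb, caption_ids)
loss = nn.functional.cross_entropy(logits[:, :-1].reshape(-1, 32000),
                                   caption_ids[:, 1:].reshape(-1))
```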

LinB203 commented 6 months ago

Yes, generative tasks make a lot of sense! At the moment, though, we are focusing only on discriminative tasks such as classification and retrieval, and we have no plans for video caption generation, because the key to that task is the data and we have already demonstrated the validity of ours. Our dataset draws on many sources, and as stated in the paper, different data sources can be used for different downstream tasks.

The generated data could perhaps be used to produce more semantically accurate captions; some recent work, such as BLIP, has observed that model-generated captions benefit the model. Incidentally, in our internal experiments, mixing multiple sources of generated data gave better results. Also, since data from every modality are aligned to language, exploring interactions across multiple modalities could be a promising direction!
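
To illustrate what aligning every modality to language buys you, here is a minimal sketch of zero-shot cross-modal retrieval with cosine similarity over the shared embedding space. The random tensors (and the `encode_video` / `encode_text` names in the comments) are hypothetical stand-ins, not our actual API.

```python
import torch
import torch.nn.functional as F

def retrieve(video_embs: torch.Tensor, text_embs: torch.Tensor) -> torch.Tensor:
    """Rank candidate texts for each video by cosine similarity.

    video_embs: (N_videos, D), text_embs: (N_texts, D) -- both assumed to live
    in the same language-aligned embedding space.
    Returns an (N_videos, N_texts) similarity matrix.
    """
    v = F.normalize(video_embs, dim=-1)
    t = F.normalize(text_embs, dim=-1)
    return v @ t.T

# Dummy tensors standing in for real encoder outputs:
video_embs = torch.randn(3, 768)   # e.g. encode_video(["a.mp4", "b.mp4", "c.mp4"])
text_embs = torch.randn(5, 768)    # e.g. encode_text(["a dog barking", ...])

sim = retrieve(video_embs, text_embs)
best_text_per_video = sim.argmax(dim=-1)   # text index retrieved for each video
print(best_text_per_video)

# Because every modality is aligned to language, the same code applies to
# audio->text or depth->text retrieval, and even cross-modal pairs like
# audio->video via the shared space.
```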