QwenLM / Qwen-VL

The official repo of Qwen-VL (通义千问-VL) chat & pretrained large vision language model proposed by Alibaba Cloud.

Questions about the data usage in Multi-task Pretraining stage #89

Open yuezewang opened 1 year ago

yuezewang commented 1 year ago

Hi, thanks for your amazing work! May I ask about your insight or perspective on the usage of image-text interleaved datasets (rather than manually packed ones)? In other words, did you consider using naturally interleaved image-text data, or is there a problem with using it? Looking forward to your reply, thanks~

ShuaiBai623 commented 11 months ago

Yes, manually packed data is used to better train the model for few-shot learning. Of course, naturally interleaved image-text data is also valuable, especially for image-text association and image-image understanding. We are exploring this aspect as well.
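
For readers unfamiliar with the distinction discussed above, here is a minimal sketch of the two data layouts. The `<img>...</img>` placeholder format and the helper names (`pack_pairs_fewshot`, `flatten_interleaved`) are assumptions made for illustration only, not the repo's actual preprocessing code.

```python
# Illustrative sketch (not the actual Qwen-VL pipeline): the image placeholder
# format and helper names below are assumptions for demonstration only.
from typing import List, Tuple


def pack_pairs_fewshot(pairs: List[Tuple[str, str]]) -> str:
    """Manually pack independent (image_path, caption) pairs into one sequence.

    Concatenating unrelated pairs mimics a few-shot prompt: the model sees
    several image -> caption "examples" inside a single context window.
    """
    segments = [f"<img>{img}</img>{caption}" for img, caption in pairs]
    return "".join(segments)


def flatten_interleaved(document: List[Tuple[str, str]]) -> str:
    """Flatten a naturally interleaved document, preserving its original order.

    Each element is ("image", path) or ("text", span); images and text keep
    the positions they had in the source document, so cross-modal references
    (e.g. "the figure above") remain intact.
    """
    out = []
    for kind, content in document:
        out.append(f"<img>{content}</img>" if kind == "image" else content)
    return "".join(out)


if __name__ == "__main__":
    # Manually packed: three unrelated caption pairs form a pseudo few-shot prompt.
    packed = pack_pairs_fewshot([
        ("cat.jpg", "A cat sleeping on a sofa."),
        ("dog.jpg", "A dog catching a frisbee."),
        ("bird.jpg", "A bird perched on a branch."),
    ])
    # Naturally interleaved: text and images come from the same source page.
    natural = flatten_interleaved([
        ("text", "The recipe starts with fresh tomatoes."),
        ("image", "tomatoes.jpg"),
        ("text", "After roasting, blend them into a sauce."),
        ("image", "sauce.jpg"),
    ])
    print(packed)
    print(natural)
```

The packed sequence teaches the model to follow an in-context pattern across unrelated samples, whereas the naturally interleaved sequence preserves genuine image-text and image-image relationships from the original document, which is the property the reply above points to.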