How to use Wanjuan dataset

Hi, thank you for your amazing work. I have a question about how the Wanjuan dataset is used. I noticed in the paper that you used Wanjuan during the pretraining phase. This dataset contains a mix of text and images, which differs from the format of VQA data, so I'm curious how it was incorporated into the training process. Did you need to construct question prompts similar to VQA, such as "Given the context and the image, continue the passage"? Additionally, the content in Wanjuan seems quite diverse, and the text is often only partially related to the images. Does training on this dataset pose a significant challenge, or does it conflict with the other data and lead to suboptimal performance?

InternLM / InternLM-XComposer, issue #216