InternLM / InternLM-XComposer

InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output

How to use Wanjuan dataset #216

Open alexwangmac opened 3 months ago

alexwangmac commented 3 months ago

Hi, thank you for your amazing work. I would like to ask how the Wanjuan dataset was used. I noticed in the article that you used Wanjuan during the pretraining phase. This dataset contains a mix of text and images, which differs from the format of VQA data, so I am curious how it was incorporated into the training process. Did you need to construct question prompts similar to VQA, such as "Given the context and the image, continue the passage"? Additionally, the content of the Wanjuan dataset seems quite diverse and only partially related to the images. Does training on it pose a significant challenge, or conflict with the other data and result in suboptimal performance?
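
To make the question concrete, here is a minimal sketch of the two formatting options being contrasted. Everything in it is hypothetical: the `<image>` placeholder, the `ImageRef` type, and both helper functions are illustrative assumptions, not taken from the InternLM-XComposer code or the Wanjuan release, and neither option is claimed to be what the authors actually did.

```python
from dataclasses import dataclass
from typing import Dict, List, Union

IMAGE_TOKEN = "<image>"  # assumed placeholder token, not from the repo


@dataclass
class ImageRef:
    """Hypothetical stand-in for an image inside an interleaved document."""
    path: str


Segment = Union[str, ImageRef]


def as_interleaved(segments: List[Segment]) -> Dict[str, object]:
    """Option A: keep document order and swap each image for a placeholder.

    The resulting text would be trained with plain next-token prediction,
    with no question prompt added.
    """
    parts: List[str] = []
    images: List[str] = []
    for seg in segments:
        if isinstance(seg, ImageRef):
            parts.append(IMAGE_TOKEN)
            images.append(seg.path)
        else:
            parts.append(seg)
    return {"text": "".join(parts), "images": images}


def as_vqa_style(segments: List[Segment], split_at: int) -> Dict[str, object]:
    """Option B: wrap the first `split_at` segments in a VQA-like prompt.

    The remaining segments become the answer the model should produce.
    """
    context = as_interleaved(segments[:split_at])
    target = as_interleaved(segments[split_at:])
    prompt = (
        "Given the context and the image, continue the passage.\n"
        + context["text"]
    )
    return {
        "prompt": prompt,
        "answer": target["text"],
        "images": context["images"] + target["images"],
    }


if __name__ == "__main__":
    doc = ["A cat sits. ", ImageRef("cat.jpg"), "It sleeps."]
    print(as_interleaved(doc))
    print(as_vqa_style(doc, split_at=2))
```

In option A the loss would cover the whole document, while in option B it would presumably cover only the `answer` field. Which of these (if either) matches the actual pretraining setup is exactly what I am asking.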