MoVA is really impressive work! I am working on a similar idea of using the text instruction to guide the fusion of image tokens in MLLMs. However, I have run into an issue: the LLaVA-665K finetuning dataset contains many multi-turn conversations, which means one sample can involve multiple instructions. In this case, do we need to split each multi-turn conversation sample into multiple single-turn samples (since we can only encode one text instruction per sample in a forward pass)?
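For reference, a multi-turn sample in the LLaVA conversation format looks roughly like this (an illustrative sketch, not an exact entry from LLaVA-665K):

```python
# Illustrative multi-turn sample in the LLaVA conversation format:
# one image paired with several human instructions in a single sample.
sample = {
    "id": "000000123456",
    "image": "coco/train2017/000000123456.jpg",
    "conversations": [
        {"from": "human", "value": "<image>\nWhat is the man holding?"},
        {"from": "gpt", "value": "He is holding a red umbrella."},
        {"from": "human", "value": "Is it raining in the picture?"},
        {"from": "gpt", "value": "Yes, the street looks wet."},
    ],
}
```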
During training, we keep the original data format and directly concatenate these multi-round questions into a single question for instruction-aware extraction.
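A minimal sketch of what that concatenation could look like, assuming the LLaVA-style `conversations` format shown above; the helper name, separator, and `<image>` token handling are illustrative assumptions rather than the exact MoVA implementation:

```python
def build_instruction(sample: dict) -> str:
    """Concatenate all human turns of a multi-turn sample into a single
    instruction string for instruction-aware feature extraction.
    The separator and <image> stripping are illustrative choices."""
    questions = [
        turn["value"].replace("<image>", "").strip()
        for turn in sample["conversations"]
        if turn["from"] == "human"
    ]
    return " ".join(questions)

# e.g. build_instruction(sample) returns
# "What is the man holding? Is it raining in the picture?"
```

In other words, the sample itself is not split; the original multi-turn format is kept, and only the instruction fed to the instruction-aware extraction step is built by concatenating the questions.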