MoVA is really impressive work! I am working on a similar idea of using the text instruction to guide the fusion of image tokens in MLLMs. However, I have run into an issue: the LLaVA-665K finetuning dataset contains many multi-turn conversations, which means one sample can involve multiple instructions. In this case, do we need to split each multi-turn conversation sample into multiple single-turn samples (since we can only encode one text instruction per sample in a forward pass)?
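For reference, a multi-turn sample in the LLaVA conversation format looks roughly like this (an illustrative sketch, not an exact entry from LLaVA-665K):

```python
# Illustrative multi-turn sample in the LLaVA conversation format:
# one image paired with several human instructions in a single sample.
sample = {
    "id": "000000123456",
    "image": "coco/train2017/000000123456.jpg",
    "conversations": [
        {"from": "human", "value": "<image>\nWhat is the man holding?"},
        {"from": "gpt", "value": "He is holding a red umbrella."},
        {"from": "human", "value": "Is it raining in the picture?"},
        {"from": "gpt", "value": "Yes, the street looks wet."},
    ],
}
```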
During training, we keep the original data format and directly concatenate these multi-round questions into a single question for instruction-aware extraction.
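A minimal sketch of what that concatenation could look like, assuming the LLaVA-style `conversations` format shown above; the helper name, separator, and `<image>` token handling are illustrative assumptions rather than the exact MoVA implementation:

```python
def build_instruction(sample: dict) -> str:
    """Concatenate all human turns of a multi-turn sample into a single
    instruction string for instruction-aware feature extraction.
    The separator and <image> stripping are illustrative choices."""
    questions = [
        turn["value"].replace("<image>", "").strip()
        for turn in sample["conversations"]
        if turn["from"] == "human"
    ]
    return " ".join(questions)

# e.g. build_instruction(sample) returns
# "What is the man holding? Is it raining in the picture?"
```

In other words, the sample itself is not split; the original multi-turn format is kept, and only the instruction fed to the instruction-aware extraction step is built by concatenating the questions.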