TempleX98 / MoVA

[NeurIPS 2024] MoVA: Adapting Mixture of Vision Experts to Multimodal Context
Apache License 2.0

The Training Data Pipeline #10

Open lixu6-alt opened 2 weeks ago

lixu6-alt commented 2 weeks ago

Dear authors:

MoVA is a really impressive work! I am working on a similar idea of using the text instruction to guide the fusion of image tokens in MLLMs. However, I ran into an issue these days: the LLaVA-665K finetuning dataset contains many multi-turn conversations, which means one sample can involve multiple instructions. In this case, do we need to split each multi-turn conversation sample into multiple single-turn samples (since we can only encode one text instruction per sample in a forward pass)?

Thanks!

TempleX98 commented 1 day ago

During training, we keep the original data format and directly concatenate these multi-round questions into a single question for instruction-aware extraction.
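To make the data flow concrete, here is a minimal sketch (not the repository's actual code) of what that concatenation could look like for a LLaVA-665K style sample. The `conversations` / `from` / `value` layout and the `<image>` placeholder follow the public LLaVA JSON format; the `build_instruction` helper and the whitespace separator are assumptions for illustration.

```python
# Minimal sketch (assumed helper, not MoVA's actual implementation):
# concatenate the human turns of a LLaVA-style multi-turn sample into
# one text instruction for instruction-aware extraction.

def build_instruction(sample: dict) -> str:
    """Join all human questions of a multi-turn sample into a single instruction string."""
    questions = []
    for turn in sample.get("conversations", []):
        if turn.get("from") == "human":
            # Drop the image placeholder; only the question text guides extraction.
            text = turn["value"].replace("<image>", "").strip()
            if text:
                questions.append(text)
    return " ".join(questions)


if __name__ == "__main__":
    sample = {
        "conversations": [
            {"from": "human", "value": "<image>\nWhat is in the photo?"},
            {"from": "gpt", "value": "A dog on a beach."},
            {"from": "human", "value": "What color is the dog?"},
            {"from": "gpt", "value": "It is brown."},
        ]
    }
    print(build_instruction(sample))
    # -> "What is in the photo? What color is the dog?"
```

With this approach each sample keeps its original multi-turn format and still yields a single instruction for the vision-expert routing, so no splitting into single-turn samples is needed.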