Open EchoDreamer opened 1 month ago
Thanks for your interest in our work!
Does the text token in the legend of Figure 5a include the system prompt?
No, the textual input to the Modality Adaptation module is the raw prompt, without the extra tokens (such as the system prompt) required by the LLM.
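To make the distinction concrete, here is a minimal sketch (hypothetical, not the authors' code) contrasting the raw prompt with the full LLaVA-style conversation template that the LLM itself consumes; the template string and variable names below are illustrative assumptions only:

```python
# Hypothetical illustration: what feeds the Modality Adaptation module vs. the LLM.
raw_prompt = "What color is the car in the image?"

# LLaVA-style input to the LLM (for illustration): system prompt + role/special tokens.
llm_input = (
    "A chat between a curious human and an artificial intelligence assistant. "
    "The assistant gives helpful, detailed, and polite answers to the human's questions. "
    f"USER: <image>\n{raw_prompt} ASSISTANT:"
)

# Per the reply above, only the raw instruction (no system prompt, no role/special
# tokens) is passed to the text branch of the Modality Adaptation module.
modality_adaptation_text_input = raw_prompt
```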
How does the performance compare when the training data is the same as that of LLaVA?
All the analysis presented in the Method section (including Figure 3) was performed on the EMMA model trained on the same dataset as LLaVA (as mentioned in the paper).
I would like to ask another question: has there been an attempt to train LLaVA 1.5 on all 1.8M data without modifying the architecture? If so, how effective was it? Thanks! Also, are you planning to release any of the additional processed datasets (LVIS-Instruct4V, CLEVR, VizWiz, ScienceQA)?
Great work! I’d like to ask about two details in this paper:
Does the text token in the legend of Figure 5a include the system prompt, such as "The assistant gives helpful, detailed, and polite answers to the human's questions"? If so, it seems the focus is more on the system prompt than on the specific instruction.
How does the performance compare when the training data is the same as that of LLaVA?