SaraGhazanfari / EMMA

Apache License 2.0

Request for clarification on the text tokens in Figure 5a and the performance compared with LLaVA under the same dataset setting #1

Open EchoDreamer opened 1 month ago

EchoDreamer commented 1 month ago

Great work! I’d like to ask about two details in this paper:

Do the text tokens in the legend of Figure 5a include the system prompt, e.g., "The assistant gives helpful, detailed, and polite answers to the human's questions"? If so, it seems the focus is more on the system prompt than on the specific instruction.

How does the performance compare when the training data is the same as that of LLaVA?

SaraGhazanfari commented 1 month ago

Thanks for your interest in our work!

Do the text tokens in the legend of Figure 5a include the system prompt?

No, the textual input to the Modality Adaptation module is the raw prompt without the added tokens required for the LLM.
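To make the distinction concrete, here is a minimal sketch (not the authors' code) of the difference between tokenizing the raw instruction and tokenizing the full LLM prompt with the system prompt and role markers prepended. The tokenizer choice and the Vicuna-style template below are assumptions for illustration only; the Figure 5a "text tokens" count would correspond to the raw instruction.

```python
# Hypothetical illustration, assuming a HuggingFace tokenizer (e.g., the Vicuna one used by LLaVA-1.5).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("lmsys/vicuna-7b-v1.5")

instruction = "What color is the car in the image?"
system_prompt = (
    "A chat between a curious human and an artificial intelligence assistant. "
    "The assistant gives helpful, detailed, and polite answers to the human's questions."
)

# Raw instruction only -- per the reply above, this is what the Modality Adaptation module sees.
raw_ids = tokenizer(instruction)["input_ids"]

# Full prompt with the system prompt and role markers, as fed to the LLM (Vicuna-style template).
full_prompt = f"{system_prompt} USER: {instruction} ASSISTANT:"
full_ids = tokenizer(full_prompt)["input_ids"]

print(len(raw_ids), len(full_ids))  # the raw instruction is much shorter than the full LLM prompt
```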

How does the performance compare when the training data is the same as that of LLaVA?

All the analysis presented in the Method section (including Figure 3) was performed on the EMMA model trained on the same dataset as LLaVA (as mentioned in the paper).

EchoDreamer commented 4 weeks ago

I would like to ask another question: has there been an attempt to train LLaVA-1.5 on all 1.8M samples without modifying the architecture? If so, how effective was it? Thanks! Also, are you planning to release any of the additional processed datasets (LVIS-Instruct4V, CLEVR, VizWiz, ScienceQA)?