InternLM / InternLM-XComposer

InternLM-XComposer2 is a groundbreaking vision-language large model (VLLM) excelling in free-form text-image composition and comprehension.

Can you provide some examples of when to choose InternLM-XComposer2 vs. InternLM-XComposer2-VL? #327

Open cocoshe opened 3 weeks ago

cocoshe commented 3 weeks ago

Thanks for your great work! I'm really interested in it and have some questions here~

InternLM-XComposer2-VL-7B 🤗 : The multi-task trained VLLM model with InternLM2-7B as the initialization of the LLM for VL benchmarks and AI assistant. It ranks as the most powerful vision-language model based on 7B-parameter level LLMs, leading across 13 benchmarks.

InternLM-XComposer2-7B 🤗: The further instruction tuned VLLM for Interleaved Text-Image Composition with free-form inputs.

InternLM-XComposer2-7B clearly expects a prompt format like:

Hello! <imageA> What's in the image, and in <imageB>, what can you see?

That is, free-form interleaved text-image input.
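
To make the question concrete, here is a rough sketch of how I imagine that interleaved query would be assembled. The `<ImageHere>` placeholder is borrowed from the single-image chat example; whether the composition model accepts several placeholders plus a list of image paths like this is only my guess, not something I found in the docs:

```python
# Purely illustrative: my guess at how a free-form interleaved query would be built.
# "<ImageHere>" is the placeholder used in the repo's single-image chat example;
# passing multiple placeholders plus a list of image paths is my assumption.
images = ["./imageA.jpg", "./imageB.jpg"]  # hypothetical local image paths
query = (
    "Hello! <ImageHere> What's in the image, "
    "and in <ImageHere>, what can you see?"
)
# e.g. something along the lines of: model.chat(tokenizer, query=query, image=images, ...)
print(query)
```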

But when should InternLM-XComposer2-VL-7B be used? Only with single-image inputs on benchmarks, i.e., image comprehension without interleaved text? I'm confused about the input format.
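
For reference, the way I've been calling the VL model is a single-image chat, roughly following what I understand from the Hugging Face model card (please correct me if the chat signature or usage is wrong):

```python
import torch
from transformers import AutoModel, AutoTokenizer

torch.set_grad_enabled(False)

ckpt = "internlm/internlm-xcomposer2-vl-7b"
# trust_remote_code=True loads the custom IXC2 modeling and chat code from the Hub.
model = AutoModel.from_pretrained(ckpt, trust_remote_code=True).cuda().eval()
tokenizer = AutoTokenizer.from_pretrained(ckpt, trust_remote_code=True)

# "<ImageHere>" marks where the image features are injected into the prompt.
query = "<ImageHere>Please describe this image in detail."
image = "./image1.webp"  # any local image path

with torch.cuda.amp.autocast():
    response, _ = model.chat(
        tokenizer, query=query, image=image, history=[], do_sample=False
    )
print(response)
```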

So, my questions are:

  1. What input format does each of InternLM-XComposer2 and InternLM-XComposer2-VL expect? For the VL version in particular, could you give some example situations?
  2. Where is the boundary between the two versions? If I pick the wrong one, could it significantly hurt the results (for example, using InternLM-XComposer2 for benchmark testing, or the VL version for an interleaved composition scenario)?
  3. Is InternLM-XComposer2 trained on top of InternLM-XComposer2-VL with extra interleaved data for instruction tuning? And how is the VL version of InternLM-XComposer2 trained?

Thanks again for your great work~