InternLM / InternLM-XComposer

InternLM-XComposer2 is a groundbreaking vision-language large model (VLLM) excelling in free-form text-image composition and comprehension.

Can you provide some examples of when to choose InternLM-XComposer2 vs. InternLM-XComposer2-VL? #327

Open cocoshe opened 3 weeks ago

cocoshe commented 3 weeks ago

Thanks for your great work! I'm really interested in it and have some questions here~

InternLM-XComposer2-VL-7B 🤗 : The multi-task trained VLLM model with InternLM2-7B as the initialization of the LLM for VL benchmarks and AI assistant. It ranks as the most powerful vision-language model based on 7B-parameter level LLMs, leading across 13 benchmarks.

InternLM-XComposer2-7B 🤗: The further instruction tuned VLLM for Interleaved Text-Image Composition with free-form inputs.

InternLM-XComposer2-7B clearly expects a prompt format like:

Hello! <imageA> What's in the image, and in <imageB>, what can you see?

That is, free-form interleaved text-image input.
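
To make the question concrete, here is a rough sketch of how I imagine that interleaved query would be assembled. The `<ImageHere>` placeholder is borrowed from the single-image chat example; whether the composition model accepts several placeholders plus a list of image paths like this is only my guess, not something I found in the docs:

```python
# Purely illustrative: my guess at how a free-form interleaved query would be built.
# "<ImageHere>" is the placeholder used in the repo's single-image chat example;
# passing multiple placeholders plus a list of image paths is my assumption.
images = ["./imageA.jpg", "./imageB.jpg"]  # hypothetical local image paths
query = (
    "Hello! <ImageHere> What's in the image, "
    "and in <ImageHere>, what can you see?"
)
# e.g. something along the lines of: model.chat(tokenizer, query=query, image=images, ...)
print(query)
```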

But when should InternLM-XComposer2-VL-7B be used? Only with single-image inputs on benchmarks, i.e., image comprehension without interleaved text? I'm confused about the input format.
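
For reference, the way I've been calling the VL model is a single-image chat, roughly following what I understand from the Hugging Face model card (please correct me if the chat signature or usage is wrong):

```python
import torch
from transformers import AutoModel, AutoTokenizer

torch.set_grad_enabled(False)

ckpt = "internlm/internlm-xcomposer2-vl-7b"
# trust_remote_code=True loads the custom IXC2 modeling and chat code from the Hub.
model = AutoModel.from_pretrained(ckpt, trust_remote_code=True).cuda().eval()
tokenizer = AutoTokenizer.from_pretrained(ckpt, trust_remote_code=True)

# "<ImageHere>" marks where the image features are injected into the prompt.
query = "<ImageHere>Please describe this image in detail."
image = "./image1.webp"  # any local image path

with torch.cuda.amp.autocast():
    response, _ = model.chat(
        tokenizer, query=query, image=image, history=[], do_sample=False
    )
print(response)
```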

So, my questions are:

  1. What input format does each of InternLM-XComposer2 and InternLM-XComposer2-VL expect? For the VL version in particular, could you give some example situations?
  2. Where is the boundary between the two versions? If I pick the wrong one, could it significantly hurt the results (for example, using InternLM-XComposer2 for benchmark testing, or the VL version for an interleaved composition scenario)?
  3. Is InternLM-XComposer2 trained on top of InternLM-XComposer2-VL with extra interleaved data for instruction tuning? And how is the VL version of InternLM-XComposer2 trained?

Thanks again for your great work~