dvlab-research / MGM

Official repo for "Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models"
Apache License 2.0

About ideas to further enhance the performance #78

Open lucasjinreal opened 5 months ago

lucasjinreal commented 5 months ago

Hi, I have run experiments applying the Mini-Gemini architecture to the Qwen series of models, and it performs well.

However, the performance is not strong enough compared with some SOTA small models such as MiniCPM-V 2 and LLaVA-UHD, which use very large input resolutions and image-slicing techniques.

So I am wondering: how can we push the boundary of Mini-Gemini further, and make Mini-Gemini great again?

The baseline I currently get from Qwen-7B is roughly the same as Gemma-7B's on MMMU, which is not very satisfying.

Here are some thoughts for further improvement:

  1. Use a larger input resolution for CLIP-ViT, since it determines the final visual token count (see the token-count sketch after this list). I tried enlarging 336 -> 448, which yields 1024 visual tokens, but surprisingly the result got worse;
  2. Replace the MLP projector with a Resampler (a rough sketch also follows below). I tried this, but the loss did not converge.
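For reference on point 1, the visual token count of a plain ViT is just the squared number of patches per side. A minimal sketch, assuming CLIP ViT-L/14 (patch size 14):

```python
def clip_vit_token_count(image_size: int, patch_size: int = 14) -> int:
    # A ViT splits the image into (image_size / patch_size)^2 patches,
    # and each patch becomes one visual token.
    assert image_size % patch_size == 0
    return (image_size // patch_size) ** 2

print(clip_vit_token_count(336))  # 576 visual tokens
print(clip_vit_token_count(448))  # 1024 visual tokens
```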
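And on point 2, here is a minimal sketch of the kind of Perceiver-style resampler I mean; the module and parameter names are illustrative, not MGM's actual code. The LayerNorms and the small init scale on the latent queries are the stability knobs I would double-check first, since a freshly initialized cross-attention layer is easy to destabilize:

```python
import torch
import torch.nn as nn

class Resampler(nn.Module):
    # Perceiver-style resampler: a fixed set of learned latent queries
    # cross-attends to the ViT patch features, compressing num_patches
    # tokens down to num_queries tokens before they reach the LLM.
    def __init__(self, dim: int, num_queries: int = 64, num_heads: int = 8):
        super().__init__()
        # Small init scale on the queries helps early-training stability.
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_patches, dim) visual features from CLIP-ViT
        q = self.norm_q(self.queries).expand(x.size(0), -1, -1)
        kv = self.norm_kv(x)
        out, _ = self.attn(q, kv, kv)
        return out  # (batch, num_queries, dim)

feats = torch.randn(2, 1024, 1024)    # e.g. 1024 patch tokens at 448px
tokens = Resampler(dim=1024)(feats)   # -> (2, 64, 1024)
```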

So here is what I want to discuss: how exactly should we improve it?

Hoping for your discussion and insights; please point me in the right direction.

DePengW commented 5 months ago

I also switched the LLM to Qwen1.5, and the performance improved somewhat.

  1. In the encoder section, I replaced it with DeepSeek-VL's hybrid encoder, and there was also a small gain (a rough sketch follows this list).
  2. In the data section, the ALLaVA data itself is somewhat dirty; I cleaned a batch of it and then manually translated a Chinese version. I also added some InternLM-XComposer data, which brought another small gain.
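Roughly, the hybrid encoder I mean looks like the sketch below: a low-res semantic branch fused with a high-res detail branch. The `semantic_vit` and `detail_vit` modules are placeholders (think SigLIP and SAM-ViT backbones), not real library APIs, and the sketch assumes both branches are pooled or reshaped to the same token count N:

```python
import torch
import torch.nn as nn

class HybridEncoder(nn.Module):
    # Fuse a low-res semantic branch with a high-res detail branch by
    # concatenating per-token features, then project to the LLM width.
    # Assumes both branches emit the same number of tokens N.
    def __init__(self, semantic_vit: nn.Module, detail_vit: nn.Module,
                 sem_dim: int, det_dim: int, out_dim: int):
        super().__init__()
        self.semantic_vit = semantic_vit  # placeholder, e.g. SigLIP
        self.detail_vit = detail_vit      # placeholder, e.g. SAM-ViT
        self.proj = nn.Linear(sem_dim + det_dim, out_dim)

    def forward(self, img_lo: torch.Tensor, img_hi: torch.Tensor) -> torch.Tensor:
        sem = self.semantic_vit(img_lo)   # (B, N, sem_dim) low-res tokens
        det = self.detail_vit(img_hi)     # (B, N, det_dim) high-res tokens
        return self.proj(torch.cat([sem, det], dim=-1))  # (B, N, out_dim)
```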

I feel our directions are very similar; if you are interested, you can leave contact information so we can communicate.

OpenJarvisAI commented 5 months ago

ALLaVA already has a Chinese version. What do you mean by DeepSeek's hybrid? Mini-Gemini is already a hybrid architecture. InternLM-XComposer data could be even dirtier, and the ShareGPT4V dataset should already be included.

DePengW commented 5 months ago

There is a Chinese version of ALLaVA, but both the Chinese and English versions are dirty. In the Chinese version there are many cases of image-text mismatch, translation misalignment, and translation hallucination. For example, grep for "宁静湖畔" ("tranquil lakeside") in allava-cn: the hits are very likely image-text mismatches. It is therefore necessary to clean both allava-en and allava-cn, and adding the cleaned allava-cn also improves the metrics. A rough sketch of the kind of mismatch filter I mean follows.
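This uses the off-the-shelf CLIP from `transformers` to score image-caption agreement and drop low-similarity pairs. The model choice and threshold are assumptions to tune, and for the Chinese captions in allava-cn you would want a multilingual CLIP variant instead of the OpenAI one:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Assumed model choice; swap in a multilingual CLIP for Chinese captions.
model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

def is_mismatched(image_path: str, caption: str, threshold: float = 0.2) -> bool:
    # Cosine similarity between image and caption embeddings; pairs
    # below the (tunable) threshold are flagged as likely mismatches.
    inputs = processor(text=[caption], images=Image.open(image_path),
                       return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        out = model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return (img @ txt.T).item() < threshold
```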

Mini-Gemini does have a hybrid structure, but in my experiments DeepSeek-VL's was slightly better.

By InternLM-XComposer data I specifically mean the SFT-phase data, such as the A-OKVQA, OKVQA, and LVIS data.

OpenJarvisAI commented 5 months ago

How did you clean the ALLaVA data and manually translate the Chinese version? Would you share the data after cleaning? That would be very nice. Also, has InternLM-XComposer open-sourced their SFT data?