Changed the ViT to 378 input resolution, but got poor results

dvlab-research / MGM

Official repo for "Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models"

Apache License 2.0


Open OpenJarvisAI opened 7 months ago

OpenJarvisAI commented 7 months ago

Hi, I have already tried ViT-336 and ConvNeXt with a Qwen LLM, which works great and gives good performance.

But when I switch to another CLIP ViT model with an input size of 378, keeping everything else the same (including the training data), the results are extremely poor.

To be precise:

The loss is lower: I normally get 0.9-1.0, but with the 378-input CLIP the loss goes down to 0.7-0.8, yet the inference results are very poor.
The CLIP model I used was Apple's DNFS_vit_G_378 model.
I have changed the ConvNeXt input resolution accordingly.

Is there any reason for this? It is really weird that a better and larger ViT gives worse results.

hhaAndroid commented 7 months ago

yanwei-li commented 7 months ago

Hi, thanks for your report. If there are no other bugs, I guess you can try the following steps to locate the problem:

If the performance is quite low (with over 10% performance drop), there may be some bugs in the implementation.
Apply only DNFS_vit_G_378, without patch info mining, to see whether the performance is satisfactory.
If the previous models are all good, try a larger ConvNeXt, such as CLIP-convnext_xxlarge, because a better ViT requires a stronger ConvNeXt to provide candidate keys and values for reference.
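
For the implementation-bug case in the first step, one cheap thing to rule out is a mismatch between the two feature grids after the resolution change. The sketch below is not MGM code: `vit_grid`, `convnext_grid`, and `check_alignment` are hypothetical helpers, and it assumes a ViT patch size of 14, a ConvNeXt total stride of 32, and that the high-resolution grid must be an integer multiple of the low-resolution token grid. Check all three against your actual encoders before relying on it.

```python
def vit_grid(image_size: int, patch_size: int = 14) -> int:
    """Tokens per side for a ViT (patch_size=14 is an assumption)."""
    if image_size % patch_size != 0:
        raise ValueError(f"{image_size} is not a multiple of patch size {patch_size}")
    return image_size // patch_size


def convnext_grid(image_size: int, total_stride: int = 32) -> int:
    """Final feature-map side for a ConvNeXt (total_stride=32 is an assumption)."""
    if image_size % total_stride != 0:
        raise ValueError(f"{image_size} is not a multiple of stride {total_stride}")
    return image_size // total_stride


def check_alignment(vit_size: int, aux_size: int,
                    patch_size: int = 14, total_stride: int = 32):
    """Return (low_res_grid, high_res_grid, ratio) or raise if they do not align.

    Assumption: every low-res token must map to an integer block of
    high-res features, so high_res_grid % low_res_grid must be 0.
    """
    low = vit_grid(vit_size, patch_size)
    high = convnext_grid(aux_size, total_stride)
    if high % low != 0:
        raise ValueError(f"high-res grid {high} is not a multiple of low-res grid {low}")
    return low, high, high // low


if __name__ == "__main__":
    print(check_alignment(336, 768))  # (24, 24, 1)
    print(check_alignment(378, 864))  # (27, 27, 1)
```

Under these assumptions, a 378 ViT yields a 27x27 token grid, so a 768 ConvNeXt input (24x24) no longer lines up, and `check_alignment(378, 768)` raises `ValueError`; 864 is the smallest auxiliary resolution that gives a matching 27x27 grid.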

OpenJarvisAI commented 7 months ago

Hi, I am starting to doubt whether Apple's ViT is correct; it seems they may have just posted the wrong weights...

Meanwhile, do you have any candidates for using ViT-H or ViT-bigG?

yanwei-li commented 7 months ago

Hi, we do not plan to retrain the model with a larger ViT, because it would exceed our current resources.

OpenJarvisAI commented 7 months ago

@yanwei-li Hi, which direction are you currently working on to further improve the performance of MGM?