SunzeY / AlphaCLIP

[CVPR 2024] Alpha-CLIP: A CLIP Model Focusing on Wherever You Want
https://aleafy.github.io/alpha-clip
Apache License 2.0

Question: Can you provide some guidance for finetuning MLLM with alpha-clip vision encoder? #24

Closed: XuRui314 closed this issue 6 months ago

XuRui314 commented 6 months ago

Fine-tuning LLaVA-1.5 with the Alpha-CLIP vision encoder is mentioned in Section 4.2 of the paper (Alpha-CLIP in MLLM). I'm wondering about the detailed settings. Do you train the model just following stage 1 and stage 2 of the GPT4RoI paper?

And if I want to train a general version of an Alpha-CLIP MLLM, can you provide some guidance for fine-tuning? Should we completely follow the LLaVA-1.5 settings and train the MLP projector and LLM from scratch, or should we just fine-tune the MLP layer on part of the data to align the original CLIP space with the Alpha-CLIP space?

SunzeY commented 6 months ago

We believe Alpha-CLIP is already well aligned with the CLIP space. For the experiment in Section 4.2, the setting is as follows:

  1. First stage: we follow exactly the same setting as LLaVA, training the randomly initialized MLP projector with the original CLIP. This stage has nothing to do with Alpha-CLIP.
  2. Second stage: we replace the original CLIP with Alpha-CLIP, keep the MLP projector pretrained in the first stage, and fine-tune LLaMA directly on RefCOCO or Visual Genome. Alpha-CLIP and the MLP projector remain frozen (see the sketch after this list).
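
To make the second-stage freezing concrete, here is a minimal sketch in PyTorch. It is not the authors' training script, and the submodule name patterns (`vision_tower`, `mm_projector`) are assumptions based on LLaVA-style naming; adjust them to whatever your wrapper model actually uses.

```python
import torch.nn as nn

def freeze_for_stage2(model: nn.Module) -> None:
    """Sketch: freeze Alpha-CLIP and the stage-1 MLP projector,
    leaving only the LLM trainable for second-stage fine-tuning.

    Assumes parameter names contain "vision_tower" (Alpha-CLIP) or
    "mm_projector" (stage-1 MLP); everything else is treated as the LLM.
    """
    for name, param in model.named_parameters():
        frozen = ("vision_tower" in name) or ("mm_projector" in name)
        # Only LLaMA weights receive gradients; the vision encoder and
        # pretrained projector stay fixed.
        param.requires_grad = not frozen
```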

At the code level, since we adopt the LLaVA training code, this second stage only involves:

  1. replacing the 665k fine-tuning image-text pairs with RefCOCO or VG as the training data;
  2. rewriting the CLIP forward method to take the additional GT bounding box (a rectangular area in the alpha map) as a fourth-channel input.

All the other code is the same as in the original LLaVA.
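
To illustrate point 2, here is a rough sketch of turning a GT bounding box into a rectangular alpha map and stacking it onto the preprocessed RGB tensor as a fourth channel. The function names and the assumption that the box comes in pixel coordinates at the CLIP input resolution are mine, not from the repo, and the released code may normalize or route the alpha map differently.

```python
import torch

def box_to_alpha_map(box_xyxy, image_size: int = 224) -> torch.Tensor:
    """Hypothetical helper: rasterize a GT box (x1, y1, x2, y2, in pixels at
    the CLIP input resolution) into a binary alpha map of shape (1, H, W)."""
    x1, y1, x2, y2 = (int(round(float(v))) for v in box_xyxy)
    alpha = torch.zeros(1, image_size, image_size)
    alpha[:, y1:y2, x1:x2] = 1.0  # 1 inside the region of interest, 0 elsewhere
    return alpha

def build_rgba_input(rgb: torch.Tensor, box_xyxy) -> torch.Tensor:
    """Stack the alpha map onto a preprocessed RGB tensor of shape (3, H, W),
    producing the 4-channel input a modified CLIP forward would consume."""
    alpha = box_to_alpha_map(box_xyxy, image_size=rgb.shape[-1])
    return torch.cat([rgb, alpha], dim=0)  # shape (4, H, W)
```

In practice the alpha channel would also go through whatever normalization the Alpha-CLIP preprocessing expects; the point here is only that the region supervision is a rectangular mask derived from the GT box.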

XuRui314 commented 6 months ago

Thanks for sharing