WisconsinAIVision / ViP-LLaVA

[CVPR2024] ViP-LLaVA: Making Large Multimodal Models Understand Arbitrary Visual Prompts
https://vip-llava.github.io/
Apache License 2.0

[Question] Finetune Stage 2 Model #15

Open Xuefei98 opened 2 months ago

Xuefei98 commented 2 months ago

Question

First of all, great work, and thank you so much for open-sourcing it! I wonder if the stage 2 model (referred to as ViP-LLaVA-Base) has been released anywhere? Maybe mucai/vip-llava-13b-pretrain? I am trying to fine-tune the stage 2 model on custom GPT instruction data. I am looking at scripts/finetune_stage3.sh and wonder if that is the correct script. However, the model used in the script is ./checkpoints/vip-llava-$model_size-stage2-ft, and I don't see it released anywhere. Thank you!
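For context, this is roughly the invocation I expect to end up with. It is only a minimal sketch, assuming ViP-LLaVA keeps LLaVA's training entry point and flag names; my data paths are placeholders, and I have left out the other hyperparameter flags the script already sets:

```bash
# Sketch only: assumes the LLaVA-style trainer and flags; keep the remaining
# flags from scripts/finetune_stage3.sh as they are.
model_size=7b
deepspeed llava/train/train_mem.py \
    --model_name_or_path ./checkpoints/vip-llava-${model_size}-stage2-ft \
    --data_path /path/to/custom_gpt_instruction_data.json \
    --image_folder /path/to/images \
    --output_dir ./checkpoints/vip-llava-${model_size}-custom-ft
```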

mu-cai commented 2 months ago

Hi Xuefei,

Thanks for bringing this up! I just uploaded the 7B stage 2 model: https://huggingface.co/mucai/vip-llava-7b-base
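Note that transformers-style loaders also accept a Hub repo id directly in place of a local path, but if you want to keep scripts/finetune_stage3.sh unmodified, you can fetch the weights into the path it already expects. A sketch (huggingface-cli ships with the huggingface_hub package):

```bash
# Sketch: place the released base model where scripts/finetune_stage3.sh
# looks for the stage-2 checkpoint (./checkpoints/vip-llava-7b-stage2-ft).
huggingface-cli download mucai/vip-llava-7b-base \
    --local-dir ./checkpoints/vip-llava-7b-stage2-ft
```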

Mu

Xuefei98 commented 2 months ago

Hi Mu,

Thank you so much for getting back to me! Is it possible for you to also share the 13B model? I would like to fine-tune both the 7B and 13B models and compare their performance in my experiments.

Xuefei

mu-cai commented 2 months ago

You can now find the 13B base model here: https://huggingface.co/mucai/vip-llava-13b-base
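If you want to fine-tune both sizes for your comparison, something like the following loop should work. This is a sketch that assumes the script can pick up model_size from the environment; otherwise, just edit the variable at the top of scripts/finetune_stage3.sh for each run:

```bash
# Sketch: custom fine-tune over both released base models. The download
# targets match the checkpoint path the stage-3 script expects.
for model_size in 7b 13b; do
    huggingface-cli download mucai/vip-llava-${model_size}-base \
        --local-dir ./checkpoints/vip-llava-${model_size}-stage2-ft
    model_size=${model_size} bash scripts/finetune_stage3.sh
done
```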