BAAI-DCAI / Visual-Instruction-Tuning

SVIT: Scaling up Visual Instruction Tuning

Clarification on LLaVa Finetuning Scheme #7

Closed siddk closed 1 year ago

siddk commented 1 year ago

This is incredible work! We're trying to integrate the SVIT data into our VLM training pipeline, but wanted to clarify some details around the LLaVa finetuning protocol.

From the paper, it seems that:

  • You take the LLaVa LLaMa-2 7B Chat "align" checkpoint (just the trained linear projector), then run two stages of finetuning.

    • In the first stage, you train (full finetuning of LLaMa and the projector) on the complex_reasoning, detail_description and conversation subsets of the SVIT data.
    • In the second stage, you train (full finetuning of LLaMa and the projector) only on the referring_qa subset of the SVIT data.
    • Is that correct?
  • When finetuning LLaVa, do you train for one epoch on the entirety of the SVIT data (much, much bigger than the LLaVa 150K dataset), or do you subsample the dataset and run for some fixed number of gradient steps?

Again, really cool work, but would love this extra detail so we can integrate SVIT properly!

BoyaWu10 commented 1 year ago

Hi @siddk, thanks for your interest in this work!

The initial release of SVIT contains three subsets, namely complex_reasoning, detail_description and conversation. We validate the dataset's effectiveness by replacing LLaVA-Instruct-150K with it: in detail, we finetune the pretrained weights of LLaVA Vicuna-7B 1.0.

After that, we further add the referring_qa subset to help enhance the model's referring abilities. We validate its effectiveness by adding a new finetuning stage on LLaVA: in detail, we finetune the final weights of LLaVA LLaMA-2 7B.

When finetuning LLaVA, we train for one epoch on the training sets of the subsets mentioned above. However, we also notice that equally sampling data from each subset, e.g., 100K or 200K per subset, may give better performance than using the full set. We are still investigating this and trying to build a "core" set.
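
For anyone wiring SVIT into a data pipeline, here is a minimal sketch of the equal-subsampling idea, assuming each subset ships as a LLaVA-style JSON list of instruction records; the file names and the 100K budget below are illustrative, not the actual release paths or a recommended setting:

```python
import json
import random

random.seed(0)

# Illustrative file names; substitute the paths of the released SVIT subsets.
SUBSETS = ["complex_reasoning.json", "detail_description.json", "conversation.json"]
PER_SUBSET = 100_000  # e.g., 100K records sampled from each subset

merged = []
for path in SUBSETS:
    with open(path) as f:
        records = json.load(f)
    # Sample equally from each subset (or keep everything if it is smaller).
    k = min(PER_SUBSET, len(records))
    merged.extend(random.sample(records, k))

random.shuffle(merged)
with open("svit_equal_mix.json", "w") as f:
    json.dump(merged, f)

print(f"Wrote {len(merged)} records")
```

The merged JSON can then be pointed at by LLaVA's finetuning data path in place of LLaVA-Instruct-150K.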

Feel free to point it out if I missed anything!

siddk commented 1 year ago

Thanks so much @BoyaWu10; training on 1M examples must have taken quite a while! Do you have a rough estimate of how long the referring_qa finetuning took (and on what hardware)?

BoyaWu10 commented 1 year ago

I just started finetuning LLaVA LLaMA-7B on the training set of referring_qa to see how long it takes. The estimated time reported by LLaVA is around 6-7 hours on 8 x A100 40G, with LLaVA's default DeepSpeed ZeRO-3 configuration. The training set is 90% of the full set, so finetuning on the full set of referring_qa would take a bit longer than that.
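
As a rough back-of-envelope check, assuming epoch time scales roughly linearly with the number of samples, the 6-7 hour estimate on the 90% training split extrapolates to the full set like this:

```python
# Scale the reported epoch time from the 90% training split to the full
# referring_qa set, assuming time grows linearly with the number of samples.
train_fraction = 0.9
est_hours_train_split = (6.0, 7.0)  # range reported for the training split

full_set_hours = [h / train_fraction for h in est_hours_train_split]
print(f"Full-set estimate: {full_set_hours[0]:.1f}-{full_set_hours[1]:.1f} hours")
# -> roughly 6.7-7.8 hours on the same 8 x A100 40G setup
```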

siddk commented 1 year ago

Awesome; and this is with global batch size 128, and the same learning rate (2e-5)?

FuxiaoLiu commented 1 year ago

Hi @siddk, thanks for your efforts! Have you ever tried finetuning LLaVA on this dataset:

Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning

It has 300K good-quality instruction data. I have previously finetuned minigpt4 and mplug-owl on it, but failed to finetune LLaVA.

siddk commented 1 year ago

Hey @FuxiaoLiu - yes I think the LRV Instruction data is great! We haven't incorporated it into our finetuning pipeline just yet, but happy to share results when we do!

BoyaWu10 commented 1 year ago

> Awesome; and this is with global batch size 128, and the same learning rate (2e-5)?

Yes, both the global batch size and the learning rate are kept at LLaVA's defaults.
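
For reference, the global batch size is just the product of per-device batch size, GPU count, and gradient-accumulation steps. Below is a minimal sketch of one split that reaches 128 on the 8-GPU setup mentioned above; the per-device batch size and accumulation steps here are assumptions, not confirmed settings:

```python
# One way a global batch size of 128 can be reached on 8 GPUs.
# per_device_batch_size and gradient_accumulation_steps are assumptions,
# not the confirmed SVIT/LLaVA settings.
per_device_batch_size = 16
num_gpus = 8
gradient_accumulation_steps = 1

global_batch_size = per_device_batch_size * num_gpus * gradient_accumulation_steps
assert global_batch_size == 128
print(global_batch_size)  # 128
```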

BoyaWu10 commented 1 year ago

Closing the issue for now since there is no further discussion. Feel free to reopen it if there are any other questions.