Closed: siddk closed this issue 1 year ago.
Hi @siddk, thanks for your interest in this work!
The initial release of SVIT contains three subsets, namely `complex_reasoning`, `detail_description` and `conversation`. We validate the dataset's effectiveness by using it to replace LLaVA-Instruct-150K. Specifically, we finetune the pretrained weights of LLaVA Vicuna-7B 1.0.
After that, we further add the `referring_qa` subset to help enhance the model's referring abilities. We validate its effectiveness by adding a new finetuning stage on LLaVA. Specifically, we finetune the final weights of LLaVA LLaMA-2 7B.
When finetuning LLaVA, we train for one epoch on the training sets of the subsets mentioned above. However, we have also noticed that sampling data equally from each subset (e.g., 100K or 200K per subset) may give better performance than using the full set. We are still investigating this and trying to build a "core" set.
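In case it helps, here is a minimal sketch of that equal-per-subset sampling (the file names are placeholders; this assumes each subset is stored as a LLaVA-style JSON list of conversations):

```python
import json
import random

random.seed(0)

# Placeholder file names -- substitute wherever you store the SVIT subsets.
SUBSETS = [
    "svit_complex_reasoning.json",
    "svit_detail_description.json",
    "svit_conversation.json",
]
PER_SUBSET = 200_000  # e.g., 100K or 200K per subset

merged = []
for path in SUBSETS:
    with open(path) as f:
        records = json.load(f)  # a list of conversation dicts
    # Sample the same number from each subset (keep all if the subset is smaller).
    k = min(PER_SUBSET, len(records))
    merged.extend(random.sample(records, k))

random.shuffle(merged)
with open("svit_sampled_mix.json", "w") as f:
    json.dump(merged, f)
```

The resulting JSON can then be passed to LLaVA's finetuning script in place of the full mix.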
Feel free to point it out if I've missed anything!
Thanks so much @BoyaWu10; training on 1M examples feels like it probably took a long time! Do you have a rough estimate of how long it took to run the `referring_qa` finetuning (and on what hardware)?
I started finetuning LLaVA LLaMA-7B with the training set of `referring_qa` just now to see how long it takes. The estimated time reported by LLaVA is around 6-7 hours on 8 x A100 40G, with LLaVA's default deepspeed zero3 configuration. The training set is 90% of the full set, so finetuning on the full set of `referring_qa` would take a bit longer than that.
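For reference, a 90/10 split like this can be produced with a short script (the file names are placeholders; this assumes `referring_qa` is stored as one JSON list):

```python
import json
import random

random.seed(0)

with open("svit_referring_qa.json") as f:  # placeholder file name
    records = json.load(f)

random.shuffle(records)
split = int(0.9 * len(records))  # 90% train / 10% held out

with open("referring_qa_train.json", "w") as f:
    json.dump(records[:split], f)
with open("referring_qa_val.json", "w") as f:
    json.dump(records[split:], f)

# Rough extrapolation: if the 90% split takes ~6-7 hours,
# the full set at the same throughput would take ~1/0.9 of that (~7-8 hours).
```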
Awesome; and this is with global batch size 128, and the same learning rate (2e-5)?
This is incredible work! We're trying to integrate the SVIT data into our VLM training pipeline, but wanted to clarify some details around the LLaVa finetuning protocol.
From the paper, it seems that:
You take the LLaVa LLaMa-2 7B Chat "align" checkpoint (just the trained linear projector), then run two stages of finetuning:
- In the first stage, you train (full finetuning of LLaMa and the projector) on the `complex_reasoning`, `detail_description` and `conversation` subsets of the SVIT data.
- In the second stage, you train (full finetuning of LLaMa and the projector) only on the `referring_qa` subset of the SVIT data.
- Is that correct?
- When finetuning LLaVa, do you train for one epoch on the entirety of the SVIT data (much much bigger than the LLaVa 150K dataset), or do you subsample the dataset and run for some fixed number of gradient steps?
Again, really cool work, but would love this extra detail so we can integrate SVIT properly!
Hi @siddk, thanks for your efforts! Have you ever tried finetuning LLaVA on this dataset:
Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning
It has 300K good-quality instruction examples. I have previously finetuned MiniGPT-4 and mPLUG-Owl on it, but failed to finetune LLaVA.
Hey @FuxiaoLiu - yes I think the LRV Instruction data is great! We haven't incorporated it into our finetuning pipeline just yet, but happy to share results when we do!
> Awesome; and this is with global batch size 128, and the same learning rate (2e-5)?
Yes, the global batch size and learning rate are set to the defaults.
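For concreteness, one way a global batch of 128 comes together on 8 GPUs is sketched below; the per-device batch and accumulation values are assumptions for illustration, not necessarily LLaVA's exact settings:

```python
# Global batch size arithmetic (assumed values, for illustration only):
n_gpus = 8                 # 8 x A100 40G, as reported above
per_device_batch = 16      # assumed per-device batch size
grad_accum_steps = 1       # assumed gradient accumulation
global_batch = n_gpus * per_device_batch * grad_accum_steps
assert global_batch == 128
```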
Closing the issue for now as there is no further discussion. Feel free to reopen it if there are any other questions.