Closed: eddyrivers10 closed this issue 9 months ago
Hi @eddyrivers10, we did not scale up LLaVA-v1.5 further with full SVIT. The main reason is that we think it is also important to balance different types of data. The fine-tuning data of LLaVA-v1.5 contain 665K samples in total. 158K of them are generated by GPT-4 (LLaVA-150K) and serve a similar purpose to SVIT, while the rest consist of a language-only dataset (ShareGPT) and data constructed from traditional captioning/VQA/referring datasets. This distribution brings greater diversity in terms of tasks, formats, etc. If we scaled the dataset up with full SVIT, which contains millions of samples in total, while keeping the rest of the data unchanged, that balance would be broken. A possible solution would be to scale up SVIT and the other datasets at the same time. We are also collecting data from traditional datasets and will post an update if we figure out a better data recipe.
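If you want to experiment with the balance yourself, here is a minimal sketch of how such a mixture could be rebuilt, with SVIT capped so that its share of the mix stays close to LLaVA-150K's. The file names and the 158K cap are placeholders for illustration, not our actual recipe:

```python
import json
import random

random.seed(0)

# Placeholder file paths; substitute the real LLaVA-1.5 mixture components
# and the SVIT release you want to mix in.
SOURCES = {
    "svit": ("svit.json", 158_000),       # cap SVIT at roughly LLaVA-150K's share
    "sharegpt": ("sharegpt.json", None),  # None = keep every sample
    "academic": ("vqa_refer_caption.json", None),
}

mixture = []
for name, (path, cap) in SOURCES.items():
    with open(path) as f:
        samples = json.load(f)  # LLaVA-style list of conversation records
    if cap is not None and len(samples) > cap:
        samples = random.sample(samples, cap)
    print(f"{name}: {len(samples)} samples")
    mixture.extend(samples)

random.shuffle(mixture)
with open("mix_balanced.json", "w") as f:
    json.dump(mixture, f)

print(f"total fine-tuning samples: {len(mixture)}")
```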
The checkpoints are currently available: (full, lora). You can find the loss values in "trainer_state.json".
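Assuming the checkpoint was saved by the HuggingFace Trainer (which LLaVA's training scripts build on), the loss curve can be pulled out of the "log_history" field of that file with something like:

```python
import json

# trainer_state.json is written by the HuggingFace Trainer; "log_history"
# holds one dict per logging step, with "loss" present on training steps.
with open("trainer_state.json") as f:
    state = json.load(f)

losses = [(entry["step"], entry["loss"])
          for entry in state["log_history"]
          if "loss" in entry]

for step, loss in losses[:10]:
    print(f"step {step:>6}: loss {loss:.4f}")
```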
Hi, I saw in the paper that you only scaled up visual instruction tuning for LLaVA-1 and not 1.5. Have you run any tests on v1.5, and what do the results look like? I think this is a more important ablation than testing on LLaVA-1, because LLaVA-1.5 already uses a much larger training dataset than LLaVA-1, so in effect the authors have already tested "scaling up" in the new LLaVA-1.5 paper. It would be very interesting if you could share the results of scaling LLaVA-v1.5 up further with full SVIT.
Do you have the wandb/training loss values too? I am trying to see whether training LLaVA-v1.5 on this works as expected. I have seen some people mention that the loss is higher, but sometimes loss is not the full story. Thanks!
https://github.com/BAAI-DCAI/Visual-Instruction-Tuning/issues/11