Btw, it would be great if you could share the loss curves in the report!
I think the biggest problem with LLaVA-based models is avoiding overfitting, since they only use limited training data to align the vision encoder and the LLM.
In MiniCPM, they can train a series of small models to find the best scaling laws and hyperparameters. But when you are finetuning open-source components, there are no smaller, equivalent components to run such experiments on.
One possible solution is to try a series of hyperparameters and check which performs best in the final evaluation (after finishing 1 epoch of pretraining and 1 epoch of finetuning). But that's quite expensive...
Or you could let the model converge in the pretraining phase (there is no overfitting problem there because the number of trainable parameters is small and the LLM is frozen), and then stop short of convergence in the finetuning phase (you are modifying the LLM there, so overfitting must be avoided carefully)? A minimal sketch of what I mean is below.
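A rough PyTorch sketch of that two-stage setup, assuming hypothetical `vision_encoder`, `projector`, and `llm` modules and illustrative dimensions/learning rates (the real LLaVA/Bunny training code differs):

```python
import torch
import torch.nn as nn

# Hypothetical stand-ins for the real components.
vision_encoder = nn.Linear(3, 1152)    # stand-in for a SigLIP encoder
projector = nn.Linear(1152, 2560)      # small vision-to-language projection
llm = nn.Linear(2560, 2560)            # stand-in for phi-2

# Stage 1 (pretrain): freeze the encoder and the LLM, train only the
# projector. With so few trainable parameters, letting the loss converge
# should not cause much overfitting.
for p in vision_encoder.parameters():
    p.requires_grad = False
for p in llm.parameters():
    p.requires_grad = False
optimizer = torch.optim.AdamW(projector.parameters(), lr=1e-3)

# Stage 2 (finetune): unfreeze the LLM (or attach LoRA adapters) and
# deliberately stop early, e.g. after a single epoch, to avoid overfitting.
for p in llm.parameters():
    p.requires_grad = True
optimizer = torch.optim.AdamW(
    list(projector.parameters()) + list(llm.parameters()), lr=2e-5
)
```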
Sorry for the late reply.
For Q1: by "fully tuning across all combinations of model architectures", we mean finetuning just the projector + phi2. As for the influence of LoRA, it is more of an empirical finding that LoRA yields better performance; you can see similar findings in the Idefics2 tech report.
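One intuition (our reading, not a claim from the report) is that LoRA keeps the base weights frozen and only learns low-rank updates, so the LLM drifts less from its pretrained distribution. A sketch using the `peft` library, with an illustrative rank and target modules rather than the exact Bunny configuration:

```python
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Illustrative only: the exact base checkpoint, rank, and target modules
# used in Bunny may differ.
base = AutoModelForCausalLM.from_pretrained(
    "microsoft/phi-2", torch_dtype=torch.bfloat16
)

lora_config = LoraConfig(
    r=16,                 # adapter rank (hypothetical value)
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "dense"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # only a small fraction is trainable
```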
For Q2: we don't really select a set of hyperparameters that makes the network fully converge. We run some learning-rate experiments and look at the loss curve and the downstream performance. BTW, the loss reported by the trainer is step-wise, so fluctuation around some level is normal given that the model sees every sample only once. Recently, some LLaVA variants run 2 epochs in the SFT stage (Mipha, xtuner-phi3, Prismatic VLMs).
For the loss curve of Bunny, you can find the loss log in trainer_state.json in the related HuggingFace repo of each model.
For example, https://huggingface.co/BAAI/bunny-phi-2-siglip-lora/blob/main/trainer_state.json
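A small sketch for pulling that log and plotting the loss curve; it assumes the standard HF Trainer `log_history` layout inside trainer_state.json and adds a moving average because the per-step loss is noisy:

```python
import json
import matplotlib.pyplot as plt
from huggingface_hub import hf_hub_download

# Download the training log from the repo linked above.
path = hf_hub_download(
    repo_id="BAAI/bunny-phi-2-siglip-lora", filename="trainer_state.json"
)
with open(path) as f:
    state = json.load(f)

# HF Trainer stores per-step metrics in "log_history"; keep the entries
# that contain a training loss.
entries = [e for e in state["log_history"] if "loss" in e]
steps = [e["step"] for e in entries]
losses = [e["loss"] for e in entries]

# Overlay a simple moving average to see the trend through the noise.
window = 50
smoothed = [
    sum(losses[max(0, i - window + 1):i + 1]) / (i - max(0, i - window + 1) + 1)
    for i in range(len(losses))
]

plt.plot(steps, losses, alpha=0.3, label="per-step loss")
plt.plot(steps, smoothed, label=f"moving average (window={window})")
plt.xlabel("step")
plt.ylabel("training loss")
plt.legend()
plt.show()
```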
Closing the issue for now since there's no further discussion. Feel free to reopen it if there are any other questions.
Hello,
Great work! I have several questions:
By " fully tuning across all combinations of model architectures", do you mean finetune the SigLIP encoder + projector + phi2, or just projector + phi2? And why LoRA tuning can alleviate catastrophic forgetting (sorry I am not familiar with this...)? Note that in this paper, using LoRA cannot avoid the model overfitting the finetuning dataset.
2. To avoid overfitting, it seems that researchers train LLaVA for only one epoch (in both the pretraining and finetuning phases). Therefore, the loss curve may not converge to the lowest point.
For example, here are my loss curve and learning rate schedule during the pretraining phase:
[image: pretrain loss curve and LR schedule]
And the loss curve and LR schedule in the finetuning phase:
[image: finetune loss curve and LR schedule]
I guess the network does not converge at all... so how do you determine these hyperparameters? Do you select a set of hyperparameters that makes your network fully converge? Or do you just select the set that gives the best benchmark performance?
Best, StarCycle