Closed by fangqi-Zhu 2 days ago
Hey! I think it might be an issue with the hyperparameter settings. Could you try this set of hyperparameters?
# Freeze the multi modal projection layer
freeze_mm_proj: True
# Freeze the vision tower model
freeze_vision_tower: True
# Freeze the language model
freeze_language_model: False
According to our experiments, the training seems to be running quite well! We will provide a stable version as soon as possible.
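In PyTorch, freeze flags like these typically toggle `requires_grad` on each submodule's parameters. The sketch below illustrates the idea with a toy model; the attribute names (`vision_tower`, `multi_modal_projector`, `language_model`) are assumptions for illustration, and the real names depend on the checkpoint and trainer implementation.

```python
import torch.nn as nn

def set_requires_grad(module: nn.Module, trainable: bool) -> None:
    """Toggle gradient computation for every parameter in a module."""
    for param in module.parameters():
        param.requires_grad = trainable

class ToyMultimodalModel(nn.Module):
    """Toy stand-in for a multimodal model; real submodule names vary."""
    def __init__(self):
        super().__init__()
        self.vision_tower = nn.Linear(8, 8)
        self.multi_modal_projector = nn.Linear(8, 8)
        self.language_model = nn.Linear(8, 8)

model = ToyMultimodalModel()
set_requires_grad(model.vision_tower, trainable=False)           # freeze_vision_tower: True
set_requires_grad(model.multi_modal_projector, trainable=False)  # freeze_mm_proj: True
set_requires_grad(model.language_model, trainable=True)          # freeze_language_model: False

trainable = [n for n, p in model.named_parameters() if p.requires_grad]
print(trainable)  # only language_model parameters remain trainable
```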
Thank you for your response. I successfully achieved nearly 100% accuracy in DPO by freezing the vision model and projector layer. However, I'm still curious whether it is actually necessary to freeze the visual part in MLLM RLHF.
I attempted to freeze the visual part when training the PPO reward model, but it seems to have failed.
deepspeed \
--master_port ${MASTER_PORT} \
--module align_anything.trainers.text_image_to_text.rm \
--model_name_or_path ${MODEL_NAME_OR_PATH} \
--train_datasets ${TRAIN_DATASETS} \
--eval_datasets ${EVAL_DATASETS} \
--output_dir ${OUTPUT_DIR} \
--freeze_mm_proj True \
--freeze_vision_tower True \
--freeze_language_model False \
--train_split train \
--eval_split train \
--train_template RLAIFV \
--eval_template RLAIFV
Is there anything wrong? Thank you very much for your help!
Sorry for the late response! We have spent considerable effort and discovered a set of hyperparameters that can effectively improve the results:
# Freeze the multi modal projection layer
freeze_mm_proj: True
# Freeze the vision tower model
freeze_vision_tower: True
# Freeze the language model
freeze_language_model: False
In fact, due to the complexity of multimodal information, reward models that include visual inputs are often difficult to train. Their accuracy is typically lower than in text-only settings, which the discussion in this issue also confirms.
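For reference, the reward model here is trained with a pairwise (Bradley-Terry style) objective, and "accuracy" is the fraction of preference pairs where the chosen response scores higher than the rejected one. A minimal sketch, with made-up scalar scores purely for illustration:

```python
import torch
import torch.nn.functional as F

def reward_model_loss(chosen_scores, rejected_scores):
    """Pairwise preference loss for a reward model: push the chosen
    response's scalar score above the rejected response's score."""
    loss = -F.logsigmoid(chosen_scores - rejected_scores).mean()
    # Accuracy: fraction of pairs the model ranks correctly.
    accuracy = (chosen_scores > rejected_scores).float().mean()
    return loss, accuracy

chosen = torch.tensor([1.5, 0.2, 0.9])
rejected = torch.tensor([0.5, 0.4, -0.1])
loss, acc = reward_model_loss(chosen, rejected)
print(float(acc))  # two of three pairs ranked correctly
```

With image inputs, the score head sits on top of the full multimodal backbone, so noisy visual features feed directly into this ranking objective, which is one plausible reason visual RM accuracy lags text-only accuracy.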
We will continue to explore better hyperparameter settings, and we welcome any assistance from you and the community!
Due to the lack of response for an extended period, we are temporarily closing this issue. Feel free to reopen it at any time.
Required prerequisites
Questions
Hi everyone,
When I trained LLaVA with the DPO algorithm, I observed an abnormal increase in the loss. Meanwhile, the reward accuracy was slightly above 0.5 after training and appeared to still be rising, the reward margin fluctuated upward, and the better-sample and worse-sample rewards oscillated with the same period. My biggest confusion is why the loss keeps increasing. Is something wrong? I haven't changed any of the training code.
Loss figure
Reward figure
Reward margin figure
Better/Worse sample reward figure
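One thing worth noting about the metrics in the figures above: in DPO, the reward accuracy only depends on the *sign* of the reward margin, while the loss depends on its *magnitude*, so the two can move in different directions. A minimal sketch of the standard DPO objective (implicit rewards are `beta` times the policy-vs-reference log-probability ratios; the specific numbers below are made up for illustration):

```python
import torch
import torch.nn.functional as F

def dpo_metrics(chosen_logratios, rejected_logratios, beta=0.1):
    """DPO loss, mean reward margin, and reward accuracy, computed from
    per-pair log(pi/pi_ref) values for chosen and rejected responses."""
    chosen_rewards = beta * chosen_logratios
    rejected_rewards = beta * rejected_logratios
    margin = chosen_rewards - rejected_rewards
    loss = -F.logsigmoid(margin).mean()          # magnitude-sensitive
    accuracy = (margin > 0).float().mean()       # sign-sensitive only
    return loss, accuracy, margin.mean()

# Three pairs with small positive margins, one with a large negative
# margin: accuracy stays at 0.75 while the outlier inflates the loss.
chosen = torch.tensor([2.0, 1.0, -8.0, 3.0])
rejected = torch.tensor([1.0, 0.0, 4.0, 2.0])
loss, acc, margin = dpo_metrics(chosen, rejected)
print(float(acc))  # 0.75
```

So a rising loss alongside an accuracy above 0.5 can simply mean a minority of pairs are being pushed to large negative margins, though it can also signal a learning-rate or data problem.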
Below is my training script. I am using 8 A100 GPUs for training with a batch size of 4 per GPU (unmodified):
Is there something wrong here, or is it normal for the loss to increase initially? I really appreciate your help.