Full fine-tuning of the vision side is unstable

128Ghe980 commented 5 days ago

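My training proceeds in stages:

1. Freeze the vision side (vision + resampler) and fine-tune only the LLM
2. Freeze the LLM and fine-tune the vision side
3. Fine-tune everything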

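However, the loss curve currently looks very bad. What could be causing this?

The dataset is math-related: the input is a problem statement plus an image, and the output is the key points of the problem.

These are the settings in my bash file: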

per_device_train_batch_size=1
per_device_eval_batch_size=1
gradient_accumulation_steps=1

  --tune_vision False \
  --tune_resampler False \
  --model_name_or_path $MODEL \
  --llm_type $LLM_TYPE \
  --data_path $DATA \
  --remove_unused_columns False \
  --label_names "labels" \
  --prediction_loss_only False \
  --bf16 True \
  --bf16_full_eval False \
  --fp16 False \
  --fp16_full_eval False \
  --tf32 True \
  --do_train True \
  --do_eval False \
  --tune_llm True \
  --model_max_length 10240 \
  --max_slice_nums 9 \
  --num_train_epochs 1 \
  --output_dir $output_dir \
  --logging_dir $output_dir \
  --logging_strategy "steps" \
  --logging_steps 1 \
  --per_device_train_batch_size $per_device_train_batch_size \
  --per_device_eval_batch_size $per_device_eval_batch_size \
  --gradient_accumulation_steps $gradient_accumulation_steps \
  --evaluation_strategy "no" \
  --save_strategy steps \
  --save_steps 2000 \
  --save_total_limit 3 \
  --learning_rate 5e-6 \
  --weight_decay 0.1 \
  --adam_beta2 0.95 \
  --warmup_ratio 0.01 \
  --gradient_checkpointing True \
  --deepspeed $deepspeed_config \
  --report_to "wandb" \
  --run_name $RUN_NAME
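
For reference, a minimal sketch of how the batch settings above combine into the number of samples per optimizer step. The GPU count is not stated in this post, so `NUM_GPUS=8` is only a placeholder assumption. With `--logging_steps 1`, each logged loss value is computed from just this many samples, so a small value can by itself make the curve look noisy.

```shell
#!/usr/bin/env bash
# Values copied from the settings above.
per_device_train_batch_size=1
gradient_accumulation_steps=1

# ASSUMPTION: the post does not state the GPU count; 8 is only a placeholder.
NUM_GPUS=8

# Samples that contribute to one optimizer step (and, with logging_steps=1,
# to one logged loss value).
effective_batch_size=$((per_device_train_batch_size * gradient_accumulation_steps * NUM_GPUS))
echo "samples per optimizer step: $effective_batch_size"
```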

Between stages I change settings such as --tune_llm; everything else stays the same.
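
The per-stage flag changes can be sketched roughly as follows. `stage_flags` and `STAGE` are hypothetical names introduced for illustration, not part of the actual script. The sketch also assumes that `--tune_resampler` always follows `--tune_vision`, which the post only states explicitly for the first stage.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the freeze/tune flags for a given stage (1, 2, or 3).
# ASSUMPTION: --tune_resampler mirrors --tune_vision in every stage; the post
# only says so explicitly for stage 1.
stage_flags() {
  case "$1" in
    1) echo "--tune_vision False --tune_resampler False --tune_llm True" ;;  # LLM only
    2) echo "--tune_vision True --tune_resampler True --tune_llm False" ;;   # vision side only
    3) echo "--tune_vision True --tune_resampler True --tune_llm True" ;;    # everything
    *) echo "unknown stage: $1" >&2; return 1 ;;
  esac
}

# Pick the stage from the environment, defaulting to stage 1.
STAGE=${STAGE:-1}
flags=$(stage_flags "$STAGE")
echo "stage $STAGE: $flags"
```

The printed flags would replace the three `--tune_*` lines in the settings above, with every other argument left unchanged.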

The loss curves are shown below: