Coobiw / MPP-LLaVA

Personal Project: MPP-Qwen14B & MPP-Qwen-Next (Multimodal Pipeline Parallel based on Qwen-LM). Supports [video/image/multi-image] {SFT/conversations}. Don't let poverty limit your imagination! Train your own 8B/14B LLaVA-training-like MLLM on RTX 3090/4090 24GB.

Poor training results, any advice would be appreciated #36

Closed thunder95 closed 1 month ago

thunder95 commented 1 month ago

After training on 8×4090 GPUs, the model's outputs are mostly blank, sometimes completely blank.

Was the SFT stage trained for only one epoch? Because GPU memory OOMs easily, I precomputed everything before the projection layer in advance, which means there is no image augmentation (a sketch of this caching setup is appended below). Does this hurt model quality much? Given results like these, what would you suggest to improve training?

Thanks a lot!
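For illustration, a minimal sketch of the feature-caching setup described in the question, assuming the vision side (everything before the projection layer) stays frozen; `vision_pipe` and the output path are placeholder names, not the repo's actual API:

```python
import torch

@torch.no_grad()
def cache_visual_features(vision_pipe, frames: torch.Tensor, out_path: str) -> None:
    """Run the frozen vision pipeline once and save its pre-projection output."""
    vision_pipe.eval()
    feats = vision_pipe(frames)       # features that would later feed the projection layer
    torch.save(feats.cpu(), out_path)

# Trade-off raised above: the cached features are fixed, so any image/video
# augmentation is effectively disabled and every epoch sees identical visual
# inputs for a given sample.
```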

thunder95 commented 1 month ago

```
[2024-09-24 21:50:37,200] [INFO] [logging.py:96:log_dist] [Rank 0] step=1990, skipped=0, lr=[1.8428787835578222e-09], mom=[[0.9, 0.99]]
[2024-09-24 21:50:37,243] [INFO] [engine.py:407:train_batch] steps: 1990 loss: 1.1562 iter time (s): 16.314 samples/sec: 7.846
steps: 1990 loss: 1.1562 iter time (s): 16.314 samples/sec: 7.846
Totalsteps: 1999, step = 1989, time = 16.309285879135132, loss = 1.15625, lr=1.5230484360873043e-09
Step= 1989, lr=1.5230484360873043e-09, loss=1.1625, 16.31 it/s
Totalsteps: 1999, step = 1990, time = 16.31498956680298, loss = 1.171875, lr=1.233675183394123e-09
Totalsteps: 1999, step = 1991, time = 16.314311742782593, loss = 1.1484375, lr=9.74759906957612e-10
Totalsteps: 1999, step = 1992, time = 16.31598472595215, loss = 1.21875, lr=7.463033954802079e-10
Totalsteps: 1999, step = 1993, time = 16.30734133720398, loss = 1.140625, lr=5.483063448785686e-10
Totalsteps: 1999, step = 1994, time = 16.314671516418457, loss = 1.1640625, lr=3.807693582869032e-10
Totalsteps: 1999, step = 1995, time = 16.314774751663208, loss = 1.234375, lr=2.436929460525317e-10
Totalsteps: 1999, step = 1996, time = 16.314181566238403, loss = 1.1953125, lr=1.3707752573255406e-10
Totalsteps: 1999, step = 1997, time = 16.312156677246094, loss = 1.140625, lr=6.092342209607083e-11
Totalsteps: 1999, step = 1998, time = 18.596952199935913, loss = 1.09375, lr=1.5230867123072757e-11
Saving at step 1998
```

thunder95 commented 1 month ago

Before the llm_proj projection layer, the tensors from training and inference do match. Could you share the pretrain-stage weights?

Coobiw commented 1 month ago

The pretrain-stage weights are already open; you can grab them from the README. Blank outputs really shouldn't happen. You can first sample a small number of examples from the train set, train for more epochs (e.g. 5), and then evaluate on that small train subset to see whether the model can overfit it. If it can, the pipeline is basically fine and you can tune other hyperparameters, e.g. train for 2-3 epochs or adjust the lr.
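For reference, a minimal sketch of this overfitting sanity check; the `train_dataset` argument and the `collater` attribute are assumptions standing in for the project's own dataset API:

```python
import random
from torch.utils.data import DataLoader, Subset

def make_tiny_loader(train_dataset, k: int = 64, batch_size: int = 4) -> DataLoader:
    """Build a loader over a small random subset for an overfitting sanity check."""
    random.seed(0)
    indices = random.sample(range(len(train_dataset)), k=k)
    return DataLoader(
        Subset(train_dataset, indices),
        batch_size=batch_size,
        shuffle=True,
        collate_fn=getattr(train_dataset, "collater", None),
    )

# Train ~5 epochs on this loader only, then generate on the same samples.
# If the loss cannot approach zero, suspect the data pipeline or the chat
# template rather than the hyperparameters.
```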

thunder95 commented 1 month ago

@Coobiw Following your suggestion, I trained for 5 epochs and tested on samples from the train set; the output is still empty. During generate, the first token is always <|im_start|>. Most likely something is wrong with my training data?

Coobiw commented 1 month ago

In that case it's basically one of two problems:

  1. You're not using the chat template and are calling generate directly (refer to cli_demo.py to adapt your generation code; a sketch of the expected prompt format follows after this list)
  2. Your data construction is wrong, i.e. the format is incorrect
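For reference, a minimal sketch of the ChatML-style prompt that Qwen-Chat models expect; the authoritative version is whatever cli_demo.py builds, and the `<ImageHere>` placeholder is an assumption:

```python
def build_chatml_prompt(user_text: str, system: str = "You are a helpful assistant.") -> str:
    """Wrap a user turn in the ChatML template used by Qwen-Chat models."""
    return (
        f"<|im_start|>system\n{system}<|im_end|>\n"
        f"<|im_start|>user\n{user_text}<|im_end|>\n"
        f"<|im_start|>assistant\n"
    )

# Example: image placeholder plus question, so the model continues from the
# assistant header instead of emitting template tokens itself.
prompt = build_chatml_prompt("<ImageHere>Please describe this image.")
```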

thunder95 commented 1 month ago

@Coobiw I'm using the webui demo, so the chat template should be fine. Hard-coding the training data and running it through the official model works, but my own model always replies "QA" (no matter what the question or image is); the token id is 47522. That token doesn't appear anywhere in the training data either. I checked the data format and found nothing wrong. All I actually did was precompute the visionpipe outputs, save them, and substitute them in directly.
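For debugging, a quick sketch of two checks on the repeated token; the tokenizer path is illustrative, so substitute whichever Qwen tokenizer the repo actually loads:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-14B-Chat", trust_remote_code=True)

# 1) Inspect what token id 47522 actually decodes to.
print(repr(tokenizer.decode([47522])))

# 2) Confirm the id never appears in a tokenized training answer.
sample_answer = "The video shows a man cooking in a kitchen."
print(47522 in tokenizer(sample_answer).input_ids)
```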

thunder95 commented 1 month ago

I only used the videochatgpt data, trained for 5 epochs, and the loss dropped to 0.05 (strangely low).

Coobiw commented 1 month ago

With 5 epochs of training, a low loss is quite normal. You can look at the wandb loss curve; the drop at each epoch should be very noticeable. And let me ask: roughly what are your frame count and seq_length settings?

Also, when you do continue training with videochatgpt, do you still hit the problem above at inference time?

thunder95 commented 1 month ago

Thanks for the patient guidance. By frame count do you mean the number of sampled frames? Those are all at the default of 32. seq_length refers to the max_txt_len parameter, right? It's 1536, unchanged. For continue train, do you mean switching the finetuned model to unfreeze_llm_model.pth? I'll give that a try first. @Coobiw
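As a rough back-of-the-envelope check of those settings: the 32-query-tokens-per-frame figure and the assumption that max_txt_len caps the full multimodal sequence are both guesses about the repo's setup, not confirmed values.

```python
num_frames = 32                        # default sampled frames mentioned above
query_tokens_per_frame = 32            # assumption: typical Q-Former query count
visual_tokens = num_frames * query_tokens_per_frame   # 1024 visual tokens
max_txt_len = 1536                     # the unchanged max_txt_len setting
print(f"visual tokens ~= {visual_tokens}, "
      f"remaining budget for text ~= {max_txt_len - visual_tokens}")
```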