Strange Loss Trend - Githubissues

InternLM / xtuner

An efficient, flexible and full-featured toolkit for fine-tuning LLM (InternLM2, Llama3, Phi3, Qwen, Mistral, ...)

https://xtuner.readthedocs.io/zh-cn/latest/

Apache License 2.0

3.94k stars, 308 forks

Status: Closed. Echo0125 closed this issue 3 months ago

Echo0125 commented 4 months ago

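I reproduced LLaVA-NeXT Video in xtuner and trained it on a mixed image-video dataset. With the same data, the loss in the official LLaVA framework decreases very smoothly, but in xtuner it shows the trend below:

I set different modality_length values for image and video, so each batch should be mixed. Why does this happen?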

hhaAndroid commented 4 months ago

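Could you share the code, or chat with us on WeChat? We can analyze it together, and if possible the result could later be merged into the xtuner main branch.

One point needs to be confirmed first, though: 1. the loss printed by mmengine is the average over 10 steps; 2. the loss reported by official LLaVA is a global moving average, so it is entirely normal for the official LLaVA loss to look very smooth. Once code bugs are ruled out, the thing to compare is model performance. The two loss curves are computed differently, so they cannot be compared fairly.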
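To illustrate the point about the two logging styles, here is a minimal sketch. It assumes synthetic per-step losses (a slow decay plus Gaussian noise) and does not call mmengine or LLaVA. The helper names `windowed_means`, `running_means`, and `roughness` are hypothetical and exist only for this example. It reports the same loss sequence once as a 10-step window average and once as a running average over all steps so far.

```python
import random
import statistics


def windowed_means(losses, window=10):
    """Mean of each consecutive, non-overlapping block of `window` steps."""
    blocks = [losses[i:i + window] for i in range(0, len(losses), window)]
    return [sum(block) / len(block) for block in blocks]


def running_means(losses, window=10):
    """Mean of all steps seen so far, sampled once every `window` steps."""
    out, total = [], 0.0
    for step, loss in enumerate(losses, start=1):
        total += loss
        if step % window == 0:
            out.append(total / step)
    return out


def roughness(curve):
    """Mean absolute change between consecutive logged points."""
    return statistics.fmean(abs(b - a) for a, b in zip(curve, curve[1:]))


random.seed(0)
# Synthetic per-step losses (an assumption for illustration, not real training data).
losses = [2.0 * (0.999 ** step) + random.gauss(0.0, 0.3) for step in range(1000)]

windowed = windowed_means(losses)
running = running_means(losses)
print(f"10-step window average roughness: {roughness(windowed):.4f}")
print(f"global running average roughness: {roughness(running):.4f}")
```

With identical underlying losses, the running average changes far less between logged points than the window average, so a jagged curve under one logging scheme and a smooth curve under the other does not by itself indicate a training difference.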

Echo0125 commented 4 months ago

OK, I'll check the performance once the model finishes training~

Echo0125 commented 3 months ago

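Model performance is on par with the original, so there should be no impact~ But I've run into a new problem: training llama-3b on the same data, with batch_size * accumulative_counts kept at 16, produces NaN. How should I go about debugging this?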

Echo0125 commented 3 months ago

Done. It looks like this is a zero2 (ZeRO-2) bug.