InternLM / xtuner

An efficient, flexible and full-featured toolkit for fine-tuning LLM (InternLM2, Llama3, Phi3, Qwen, Mistral, ...)
https://xtuner.readthedocs.io/zh-cn/latest/
Apache License 2.0

RuntimeError: Rank 2 successfully reached monitoredBarrier, but received errors while waiting for send/recv from rank 0. Please check rank 0 logs for faulty rank. #457

Open lesjie-wen opened 5 months ago

lesjie-wen commented 5 months ago

I hit this error when fine-tuning llava-internlm2 on 8 GPUs.

[Two images attached]

Environment

8x L40 GPUs, InternLM2 1.8B

LZHgrla commented 5 months ago

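@lesjie-wen, Hi!

Judging from the log, there are two things you can try:

  1. Check whether data processing took longer than 30 minutes (the log suggests it took only about 15 minutes, but it is still worth checking). By default, xtuner forcibly exits once data processing exceeds 30 minutes, to guard against certain unknown errors. You can change this timeout, given in minutes, by setting the environment variable XTUNER_DATASET_TIMEOUT, for example: XTUNER_DATASET_TIMEOUT=120 xtuner train xxx
  2. If case 1 does not apply, the likely cause is an out-of-memory (OOM) condition in host memory during data processing. Try monitoring how memory usage changes during the data-processing stage.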
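
One possible way to do the memory monitoring from point 2 is a small polling script run in a second terminal while the dataset is being processed. This is only a sketch, not part of xtuner: it assumes a Linux host where /proc/meminfo is readable, and the function names (parse_meminfo, available_gib, monitor) are made up for illustration.

```python
import time


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field name: integer value}.

    Most fields are reported in KiB; a few (e.g. HugePages_Total) are counts.
    """
    info = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # skip lines without a "name: value" shape
        parts = rest.split()
        if parts and parts[0].isdigit():
            info[name.strip()] = int(parts[0])
    return info


def available_gib(info):
    """Return MemAvailable converted from KiB to GiB, or None if missing."""
    kib = info.get("MemAvailable")
    return None if kib is None else kib / (1024 ** 2)


def monitor(interval_s=5.0, samples=3, path="/proc/meminfo"):
    """Print available host memory `samples` times, `interval_s` apart.

    Assumption: Linux only, since it reads /proc/meminfo.
    """
    for _ in range(samples):
        with open(path) as f:
            gib = available_gib(parse_meminfo(f.read()))
        stamp = time.strftime("%H:%M:%S")
        if gib is None:
            print(f"{stamp} MemAvailable: n/a")
        else:
            print(f"{stamp} MemAvailable: {gib:.1f} GiB")
        time.sleep(interval_s)
```

Calling monitor(interval_s=10, samples=360) would log roughly an hour of readings. If available memory falls steadily toward zero just before the ranks fail, that would be consistent with the OOM explanation.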
lesjie-wen commented 5 months ago

Thank you very much for your reply. I will check both of these cases.