InternLM / xtuner

An efficient, flexible and full-featured toolkit for fine-tuning LLMs (InternLM2, Llama3, Phi3, Qwen, Mistral, ...)
https://xtuner.readthedocs.io/zh-cn/latest/
Apache License 2.0

Training hangs when four GPUs are selected #934

Open AlittlePIE opened 2 months ago

AlittlePIE commented 2 months ago

Training on two GPUs works fine, but with more than two GPUs the model only gets loaded twice, then everything hangs and training never starts.
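(A common way to narrow down this kind of multi-GPU hang, before involving xtuner itself, is to rerun with NCCL logging enabled and peer-to-peer transport disabled; hangs that appear only above two GPUs are often NCCL P2P/topology issues. The command below is a hedged debugging sketch, not from this thread: `NCCL_DEBUG` and `NCCL_P2P_DISABLE` are standard NCCL environment variables, `NPROC_PER_NODE` is the launcher variable already used in this thread, and `<your_config>.py` is a placeholder for your own config.)

```bash
# Hedged sketch: surface NCCL activity and disable peer-to-peer transport,
# a frequent workaround when training hangs only on >2 GPUs.
NCCL_DEBUG=INFO NCCL_P2P_DISABLE=1 \
NPROC_PER_NODE=4 xtuner train <your_config>.py --deepspeed deepspeed_zero2
```

If the run proceeds with `NCCL_P2P_DISABLE=1`, the hang likely comes from the GPU interconnect rather than from xtuner.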

123yxh commented 2 months ago

Training on two GPUs works fine, but with more than two GPUs the model only gets loaded twice, then everything hangs and training never starts.

Hi, I'd like to ask why training on 2 GPUs errors out for me. My command is `NPROC_PER_NODE=2 xtuner train test_myllama_train.py --deepspeed deepspeed_zero2`; it reports torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
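(Note that `ChildFailedError` is only the generic wrapper torchelastic raises when a worker process dies; the real cause is in the child's traceback earlier in the log. A hedged sketch for rerunning with more verbose logging, using the same command from this comment plus the standard `NCCL_DEBUG` and `TORCH_DISTRIBUTED_DEBUG` variables:)

```bash
# Hedged sketch: rerun with extra distributed/NCCL logging so the child
# process's actual traceback (the real error) is easier to spot in the log.
NCCL_DEBUG=INFO TORCH_DISTRIBUTED_DEBUG=DETAIL \
NPROC_PER_NODE=2 xtuner train test_myllama_train.py --deepspeed deepspeed_zero2
```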