InternLM / xtuner

An efficient, flexible and full-featured toolkit for fine-tuning LLM (InternLM2, Llama3, Phi3, Qwen, Mistral, ...)
https://xtuner.readthedocs.io/zh-cn/latest/
Apache License 2.0

Training hangs when four GPUs are selected #934

Open · AlittlePIE opened this issue 1 week ago

AlittlePIE commented 1 week ago

Training with two GPUs works fine, but with more than two GPUs the model is only loaded twice and the run then hangs without ever starting training.
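A hang that only appears above two GPUs is often a NCCL communication or GPU-topology problem rather than an xtuner bug. One way to narrow it down (my own suggestion, not from this thread) is a plain `torch.distributed` sanity check that bypasses xtuner entirely; the filename and diagnostic steps below are assumptions. Launching with `NCCL_DEBUG=INFO` set can additionally show where initialization stalls.

```python
# nccl_sanity_check.py -- minimal sketch, independent of xtuner:
# run one NCCL all_reduce across 4 ranks to see whether collective
# communication itself hangs.
# Launch with: torchrun --nproc_per_node=4 nccl_sanity_check.py
import os

import torch
import torch.distributed as dist


def main():
    # torchrun sets RANK, WORLD_SIZE, LOCAL_RANK, MASTER_ADDR/PORT,
    # so the default env:// initialization works out of the box.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # A single all_reduce; if this also hangs at 4 ranks, the problem
    # is below xtuner (driver, NCCL version, or P2P topology).
    t = torch.ones(1, device=f"cuda:{local_rank}")
    dist.all_reduce(t)
    print(f"rank {dist.get_rank()}: all_reduce ok, value={t.item()}")

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

If this script hangs as well, a commonly tried workaround is relaunching with `NCCL_P2P_DISABLE=1` to rule out peer-to-peer transport issues.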

123yxh commented 5 days ago

> Training with two GPUs works fine, but with more than two GPUs the model is only loaded twice and the run then hangs without ever starting training.

Hi, may I ask what is going wrong when I train with 2 GPUs? My command is NPROC_PER_NODE=2 xtuner train test_myllama_train.py --deepspeed deepspeed_zero2, and it fails with torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
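ChildFailedError is only the wrapper the elastic launcher raises when a worker process dies; the real traceback is printed above it in the output. If it has scrolled away, one option (my own sketch, not an xtuner feature) is to launch through torchrun directly with per-rank log capture. The `-m xtuner.tools.train` entry point and the exact flag spellings are assumptions that may vary with your torch and xtuner versions.

```python
# capture_rank_logs.py -- sketch: relaunch the same training via torchrun,
# teeing every rank's stdout/stderr to files so the traceback that
# ChildFailedError wraps is preserved.
import os
import subprocess

# torchrun needs the base log directory to exist in some versions.
os.makedirs("torchrun_logs", exist_ok=True)

cmd = [
    "torchrun",
    "--nproc_per_node=2",
    "--log_dir=./torchrun_logs",  # one subdirectory per rank
    "--tee=3",                    # tee both stdout and stderr of every rank
    "-m", "xtuner.tools.train",   # assumption: module behind `xtuner train`
    "test_myllama_train.py",
    "--deepspeed", "deepspeed_zero2",
]
subprocess.run(cmd, check=True)
```

Once the per-rank logs show the underlying exception (CUDA out of memory, a config error, a DeepSpeed version mismatch, etc.), that is the error worth reporting here rather than the ChildFailedError itself.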