InternLM / xtuner

An efficient, flexible and full-featured toolkit for fine-tuning LLMs (InternLM2, Llama3, Phi3, Qwen, Mistral, ...)
https://xtuner.readthedocs.io/zh-cn/latest/
Apache License 2.0

Training hangs when four GPUs are selected #934

Open AlittlePIE opened 2 months ago

AlittlePIE commented 2 months ago

Training on two GPUs works fine, but with more than two GPUs the model only gets loaded twice, then everything hangs and training never starts.
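(A common way to narrow down this kind of multi-GPU hang, before involving xtuner itself, is to rerun with NCCL logging enabled and peer-to-peer transport disabled; hangs that appear only above two GPUs are often NCCL P2P/topology issues. The command below is a hedged debugging sketch, not from this thread: `NCCL_DEBUG` and `NCCL_P2P_DISABLE` are standard NCCL environment variables, `NPROC_PER_NODE` is the launcher variable already used in this thread, and `<your_config>.py` is a placeholder for your own config.)

```bash
# Hedged sketch: surface NCCL activity and disable peer-to-peer transport,
# a frequent workaround when training hangs only on >2 GPUs.
NCCL_DEBUG=INFO NCCL_P2P_DISABLE=1 \
NPROC_PER_NODE=4 xtuner train <your_config>.py --deepspeed deepspeed_zero2
```

If the run proceeds with `NCCL_P2P_DISABLE=1`, the hang likely comes from the GPU interconnect rather than from xtuner.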

123yxh commented 2 months ago

Training on two GPUs works fine, but with more than two GPUs the model only gets loaded twice, then everything hangs and training never starts.

Hi, I'd like to ask why training on 2 GPUs errors out for me. My command is `NPROC_PER_NODE=2 xtuner train test_myllama_train.py --deepspeed deepspeed_zero2`; it reports torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
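(Note that `ChildFailedError` is only the generic wrapper torchelastic raises when a worker process dies; the real cause is in the child's traceback earlier in the log. A hedged sketch for rerunning with more verbose logging, using the same command from this comment plus the standard `NCCL_DEBUG` and `TORCH_DISTRIBUTED_DEBUG` variables:)

```bash
# Hedged sketch: rerun with extra distributed/NCCL logging so the child
# process's actual traceback (the real error) is easier to spot in the log.
NCCL_DEBUG=INFO TORCH_DISTRIBUTED_DEBUG=DETAIL \
NPROC_PER_NODE=2 xtuner train test_myllama_train.py --deepspeed deepspeed_zero2
```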