InternLM / xtuner

An efficient, flexible and full-featured toolkit for fine-tuning LLM (InternLM2, Llama3, Phi3, Qwen, Mistral, ...)
https://xtuner.readthedocs.io/zh-cn/latest/
Apache License 2.0

Use the better-maintained mpi4py dependency #804

Open · 91he opened this issue 3 months ago

91he commented 3 months ago

As described in https://github.com/mpi4py/mpi4py/issues/463 and https://github.com/mpi4py/mpi4py/issues/508, mpi4py-mpich is not well maintained (and does not support the Linux aarch64 architecture). I suggest switching this dependency to mpi4py.
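For anyone auditing their own environment, here is a minimal sketch for confirming which mpi4py distribution is installed and which MPI library it is linked against (Get_library_version is part of mpi4py's public API; the two package names probed are simply the ones discussed in this issue). Note that in the broken setup described below, the import itself crashes:

import importlib.metadata

for dist in ("mpi4py", "mpi4py-mpich"):
    try:
        print(f"{dist}: {importlib.metadata.version(dist)}")
    except importlib.metadata.PackageNotFoundError:
        print(f"{dist}: not installed")

from mpi4py import MPI  # in the failing environment, the crash happens here
print(MPI.Get_library_version())  # e.g. "Open MPI v4.x ..." or "MPICH ..."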

idontlikelongname commented 2 months ago

Seconded. I have already hit problems with mpi4py-mpich: launching a task without a distributed launcher crashes.

python ./xtuner/tools/train.py llava_v15_7b_pretrain deepspeed_zero2 --seed 42

Error message:

08/29 09:56:45 - mmengine - INFO - Dispatch LlamaFlashAttention2 forward. Due to the implementation of the PyTorch version of flash attention, even when the `output_attentions` flag is set to True, it is not possible to return the `attn_weights`.
You are using an old version of the checkpointing format that is deprecated (We will also silently ignore `gradient_checkpointing_kwargs` in case you passed it).Please update to the new format on your modeling file. To use the new format, you need to completely remove the definition of the method `_set_gradient_checkpointing` in your model.
[2024-08-29 09:57:01,183] [INFO] [logging.py:96:log_dist] [Rank -1] DeepSpeed info: version=0.15.0, git-hash=unknown, git-branch=unknown
[2024-08-29 10:10:07,984] [INFO] [comm.py:652:init_distributed] cdb=None
[2024-08-29 11:22:46,264] [INFO] [comm.py:667:init_distributed] Not using the DeepSpeed or dist launchers, attempting to detect MPI environment...
[relation-defined-deeply-d5z-shaun-yang1-master-0:755096:0:755096] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x54000009)
BFD: DWARF error: section .debug_info is larger than its filesize! (0x5d4f30 vs 0x4369b8)
BFD: DWARF error: section .debug_info is larger than its filesize! (0x5d4f30 vs 0x4369b8)
==== backtrace (tid: 755096) ====
 0 0x0000000000042520 __sigaction()  ???:0
 1 0x000000000006ce20 PMPI_Comm_set_errhandler()  /build-result/src/hpcx-v2.16-gcc-inbox-ubuntu22.04-cuda12-gdrcopy2-nccl2.18-x86_64/ompi-db10576f403e833fdf7cd0d938e66b8393b20680/ompi/mpi/c/profile/pcomm_set_errhandler.c:81
 2 0x000000000006ce20 opal_atomic_add_fetch_32()  /build-result/src/hpcx-v2.16-gcc-inbox-ubuntu22.04-cuda12-gdrcopy2-nccl2.18-x86_64/ompi-db10576f403e833fdf7cd0d938e66b8393b20680/ompi/mpi/c/profile/../../../../opal/include/opal/sys/atomic_impl.h:384
 3 0x000000000006ce20 opal_thread_add_fetch_32()  /build-result/src/hpcx-v2.16-gcc-inbox-ubuntu22.04-cuda12-gdrcopy2-nccl2.18-x86_64/ompi-db10576f403e833fdf7cd0d938e66b8393b20680/ompi/mpi/c/profile/../../../../opal/threads/thread_usage.h:152
 4 0x000000000006ce20 opal_obj_update()  /build-result/src/hpcx-v2.16-gcc-inbox-ubuntu22.04-cuda12-gdrcopy2-nccl2.18-x86_64/ompi-db10576f403e833fdf7cd0d938e66b8393b20680/ompi/mpi/c/profile/../../../../opal/class/opal_object.h:534
 5 0x000000000006ce20 PMPI_Comm_set_errhandler()  /build-result/src/hpcx-v2.16-gcc-inbox-ubuntu22.04-cuda12-gdrcopy2-nccl2.18-x86_64/ompi-db10576f403e833fdf7cd0d938e66b8393b20680/ompi/mpi/c/profile/pcomm_set_errhandler.c:70
 6 0x000000000002fcb9 __pyx_f_6mpi4py_3MPI_comm_set_eh()  /mpi4py/src/mpi4py.MPI.c:40330
 7 0x000000000002fcb9 __pyx_f_6mpi4py_3MPI_initialize()  /mpi4py/src/mpi4py.MPI.c:8406
 8 0x000000000002fcb9 __pyx_f_6mpi4py_3MPI_initialize()  /mpi4py/src/mpi4py.MPI.c:8378
 9 0x000000000002fcb9 __pyx_pymod_exec_MPI()  /mpi4py/src/mpi4py.MPI.c:176985
10 0x00000000002371df PyModule_ExecDef()  ???:0
11 0x0000000000237ab0 PyInit__thread()  ???:0
12 0x000000000015a6e4 PyObject_GenericGetAttr()  ???:0
13 0x0000000000145a9d _PyEval_EvalFrameDefault()  ???:0
14 0x000000000015b59c _PyFunction_Vectorcall()  ???:0
15 0x0000000000204612 _PyLong_Format()  ???:0
16 0x000000000014abf1 _PyEval_EvalFrameDefault()  ???:0
17 0x000000000015b59c _PyFunction_Vectorcall()  ???:0
18 0x0000000000204612 _PyLong_Format()  ???:0
19 0x000000000014a566 _PyEval_EvalFrameDefault()  ???:0
20 0x000000000015b59c _PyFunction_Vectorcall()  ???:0
21 0x0000000000204612 _PyLong_Format()  ???:0
22 0x000000000014a4b4 _PyEval_EvalFrameDefault()  ???:0
23 0x000000000015b59c _PyFunction_Vectorcall()  ???:0
24 0x0000000000204612 _PyLong_Format()  ???:0
25 0x000000000014a4b4 _PyEval_EvalFrameDefault()  ???:0
26 0x000000000015b59c _PyFunction_Vectorcall()  ???:0
27 0x000000000015a9b4 PyObject_CallFunctionObjArgs()  ???:0
28 0x000000000023b1bf _PyObject_CallMethodIdObjArgs()  ???:0
29 0x000000000016f113 PyImport_ImportModuleLevelObject()  ???:0
30 0x000000000017f428 PyImport_Import()  ???:0
31 0x000000000015ac9e PyObject_CallFunctionObjArgs()  ???:0
32 0x0000000000169d4b PyObject_Call()  ???:0
33 0x0000000000145a9d _PyEval_EvalFrameDefault()  ???:0
34 0x000000000015b59c _PyFunction_Vectorcall()  ???:0
35 0x0000000000204612 _PyLong_Format()  ???:0
36 0x000000000014a4b4 _PyEval_EvalFrameDefault()  ???:0
37 0x000000000015b59c _PyFunction_Vectorcall()  ???:0
38 0x000000000015a9b4 PyObject_CallFunctionObjArgs()  ???:0
39 0x000000000023b1bf _PyObject_CallMethodIdObjArgs()  ???:0
40 0x000000000016fba1 PyImport_ImportModuleLevelObject()  ???:0
41 0x000000000014635b _PyEval_EvalFrameDefault()  ???:0
42 0x000000000015b59c _PyFunction_Vectorcall()  ???:0
43 0x0000000000204612 _PyLong_Format()  ???:0
44 0x000000000014aba2 _PyEval_EvalFrameDefault()  ???:0
45 0x000000000015b59c _PyFunction_Vectorcall()  ???:0
46 0x0000000000204612 _PyLong_Format()  ???:0
47 0x000000000014aba2 _PyEval_EvalFrameDefault()  ???:0
48 0x000000000015b59c _PyFunction_Vectorcall()  ???:0
49 0x0000000000204612 _PyLong_Format()  ???:0
50 0x000000000014aba2 _PyEval_EvalFrameDefault()  ???:0
51 0x0000000000169111 PyMethod_New()  ???:0
52 0x0000000000204612 _PyLong_Format()  ???:0
53 0x000000000014abf1 _PyEval_EvalFrameDefault()  ???:0
54 0x000000000015b59c _PyFunction_Vectorcall()  ???:0
55 0x0000000000204612 _PyLong_Format()  ???:0
56 0x000000000014a566 _PyEval_EvalFrameDefault()  ???:0

The crash has been traced to this import in deepspeed.comm.comm.mpi_discovery:

from mpi4py import MPI
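For context, here is a condensed, paraphrased sketch of what mpi_discovery does (based on reading DeepSpeed's comm module; names are approximate, not a verbatim copy). The import alone is enough to trigger the segfault, since mpi4py initializes MPI while its extension module loads:

import os
import socket

from mpi4py import MPI  # MPI is initialized here; this is where the segfault occurs

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
world_size = comm.Get_size()

# Rank 0 resolves its own address and broadcasts it as the rendezvous master.
master_addr = socket.gethostbyname(socket.gethostname()) if rank == 0 else None
master_addr = comm.bcast(master_addr, root=0)

# torch.distributed is then bootstrapped from these environment variables.
os.environ["MASTER_ADDR"] = master_addr
os.environ["RANK"] = str(rank)
os.environ["WORLD_SIZE"] = str(world_size)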

After replacing mpi4py-mpich with mpi4py, the problem no longer occurs.
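A stopgap that sidesteps the MPI path entirely: the "attempting to detect MPI environment" log line above suggests DeepSpeed only falls back to mpi_discovery when the usual torch.distributed variables are absent, so pre-setting them for a single-process run should avoid importing mpi4py at all. An untested sketch, with the variable set inferred from the log and DeepSpeed's documentation (the same variables could equally be exported in the shell before running xtuner/tools/train.py):

import os

# Pre-populate the variables DeepSpeed's init_distributed looks for so it
# does not attempt MPI discovery in a single-process run.
os.environ.setdefault("RANK", "0")
os.environ.setdefault("LOCAL_RANK", "0")
os.environ.setdefault("WORLD_SIZE", "1")
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")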