Seconded. I have already hit this problem with mpi4py-mpich: launching a task without a distributed launcher crashes.
python ./xtuner/tools/train.py llava_v15_7b_pretrain deepspeed_zero2 --seed 42
Error message:
08/29 09:56:45 - mmengine - INFO - Dispatch LlamaFlashAttention2 forward. Due to the implementation of the PyTorch version of flash attention, even when the `output_attentions` flag is set to True, it is not possible to return the `attn_weights`.
You are using an old version of the checkpointing format that is deprecated (We will also silently ignore `gradient_checkpointing_kwargs` in case you passed it).Please update to the new format on your modeling file. To use the new format, you need to completely remove the definition of the method `_set_gradient_checkpointing` in your model.
[2024-08-29 09:57:01,183] [INFO] [logging.py:96:log_dist] [Rank -1] DeepSpeed info: version=0.15.0, git-hash=unknown, git-branch=unknown
[2024-08-29 10:10:07,984] [INFO] [comm.py:652:init_distributed] cdb=None
[2024-08-29 11:22:46,264] [INFO] [comm.py:667:init_distributed] Not using the DeepSpeed or dist launchers, attempting to detect MPI environment...
[relation-defined-deeply-d5z-shaun-yang1-master-0:755096:0:755096] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x54000009)
BFD: DWARF error: section .debug_info is larger than its filesize! (0x5d4f30 vs 0x4369b8)
BFD: DWARF error: section .debug_info is larger than its filesize! (0x5d4f30 vs 0x4369b8)
==== backtrace (tid: 755096) ====
0 0x0000000000042520 __sigaction() ???:0
1 0x000000000006ce20 PMPI_Comm_set_errhandler() /build-result/src/hpcx-v2.16-gcc-inbox-ubuntu22.04-cuda12-gdrcopy2-nccl2.18-x86_64/ompi-db10576f403e833fdf7cd0d938e66b8393b20680/ompi/mpi/c/profile/pcomm_set_errhandler.c:81
2 0x000000000006ce20 opal_atomic_add_fetch_32() /build-result/src/hpcx-v2.16-gcc-inbox-ubuntu22.04-cuda12-gdrcopy2-nccl2.18-x86_64/ompi-db10576f403e833fdf7cd0d938e66b8393b20680/ompi/mpi/c/profile/../../../../opal/include/opal/sys/atomic_impl.h:384
3 0x000000000006ce20 opal_thread_add_fetch_32() /build-result/src/hpcx-v2.16-gcc-inbox-ubuntu22.04-cuda12-gdrcopy2-nccl2.18-x86_64/ompi-db10576f403e833fdf7cd0d938e66b8393b20680/ompi/mpi/c/profile/../../../../opal/threads/thread_usage.h:152
4 0x000000000006ce20 opal_obj_update() /build-result/src/hpcx-v2.16-gcc-inbox-ubuntu22.04-cuda12-gdrcopy2-nccl2.18-x86_64/ompi-db10576f403e833fdf7cd0d938e66b8393b20680/ompi/mpi/c/profile/../../../../opal/class/opal_object.h:534
5 0x000000000006ce20 PMPI_Comm_set_errhandler() /build-result/src/hpcx-v2.16-gcc-inbox-ubuntu22.04-cuda12-gdrcopy2-nccl2.18-x86_64/ompi-db10576f403e833fdf7cd0d938e66b8393b20680/ompi/mpi/c/profile/pcomm_set_errhandler.c:70
6 0x000000000002fcb9 __pyx_f_6mpi4py_3MPI_comm_set_eh() /mpi4py/src/mpi4py.MPI.c:40330
7 0x000000000002fcb9 __pyx_f_6mpi4py_3MPI_initialize() /mpi4py/src/mpi4py.MPI.c:8406
8 0x000000000002fcb9 __pyx_f_6mpi4py_3MPI_initialize() /mpi4py/src/mpi4py.MPI.c:8378
9 0x000000000002fcb9 __pyx_pymod_exec_MPI() /mpi4py/src/mpi4py.MPI.c:176985
10 0x00000000002371df PyModule_ExecDef() ???:0
11 0x0000000000237ab0 PyInit__thread() ???:0
12 0x000000000015a6e4 PyObject_GenericGetAttr() ???:0
13 0x0000000000145a9d _PyEval_EvalFrameDefault() ???:0
14 0x000000000015b59c _PyFunction_Vectorcall() ???:0
15 0x0000000000204612 _PyLong_Format() ???:0
16 0x000000000014abf1 _PyEval_EvalFrameDefault() ???:0
17 0x000000000015b59c _PyFunction_Vectorcall() ???:0
18 0x0000000000204612 _PyLong_Format() ???:0
19 0x000000000014a566 _PyEval_EvalFrameDefault() ???:0
20 0x000000000015b59c _PyFunction_Vectorcall() ???:0
21 0x0000000000204612 _PyLong_Format() ???:0
22 0x000000000014a4b4 _PyEval_EvalFrameDefault() ???:0
23 0x000000000015b59c _PyFunction_Vectorcall() ???:0
24 0x0000000000204612 _PyLong_Format() ???:0
25 0x000000000014a4b4 _PyEval_EvalFrameDefault() ???:0
26 0x000000000015b59c _PyFunction_Vectorcall() ???:0
27 0x000000000015a9b4 PyObject_CallFunctionObjArgs() ???:0
28 0x000000000023b1bf _PyObject_CallMethodIdObjArgs() ???:0
29 0x000000000016f113 PyImport_ImportModuleLevelObject() ???:0
30 0x000000000017f428 PyImport_Import() ???:0
31 0x000000000015ac9e PyObject_CallFunctionObjArgs() ???:0
32 0x0000000000169d4b PyObject_Call() ???:0
33 0x0000000000145a9d _PyEval_EvalFrameDefault() ???:0
34 0x000000000015b59c _PyFunction_Vectorcall() ???:0
35 0x0000000000204612 _PyLong_Format() ???:0
36 0x000000000014a4b4 _PyEval_EvalFrameDefault() ???:0
37 0x000000000015b59c _PyFunction_Vectorcall() ???:0
38 0x000000000015a9b4 PyObject_CallFunctionObjArgs() ???:0
39 0x000000000023b1bf _PyObject_CallMethodIdObjArgs() ???:0
40 0x000000000016fba1 PyImport_ImportModuleLevelObject() ???:0
41 0x000000000014635b _PyEval_EvalFrameDefault() ???:0
42 0x000000000015b59c _PyFunction_Vectorcall() ???:0
43 0x0000000000204612 _PyLong_Format() ???:0
44 0x000000000014aba2 _PyEval_EvalFrameDefault() ???:0
45 0x000000000015b59c _PyFunction_Vectorcall() ???:0
46 0x0000000000204612 _PyLong_Format() ???:0
47 0x000000000014aba2 _PyEval_EvalFrameDefault() ???:0
48 0x000000000015b59c _PyFunction_Vectorcall() ???:0
49 0x0000000000204612 _PyLong_Format() ???:0
50 0x000000000014aba2 _PyEval_EvalFrameDefault() ???:0
51 0x0000000000169111 PyMethod_New() ???:0
52 0x0000000000204612 _PyLong_Format() ???:0
53 0x000000000014abf1 _PyEval_EvalFrameDefault() ???:0
54 0x000000000015b59c _PyFunction_Vectorcall() ???:0
55 0x0000000000204612 _PyLong_Format() ???:0
56 0x000000000014a566 _PyEval_EvalFrameDefault() ???:0
I have traced the failure to deepspeed.comm.comm.mpi_discovery, specifically this import:
from mpi4py import MPI
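The import alone reproduces the crash outside of xtuner. A minimal check (assuming, as the __pyx_f_6mpi4py_3MPI_initialize frames in the backtrace suggest, that it is mpi4py's import-time MPI initialization that segfaults):
# one-line check: segfaults with mpi4py-mpich installed, prints the MPI banner with mpi4py
python -c "from mpi4py import MPI; print(MPI.Get_library_version())"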
After replacing mpi4py-mpich with mpi4py, the problem above no longer occurs.
As described in https://github.com/mpi4py/mpi4py/issues/463 and https://github.com/mpi4py/mpi4py/issues/508, mpi4py-mpich is not well maintained (and does not support the linux aarch64 architecture), so I suggest changing this dependency to mpi4py.
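For anyone who wants to verify the fix locally, the swap is simply the following (note: if no prebuilt wheel is available for your platform, installing mpi4py builds from source and needs a system MPI toolchain):
# replace the mpi4py-mpich wheel with the regular mpi4py package
pip uninstall -y mpi4py-mpich
pip install mpi4py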