OpenMOSS / CoLLiE

Collaborative Training of Large Language Models in an Efficient Way
https://openlmlab-collie.readthedocs.io
Apache License 2.0

Training fails with no error message #102

Closed 2793145003 closed 1 year ago

2793145003 commented 1 year ago

I followed the steps in the README, only changing the model to llama-2-70B. Output:

Training Epoch: 0 / 1     0% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ -- it./s 0:00:00 / -:--:--
Training Batch: 0 / 157   0% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ -- it./s 0:00:00 / -:--:--  /usr/local/lib/python3.10/dist-packages/torch/utils/checkpoint.py:426: UserWarning: torch.utils.checkpoint: please pass in use_reentrant=True or use_reentrant=False explicitly. The default value of use_reentrant will be updated to be False in the future. To maintain current behavior, pass use_reentrant=True. It is recommended that you use use_reentrant=False. Refer to docs for more details on the differences between the two variants.
  warnings.warn(
/usr/local/lib/python3.10/dist-packages/torch/utils/checkpoint.py:426: UserWarning: torch.utils.checkpoint: please pass in use_reentrant=True or use_reentrant=False explicitly. The default value of use_reentrant will be updated to be False in the future. To maintain current behavior, pass use_reentrant=True. It is recommended that you use use_reentrant=False. Refer to docs for more details on the differences between the two variants.
  warnings.warn(
/usr/local/lib/python3.10/dist-packages/torch/utils/checkpoint.py:426: UserWarning: torch.utils.checkpoint: please pass in use_reentrant=True or use_reentrant=False explicitly. The default value of use_reentrant will be updated to be False in the future. To maintain current behavior, pass use_reentrant=True. It is recommended that you use use_reentrant=False. Refer to docs for more details on the differences between the two variants.
/usr/local/lib/python3.10/dist-packages/torch/utils/checkpoint.py:426: UserWarning: torch.utils.checkpoint: please pass in use_reentrant=True or use_reentrant=False explicitly. The default value of use_reentrant will be updated to be False in the future. To maintain current behavior, pass use_reentrant=True. It is recommended that you use use_reentrant=False. Refer to docs for more details on the differences between the two variants.
  warnings.warn(
Training Epoch: 0 / 1     0% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ -- it./s 0:00:07 / -:--:--
Training Batch: 0 / 157   0% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ -- it./s 0:00:07 / -:--:--  [2023-08-08 03:08:43,774] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: -7) local_rank: 0 (pid: 51194) of binary: /usr/bin/python3
Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==2.1.0.dev20230725+cu121', 'console_scripts', 'torchrun')())
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 797, in main
    run(args)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 788, in run
    elastic_launch(
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
=====================================================
collie.py FAILED
-----------------------------------------------------
Failures:
[1]:
  time      : 2023-08-08_03:08:43
  host      : 58283303bbb0
  rank      : 1 (local_rank: 1)
  exitcode  : -7 (pid: 51195)
  error_file: <N/A>
  traceback : Signal 7 (SIGBUS) received by PID 51195
[2]:
  time      : 2023-08-08_03:08:43
  host      : 58283303bbb0
  rank      : 2 (local_rank: 2)
  exitcode  : -7 (pid: 51196)
  error_file: <N/A>
  traceback : Signal 7 (SIGBUS) received by PID 51196
[3]:
  time      : 2023-08-08_03:08:43
  host      : 58283303bbb0
  rank      : 3 (local_rank: 3)
  exitcode  : -7 (pid: 51197)
  error_file: <N/A>
  traceback : Signal 7 (SIGBUS) received by PID 51197
-----------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-08-08_03:08:43
  host      : 58283303bbb0
  rank      : 0 (local_rank: 0)
  exitcode  : -7 (pid: 51194)
  error_file: <N/A>
  traceback : Signal 7 (SIGBUS) received by PID 51194
=====================================================

After restarting the container, training worked again. After restarting once more, I switched to 8 GPUs with `CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 torchrun --rdzv_backend=c10d --rdzv_endpoint=localhost:29402 --nnodes=1 --nproc_per_node=8`. Output:

Training Epoch: 0 / 1    0% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ -- it./s 0:00:00 / -:--:--
Training Batch: 0 / 79   0% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ -- it./s 0:00:00 / -:--:--  /usr/local/lib/python3.10/dist-packages/torch/utils/checkpoint.py:426: UserWarning: torch.utils.checkpoint: please pass in use_reentrant=True or use_reentrant=False explicitly. The default value of use_reentrant will be updated to be False in the future. To maintain current behavior, pass use_reentrant=True. It is recommended that you use use_reentrant=False. Refer to docs for more details on the differences between the two variants.
  warnings.warn(
/usr/local/lib/python3.10/dist-packages/torch/utils/checkpoint.py:426: UserWarning: torch.utils.checkpoint: please pass in use_reentrant=True or use_reentrant=False explicitly. The default value of use_reentrant will be updated to be False in the future. To maintain current behavior, pass use_reentrant=True. It is recommended that you use use_reentrant=False. Refer to docs for more details on the differences between the two variants.
  warnings.warn(
/usr/local/lib/python3.10/dist-packages/torch/utils/checkpoint.py:426: UserWarning: torch.utils.checkpoint: please pass in use_reentrant=True or use_reentrant=False explicitly. The default value of use_reentrant will be updated to be False in the future. To maintain current behavior, pass use_reentrant=True. It is recommended that you use use_reentrant=False. Refer to docs for more details on the differences between the two variants.
  warnings.warn(
/usr/local/lib/python3.10/dist-packages/torch/utils/checkpoint.py:426: UserWarning: torch.utils.checkpoint: please pass in use_reentrant=True or use_reentrant=False explicitly. The default value of use_reentrant will be updated to be False in the future. To maintain current behavior, pass use_reentrant=True. It is recommended that you use use_reentrant=False. Refer to docs for more details on the differences between the two variants.
  warnings.warn(
/usr/local/lib/python3.10/dist-packages/torch/utils/checkpoint.py:426: UserWarning: torch.utils.checkpoint: please pass in use_reentrant=True or use_reentrant=False explicitly. The default value of use_reentrant will be updated to be False in the future. To maintain current behavior, pass use_reentrant=True. It is recommended that you use use_reentrant=False. Refer to docs for more details on the differences between the two variants.
  warnings.warn(
/usr/local/lib/python3.10/dist-packages/torch/utils/checkpoint.py:426: UserWarning: torch.utils.checkpoint: please pass in use_reentrant=True or use_reentrant=False explicitly. The default value of use_reentrant will be updated to be False in the future. To maintain current behavior, pass use_reentrant=True. It is recommended that you use use_reentrant=False. Refer to docs for more details on the differences between the two variants.
  warnings.warn(
/usr/local/lib/python3.10/dist-packages/torch/utils/checkpoint.py:426: UserWarning: torch.utils.checkpoint: please pass in use_reentrant=True or use_reentrant=False explicitly. The default value of use_reentrant will be updated to be False in the future. To maintain current behavior, pass use_reentrant=True. It is recommended that you use use_reentrant=False. Refer to docs for more details on the differences between the two variants.
/usr/local/lib/python3.10/dist-packages/torch/utils/checkpoint.py:426: UserWarning: torch.utils.checkpoint: please pass in use_reentrant=True or use_reentrant=False explicitly. The default value of use_reentrant will be updated to be False in the future. To maintain current behavior, pass use_reentrant=True. It is recommended that you use use_reentrant=False. Refer to docs for more details on the differences between the two variants.
  warnings.warn(
Training Epoch: 0 / 1    0% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ -- it./s 0:00:13 / -:--:--
Training Batch: 0 / 79   0% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ -- it./s 0:00:13 / -:--:--  [2023-08-08 03:20:09,633] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 96 closing signal SIGTERM
[2023-08-08 03:20:09,639] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 98 closing signal SIGTERM
[2023-08-08 03:20:09,641] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 99 closing signal SIGTERM
[2023-08-08 03:20:09,643] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 100 closing signal SIGTERM
[2023-08-08 03:20:09,645] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 101 closing signal SIGTERM
[2023-08-08 03:20:09,648] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 102 closing signal SIGTERM
[2023-08-08 03:20:11,395] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: -7) local_rank: 0 (pid: 95) of binary: /usr/bin/python3
Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==2.1.0.dev20230725+cu121', 'console_scripts', 'torchrun')())
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 797, in main
    run(args)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 788, in run
    elastic_launch(
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
==================================================
collie.py FAILED
--------------------------------------------------
Failures:
[1]:
  time      : 2023-08-08_03:20:09
  host      : 06a78451e09d
  rank      : 2 (local_rank: 2)
  exitcode  : -7 (pid: 97)
  error_file: <N/A>
  traceback : Signal 7 (SIGBUS) received by PID 97
--------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-08-08_03:20:09
  host      : 06a78451e09d
  rank      : 0 (local_rank: 0)
  exitcode  : -7 (pid: 95)
  error_file: <N/A>
  traceback : Signal 7 (SIGBUS) received by PID 95
==================================================

How should I go about debugging this?

KaiLv69 commented 1 year ago

Hi, thanks for the report; we will improve error reporting during training. For now, you can try wrapping trainer.train() in a try/except, for example:

try:
    trainer.train()
except BaseException:
    import traceback
    from rich.console import Console
    # Append the full traceback to a log file, then also render it to the terminal.
    # (Using a context manager so the file is actually flushed and closed even if
    # the process is killed later; the original left the handle open and also
    # redirected sys.stdout into the file, swallowing the console output.)
    with open("./traceback.log", "a+") as file:
        traceback.print_exc(file=file)
        file.write("\n\n")
    Console().print_exception()
    raise
2793145003 commented 1 year ago


Thanks for the reply! Even with the try/except there is still no error output. After some more searching, it looks like a DeepSpeed issue, or rather a Docker configuration issue. The fix is described here: https://github.com/microsoft/DeepSpeed/issues/4002
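For anyone hitting the same symptom: exit code -7 / Signal 7 (SIGBUS) inside a container is commonly caused by an undersized `/dev/shm` (Docker defaults to 64 MB), which the NCCL/DeepSpeed worker processes use for inter-process shared memory. A sketch of the usual check and fix, assuming Docker (`my-training-image` is a placeholder for your own image; the 16g size is illustrative):

```shell
# Check how much shared memory the running container actually has.
# With Docker's default you will see a 64M filesystem here.
df -h /dev/shm

# Restart the container with a larger /dev/shm ...
docker run --gpus all --shm-size=16g -it my-training-image bash

# ... or share the host's IPC namespace (and thus the host's /dev/shm):
docker run --gpus all --ipc=host -it my-training-image bash
```

Either flag avoids the SIGBUS that occurs when a process maps more shared memory than the `tmpfs` backing `/dev/shm` can provide.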