LianjiaTech / BELLE

BELLE: Be Everyone's Large Language model Engine (an open-source Chinese conversational LLM)
Apache License 2.0

Fine-tuning fails with ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) #413

Open uuser0748 opened 1 year ago

uuser0748 commented 1 year ago

Previously I also hit OOM on 4× V100 32GB, then switched to a single A100 80GB and it ran fine. Now, on the 4× V100 32GB machine, I get the error in the title. With two machines (8× V100 32GB total) using scripts/multinode_run.sh, I still get the same error. Could this be caused by insufficient GPU memory? No other logs are printed.

[INFO|modeling_utils.py:2263] 2023-05-29 11:08:21,313 >> Offline mode: forcing local_files_only=True
[INFO|modeling_utils.py:2531] 2023-05-29 11:08:21,313 >> loading weights file /workspace/BELLE-7B-2M/pytorch_model.bin
[INFO|configuration_utils.py:575] 2023-05-29 11:09:55,190 >> Generate config GenerationConfig {
  "_from_model_config": true,
  "bos_token_id": 1,
  "eos_token_id": 2,
  "pad_token_id": 3,
  "transformers_version": "4.28.1"
}

WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 781 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 782 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 784 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) local_rank: 2 (pid: 783) of binary: /opt/conda/bin/python3
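For context: exit code -9 means the process received SIGKILL, which on Linux is typically the kernel OOM killer reacting to exhausted host RAM (not GPU memory). A rough back-of-the-envelope sketch of why this can happen when every data-parallel rank loads the full fp32 checkpoint into CPU memory (the numbers below are illustrative assumptions, not measured values):

```python
# Rough host-RAM estimate for loading a full fp32 checkpoint in every
# data-parallel rank (illustrative assumptions for a ~7B model).
params = 7_000_000_000   # ~7B parameters (assumed for BELLE-7B-2M)
bytes_per_param = 4      # fp32 weights in pytorch_model.bin
ranks = 4                # one process per GPU on a 4x V100 node

per_rank_gb = params * bytes_per_param / 1024**3
total_gb = per_rank_gb * ranks
print(f"per rank: {per_rank_gb:.1f} GiB, all ranks: {total_gb:.1f} GiB")
# per rank: 26.1 GiB, all ranks: 104.3 GiB
```

Under these assumptions, four ranks loading the checkpoint simultaneously need over 100 GiB of host RAM before training even starts, which a typical 4-GPU node may not have.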
xianghuisun commented 1 year ago

> @uuser0748's original post and log, quoted above

Please paste a more detailed log. Also, you could try a smaller bloomz model; that would tell you whether the problem is insufficient memory.
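One way to check whether an out-of-memory kill is the cause (a sketch; exact output varies by distro, and reading the kernel log may require root):

```shell
# If the kernel OOM killer sent the SIGKILL (exitcode -9), it leaves a
# trace in the kernel log. Look for recent kill messages:
dmesg -T | grep -i -E "killed process|out of memory" | tail -n 5

# Also watch host RAM while the job is loading the checkpoint; exit
# code -9 usually points at host RAM, not GPU memory, running out.
free -h
```

If `dmesg` shows the training PID being killed, reducing per-rank host memory use (e.g. a smaller model, as suggested above) is the thing to try first.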

dong-yu-czl commented 11 months ago

> @uuser0748's original post and log, quoted above

Have you managed to solve this problem?