Open · uuser0748 opened 1 year ago
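Previously I also got an OOM on V100 32G*4; after switching to a single A100 80G*1 machine, the run completed normally. Now V100 32G*4 reports the error in the title. With two machines (8 V100 32G cards in total) and scripts/multinode_run.sh, the same error still occurs. Is this caused by insufficient GPU memory? No other logs were printed.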
[INFO|modeling_utils.py:2263] 2023-05-29 11:08:21,313 >> Offline mode: forcing local_files_only=True
[INFO|modeling_utils.py:2531] 2023-05-29 11:08:21,313 >> loading weights file /workspace/BELLE-7B-2M/pytorch_model.bin
[INFO|configuration_utils.py:575] 2023-05-29 11:09:55,190 >> Generate config GenerationConfig {
  "_from_model_config": true,
  "bos_token_id": 1,
  "eos_token_id": 2,
  "pad_token_id": 3,
  "transformers_version": "4.28.1"
}
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 781 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 782 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 784 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) local_rank: 2 (pid: 783) of binary: /opt/conda/bin/python3
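Could you please paste more detailed logs? You could also try a smaller bloomz model; that would tell you whether insufficient GPU memory is the problem.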
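While gathering more detailed logs, it may help to decode the `exitcode: -9` reported above. A minimal sketch, assuming the launcher follows the usual Python convention in which a negative exit code `-N` means the worker was terminated by signal `N`; the helper name `describe_exit_code` is hypothetical and not part of this repository:

```python
import signal


def describe_exit_code(exitcode: int) -> str:
    """Return a readable description of a subprocess-style exit code."""
    if exitcode < 0:
        # Assumption: negative values mean "terminated by signal -exitcode",
        # as Python's subprocess and multiprocessing modules report them.
        try:
            name = signal.Signals(-exitcode).name
        except ValueError:
            name = "unknown signal"
        return f"killed by signal {-exitcode} ({name})"
    if exitcode == 0:
        return "exited normally"
    return f"exited with status {exitcode}"


print(describe_exit_code(-9))
```

Under that assumption, -9 corresponds to SIGKILL. On Linux one common source of SIGKILL is the kernel's out-of-memory killer, which reacts to host RAM running out, not GPU memory; GPU exhaustion in PyTorch usually surfaces as a CUDA out-of-memory error in the log instead. This is only a possible explanation here. Checking the kernel log (for example with `dmesg`) and host memory usage while each rank loads pytorch_model.bin could confirm or rule it out.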
Have you solved this problem?