Thank you for your work!
However, I have run into an issue: during both stage 1 and stage 2, training stops unexpectedly for no apparent reason, typically after several thousand steps.
Attached is the log from the point where training stopped abruptly.
I would really appreciate any guidance on what might be causing this.
Steps: 58%|█████▊ | 5831/10000 [3:27:42<2:04:26, 1.79s/it, lr=1e-5, step_loss=0.0255, td=0.06s]
Steps: 58%|█████▊ | 5832/10000 [3:27:44<2:04:41, 1.80s/it, lr=1e-5, step_loss=0.0255, td=0.06s]
Steps: 58%|█████▊ | 5832/10000 [3:27:44<2:04:41, 1.80s/it, lr=1e-5, step_loss=0.0366, td=0.05s]WARNING:torch.distributed.elastic.agent.server.api:Received 1 death signal, shutting down workers
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 68694 closing signal SIGHUP
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 68695 closing signal SIGHUP
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 68696 closing signal SIGHUP
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 68697 closing signal SIGHUP
Traceback (most recent call last):
File "/data/Moore-AnimateAnyone/.venv/bin/accelerate", line 8, in
sys.exit(main())
File "/data/Moore-AnimateAnyone/.venv/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 45, in main
args.func(args)
File "/data/Moore-AnimateAnyone/.venv/lib/python3.10/site-packages/accelerate/commands/launch.py", line 970, in launch_command
multi_gpu_launcher(args)
File "/data/Moore-AnimateAnyone/.venv/lib/python3.10/site-packages/accelerate/commands/launch.py", line 646, in multi_gpu_launcher
distrib_run.run(args)
File "/data/Moore-AnimateAnyone/.venv/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
elastic_launch(
File "/data/Moore-AnimateAnyone/.venv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/data/Moore-AnimateAnyone/.venv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 241, in launch_agent
result = agent.run()
File "/data/Moore-AnimateAnyone/.venv/lib/python3.10/site-packages/torch/distributed/elastic/metrics/api.py", line 129, in wrapper
result = f(*args, **kwargs)
File "/data/Moore-AnimateAnyone/.venv/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 723, in run
result = self._invoke_run(role)
File "/data/Moore-AnimateAnyone/.venv/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 864, in _invoke_run
time.sleep(monitor_interval)
File "/data/Moore-AnimateAnyone/.venv/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 62, in _terminate_process_handler
raise SignalException(f"Process {os.getpid()} got signal: {sigval}", sigval=sigval)
torch.distributed.elastic.multiprocessing.api.SignalException: Process 68556 got signal: 1
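From the trace, the elastic agent reports receiving signal 1 (SIGHUP) before shutting down the workers, which as far as I understand usually means the controlling terminal or SSH session was closed while the job was running. In case that turns out to be the cause on my side, here is a minimal sketch of how I would keep the launcher alive across a hang-up; the script and config paths are only my assumption of the repo's stage 1 defaults, so please correct me if the launch command should look different:

```
# Run stage 1 training detached from the terminal so a SIGHUP from a
# dropped SSH session does not kill the accelerate launcher.
# The script/config paths below are assumptions based on the README.
nohup accelerate launch train_stage_1.py --config configs/train/stage1.yaml \
    > train_stage1.log 2>&1 &

# Alternatively, run it inside a tmux session that survives disconnects:
# tmux new -s stage1
# accelerate launch train_stage_1.py --config configs/train/stage1.yaml
```

Could a dropped session really explain this, or is there something in the training code itself that could be sending the signal?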