Open hjing100 opened 1 year ago
Launched with nohup and &, the job shuts down with:
[16:21:34] WARNING Received 1 death signal, shutting down workers api.py:729
WARNING Sending process 2928786 closing signal SIGHUP api.py:698
WARNING Sending process 2928787 closing signal SIGHUP api.py:698
WARNING Sending process 2928788 closing signal SIGHUP api.py:698
WARNING Sending process 2928789 closing signal SIGHUP api.py:698
WARNING Sending process 2928790 closing signal SIGHUP api.py:698
WARNING Sending process 2928791 closing signal SIGHUP api.py:698
WARNING Sending process 2928792 closing signal SIGHUP api.py:698
WARNING Sending process 2928793 closing signal SIGHUP api.py:698
╭───────────────────── Traceback (most recent call last) ──────────────────────╮
│ /home/hhh/.conda/envs/python38/bin/accelerate:8 in
Have you solved this problem? In my case, it seems to happen when I close the terminal.
Me too. How did you solve it?
I finally solved this problem by using tmux.
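For anyone who cannot use tmux, here is a rough Python sketch of a similar idea: start the job in its own session so it is no longer attached to the terminal that gets closed. This is only an assumption about what helps here, not something confirmed in this thread. `launch_detached` and the command and log names in the usage comment are hypothetical; `start_new_session=True` is a standard `subprocess.Popen` option on POSIX.

```python
import subprocess


def launch_detached(cmd, log_path):
    """Start cmd in a new session, appending its output to log_path.

    Hypothetical helper for illustration. The child gets its own session
    (setsid), so it does not belong to the closing terminal's session.
    """
    log_file = open(log_path, "ab")
    try:
        proc = subprocess.Popen(
            cmd,
            stdin=subprocess.DEVNULL,   # no reads from the terminal
            stdout=log_file,            # keep output in a file instead
            stderr=subprocess.STDOUT,
            start_new_session=True,     # POSIX only: child calls setsid()
        )
    finally:
        # The child holds its own copy of the descriptor, so the parent
        # can close its handle right away.
        log_file.close()
    return proc


# Hypothetical usage (script and log names are placeholders):
# proc = launch_detached(["accelerate", "launch", "train.py"], "train.log")
# print("started with pid", proc.pid)
```

The launcher script can exit right after printing the pid. Whether the job then survives closing the terminal may still depend on the shell and the cluster setup, so check the log file after disconnecting.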
--- Logging error ---
Traceback (most recent call last):
File "/home/hhh/.conda/envs/python38/lib/python3.8/site-packages/torch/distributed/elastic/agent/server/api.py", line 723, in run
result = self._invoke_run(role)
File "/home/hhh/.conda/envs/python38/lib/python3.8/site-packages/torch/distributed/elastic/agent/server/api.py", line 864, in _invoke_run
time.sleep(monitor_interval)
File "/home/hhh/.conda/envs/python38/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 62, in _terminate_process_handler
raise SignalException(f"Process {os.getpid()} got signal: {sigval}", sigval=sigval)
torch.distributed.elastic.multiprocessing.api.SignalException: Process 2890585 got signal: 1
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/hhh/.conda/envs/python38/lib/python3.8/site-packages/rich/logging.py", line 170, in emit
self.console.print(log_renderable)
File "/home/hhh/.conda/envs/python38/lib/python3.8/site-packages/rich/console.py", line 1684, in print
render_options = self.options.update(
File "/home/hhh/.conda/envs/python38/lib/python3.8/site-packages/rich/console.py", line 982, in options
max_height=self.size.height,
File "/home/hhh/.conda/envs/python38/lib/python3.8/site-packages/rich/console.py", line 1002, in size
if self.is_dumb_terminal:
File "/home/hhh/.conda/envs/python38/lib/python3.8/site-packages/rich/console.py", line 974, in is_dumb_terminal
_term = self._environ.get("TERM", "")
File "/home/hhh/.conda/envs/python38/lib/python3.8/_collections_abc.py", line 660, in get
return self[key]
File "/home/hhh/.conda/envs/python38/lib/python3.8/os.py", line 672, in __getitem__
value = self._data[self.encodekey(key)]
File "/home/hhh/.conda/envs/python38/lib/python3.8/os.py", line 748, in encode
def encode(value):
File "/home/hhh/.conda/envs/python38/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 62, in _terminate_process_handler
raise SignalException(f"Process {os.getpid()} got signal: {sigval}", sigval=sigval)
torch.distributed.elastic.multiprocessing.api.SignalException: Process 2890585 got signal: 1
Call stack:
File "/home/hhh/.conda/envs/python38/bin/accelerate", line 8, in <module>
sys.exit(main())
File "/home/hhh/.conda/envs/python38/lib/python3.8/site-packages/accelerate/commands/accelerate_cli.py", line 45, in main
args.func(args)
File "/home/hhh/.conda/envs/python38/lib/python3.8/site-packages/accelerate/commands/launch.py", line 900, in launch_command
deepspeed_launcher(args)
File "/home/hhh/.conda/envs/python38/lib/python3.8/site-packages/accelerate/commands/launch.py", line 643, in deepspeed_launcher
distrib_run.run(args)
File "/home/hhh/.conda/envs/python38/lib/python3.8/site-packages/torch/distributed/run.py", line 785, in run
elastic_launch(
File "/home/hhh/.conda/envs/python38/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/hhh/.conda/envs/python38/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 241, in launch_agent
result = agent.run()
File "/home/hhh/.conda/envs/python38/lib/python3.8/site-packages/torch/distributed/elastic/metrics/api.py", line 129, in wrapper
result = f(*args, **kwargs)
File "/home/hhh/.conda/envs/python38/lib/python3.8/site-packages/torch/distributed/elastic/agent/server/api.py", line 729, in run
log.warning(f"Received {e.sigval} death signal, shutting down workers")
File "/home/hhh/.conda/envs/python38/lib/python3.8/logging/__init__.py", line 1458, in warning
self._log(WARNING, msg, args, **kwargs)
File "/home/hhh/.conda/envs/python38/lib/python3.8/logging/__init__.py", line 1589, in _log
self.handle(record)
File "/home/hhh/.conda/envs/python38/lib/python3.8/logging/__init__.py", line 1599, in handle
self.callHandlers(record)
File "/home/hhh/.conda/envs/python38/lib/python3.8/logging/__init__.py", line 1661, in callHandlers
hdlr.handle(record)
File "/home/hhh/.conda/envs/python38/lib/python3.8/logging/__init__.py", line 954, in handle
self.emit(record)
File "/home/hhh/.conda/envs/python38/lib/python3.8/site-packages/rich/logging.py", line 172, in emit
self.handleError(record)
Message: 'Received 1 death signal, shutting down workers'
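Signal 1 in these messages is SIGHUP. The traceback shows a Python-level handler (`_terminate_process_handler`) raising on it, which may be why nohup alone does not help: nohup starts the program with SIGHUP ignored, but a handler installed later replaces that setting. Below is a minimal sketch of that effect. The `SignalException` class and handler here are simplified stand-ins written for illustration, not torch's actual code, and `demo` is a hypothetical name.

```python
import os
import signal
import time


class SignalException(Exception):
    """Illustrative stand-in for the exception seen in the traceback."""

    def __init__(self, msg, sigval):
        super().__init__(msg)
        self.sigval = sigval


def _terminate_process_handler(signum, frame):
    # Simplified handler: turn the signal into a Python exception.
    sigval = signal.Signals(signum)
    raise SignalException(f"Process {os.getpid()} got signal: {sigval}", sigval=sigval)


def demo():
    """Return (previous SIGHUP setting, signal caught by the handler)."""
    original = signal.getsignal(signal.SIGHUP)
    # Emulate what nohup does before exec: ignore SIGHUP.
    signal.signal(signal.SIGHUP, signal.SIG_IGN)
    # Installing a handler afterwards replaces the "ignore" setting.
    previous = signal.signal(signal.SIGHUP, _terminate_process_handler)
    try:
        os.kill(os.getpid(), signal.SIGHUP)
        time.sleep(0.1)  # give the interpreter a chance to run the handler
        return previous, None
    except SignalException as exc:
        return previous, exc.sigval
    finally:
        signal.signal(signal.SIGHUP, original)  # restore the initial setting


if __name__ == "__main__":
    previous, caught = demo()
    print("setting before the handler was installed:", previous)
    print("signal caught by the handler:", caught)
```

If this matches what happens in the launcher, keeping the process away from the terminal's SIGHUP (for example inside tmux, as reported above) would avoid the shutdown, while nohup would not.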