Loading checkpoint shards: 14%|███████ | 1/7 [00:13<01:22, 13.68s/it]
W0629 00:06:21.246000 140229302286144 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 29734 closing signal SIGTERM
W0629 00:06:21.246000 140229302286144 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 29735 closing signal SIGTERM
W0629 00:06:21.246000 140229302286144 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 29737 closing signal SIGTERM
E0629 00:06:24.298000 140229302286144 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: -9) local_rank: 2 (pid: 29736) of binary: /opt/conda/bin/python3.10
Traceback (most recent call last):
  File "/opt/conda/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/run.py", line 879, in main
    run(args)
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/run.py", line 870, in run
    elastic_launch(
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
any help is appreciated (: