Open V1oletM opened 7 months ago
Hi, I'm using pytorch=2.0.0 with python=3.8. The environment.yaml file can be found at https://pastebin.com/DCuA0us6.
Please note that with distributed training, the actual error output always appears either after the ChildFailedError or before the line ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 2) local_rank: 0 (pid: 75972) of binary. The incomplete traceback you provided does not contain any specific failure details, so please check the other parts of the traceback log.
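To make the "look before the launcher's ERROR line" advice concrete, here is a minimal sketch that pulls out the log lines printed just before that marker. The helper name `lines_before_launcher_error`, the `context` parameter, and the sample log are all hypothetical and only for illustration; nothing here is part of PyTorch, and it only covers the "before the ERROR line" case, not output that follows the ChildFailedError.

```python
from typing import List

# Prefix of the line the elastic launcher prints when a worker exits abnormally
# (copied from the log in this thread).
ELASTIC_MARKER = "ERROR:torch.distributed.elastic.multiprocessing.api:failed"


def lines_before_launcher_error(log_text: str, context: int = 20) -> List[str]:
    """Return up to `context` lines printed just before the launcher's failure marker.

    Hypothetical helper for illustration; it is not part of PyTorch.
    """
    lines = log_text.splitlines()
    for index, line in enumerate(lines):
        if line.startswith(ELASTIC_MARKER):
            # The worker's own traceback or message usually sits right above this line.
            return lines[max(0, index - context):index]
    # Marker not found: fall back to the tail of the log.
    return lines[max(0, len(lines) - context):]


if __name__ == "__main__":
    # Made-up log text, used only to show the call.
    sample_log = "\n".join([
        "step 1",
        "RuntimeError: made-up failure",
        ELASTIC_MARKER + " (exitcode: 1) local_rank: 0",
    ])
    for line in lines_before_launcher_error(sample_log, context=5):
        print(line)
```

To use it on a real run, the launcher's stdout and stderr would first need to be saved to a file and read in as text.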
Hi FutureXiang, thanks for your code! When training on CIFAR-10, I encounter an error during distributed training.
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 2) local_rank: 0 (pid: 75972) of binary: /home/wangyiming/anaconda3/envs/diffusion/bin/python
Traceback (most recent call last):
  File "/home/wangyiming/anaconda3/envs/diffusion/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/wangyiming/anaconda3/envs/diffusion/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/wangyiming/anaconda3/envs/diffusion/lib/python3.10/site-packages/torch/distributed/launch.py", line 196, in <module>
    main()
  File "/home/wangyiming/anaconda3/envs/diffusion/lib/python3.10/site-packages/torch/distributed/launch.py", line 192, in main
    launch(args)
  File "/home/wangyiming/anaconda3/envs/diffusion/lib/python3.10/site-packages/torch/distributed/launch.py", line 177, in launch
    run(args)
  File "/home/wangyiming/anaconda3/envs/diffusion/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/home/wangyiming/anaconda3/envs/diffusion/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/wangyiming/anaconda3/envs/diffusion/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
I'm not sure if it's a version issue. Could you please provide the environment.yaml file? Thanks!