FutureXiang / soda

Unofficial implementation of "SODA: Bottleneck Diffusion Models for Representation Learning"

Issues with distributed training environment #5

Open V1oletM opened 1 month ago

V1oletM commented 1 month ago

Hi FutureXiang, thanks for your code! When I train on CIFAR-10, I run into an error during distributed training:

```
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 2) local_rank: 0 (pid: 75972) of binary: /home/wangyiming/anaconda3/envs/diffusion/bin/python
Traceback (most recent call last):
  File "/home/wangyiming/anaconda3/envs/diffusion/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/wangyiming/anaconda3/envs/diffusion/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/wangyiming/anaconda3/envs/diffusion/lib/python3.10/site-packages/torch/distributed/launch.py", line 196, in <module>
    main()
  File "/home/wangyiming/anaconda3/envs/diffusion/lib/python3.10/site-packages/torch/distributed/launch.py", line 192, in main
    launch(args)
  File "/home/wangyiming/anaconda3/envs/diffusion/lib/python3.10/site-packages/torch/distributed/launch.py", line 177, in launch
    run(args)
  File "/home/wangyiming/anaconda3/envs/diffusion/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/home/wangyiming/anaconda3/envs/diffusion/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/wangyiming/anaconda3/envs/diffusion/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
```

I'm not sure if it's a version issue. Could you please provide the environment.yaml file? Thanks!

FutureXiang commented 1 month ago

Hi, I'm using pytorch=2.0.0 with python=3.8. The environment.yaml file can be found at https://pastebin.com/DCuA0us6.
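For reference, here is a minimal sanity check you could run to confirm your local setup roughly matches those versions; this is only a sketch, and the authoritative package list is the environment.yaml linked above:

```python
# Quick check of the local environment against the versions mentioned above
# (PyTorch 2.0.0, Python 3.8). Sketch only; the exact dependency pins come
# from the linked environment.yaml, not from this snippet.
import sys
import torch

print("Python  :", sys.version.split()[0])
print("PyTorch :", torch.__version__)
print("CUDA ok :", torch.cuda.is_available())
print("GPUs    :", torch.cuda.device_count())
```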

Please note that with distributed training, the actual error output always appears either after the ChildFailedError line or before the ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 2) local_rank: 0 (pid: 75972) of binary line. The partial traceback you posted is only the launcher-side stack and does not contain any specific failure details, so please look at the other parts of the traceback log.
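One way to make the worker's own traceback show up in the ChildFailedError summary is torch.distributed.elastic's record decorator. A minimal sketch (the entry-point name main and its body are placeholders, not the actual code in this repo):

```python
# Sketch: wrapping the training entry point with @record makes
# torch.distributed.elastic include the failing worker's traceback in the
# ChildFailedError summary, instead of only the launcher-side stack.
# The function body below is hypothetical, not this repo's training code.
from torch.distributed.elastic.multiprocessing.errors import record

@record
def main():
    ...  # parse args, set up DDP, run training

if __name__ == "__main__":
    main()
```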