Coordi777 / Conditional-Diffusion-for-SAR-to-Optical-Image-Translation

The official implementation of Conditional Diffusion for SAR to Optical Image Translation

For help about train #1

Closed szh404 closed 5 months ago

szh404 commented 6 months ago

You have done a great job, thank you very much. However, when I run train.sh, some errors occur, and I would like to know how to avoid them.

```
Traceback (most recent call last):
  File "scripts/image_train.py", line 86, in <module>
    main()
  File "scripts/image_train.py", line 23, in main
    logger.configure()
  File "/home/shao/repo/Conditional-Diffusion-for-SAR-to-Optical-Image-Translation/guided_diffusion/logger.py", line 454, in configure
    os.makedirs(os.path.expanduser(dir), exist_ok=True)
  File "/home/shao/anaconda3/envs/torch21-py38/lib/python3.8/os.py", line 213, in makedirs
    makedirs(head, exist_ok=exist_ok)
  File "/home/shao/anaconda3/envs/torch21-py38/lib/python3.8/os.py", line 213, in makedirs
    makedirs(head, exist_ok=exist_ok)
  File "/home/shao/anaconda3/envs/torch21-py38/lib/python3.8/os.py", line 223, in makedirs
    mkdir(name, mode)
PermissionError: [Errno 13] Permission denied: '/dirs'
[E ProcessGroupGloo.cpp:138] Gloo connectFullMesh failed with Connection reset by peer
[E ProcessGroupGloo.cpp:138] Gloo connectFullMesh failed with Connection reset by peer
[E ProcessGroupGloo.cpp:138] Gloo connectFullMesh failed with Connection reset by peer
Traceback (most recent call last):
  File "scripts/image_train.py", line 86, in <module>
    main()
  File "scripts/image_train.py", line 22, in main
    dist_util.setup_dist()
  File "/home/shao/repo/Conditional-Diffusion-for-SAR-to-Optical-Image-Translation/guided_diffusion/dist_util.py", line 42, in setup_dist
    dist.init_process_group(backend=backend, init_method="env://")
  File "/home/shao/anaconda3/envs/torch21-py38/lib/python3.8/site-packages/torch/distributed/c10d_logger.py", line 74, in wrapper
    func_return = func(*args, **kwargs)
  File "/home/shao/anaconda3/envs/torch21-py38/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1148, in init_process_group
    default_pg, _ = _new_process_group_helper(
  File "/home/shao/anaconda3/envs/torch21-py38/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1264, in _new_process_group_helper
    backend_class = ProcessGroupGloo(backend_prefix_store, group_rank, group_size, timeout=timeout)
RuntimeError: Gloo connectFullMesh failed with Connection reset by peer
```

(The same Gloo traceback is printed by the other two worker processes.)

```
Primary job terminated normally, but 1 process returned a non-zero exit code. Per user-direction, the job has been aborted.

mpiexec detected that one or more processes exited with non-zero status, thus causing the job to be terminated. The first process to do so was:

  Process name: [[26054,1],0]
  Exit code: 1
```

Coordi777 commented 6 months ago

The key error is "PermissionError: [Errno 13] Permission denied: '/dirs'"; you should replace "export OPENAI_LOGDIR=/dirs/to/log" with your own path. 😊
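
In concrete terms, a minimal sketch of the change (assuming the export line lives in train.sh, as the traceback suggests; the path below is only a placeholder, and any directory your user can write to works):

```bash
# In train.sh: point OPENAI_LOGDIR at a directory your user can write to.
# The placeholder /dirs/to/log fails because /dirs cannot be created without
# root permissions, which is exactly the PermissionError shown above.
export OPENAI_LOGDIR=$HOME/sar2opt_logs   # placeholder path, not the repository's default
mkdir -p "$OPENAI_LOGDIR"
```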

szh404 commented 6 months ago

> The key error is "PermissionError: [Errno 13] Permission denied: '/dirs'"; you should replace "export OPENAI_LOGDIR=/dirs/to/log" with your own path. 😊

Thank you for your guidance; your advice was very effective. I really appreciate it.

After running `sh train.sh`, another error appeared:

```
Traceback (most recent call last):
  File "scripts/image_train.py", line 86, in <module>
    main()
  File "scripts/image_train.py", line 22, in main
    dist_util.setup_dist()
  File "/home/shao/repo/Conditional-Diffusion-for-SAR-to-Optical-Image-Translation/guided_diffusion/dist_util.py", line 42, in setup_dist
    dist.init_process_group(backend=backend, init_method="env://")
  File "/home/shao/anaconda3/envs/torch21-py38/lib/python3.8/site-packages/torch/distributed/c10d_logger.py", line 74, in wrapper
    func_return = func(*args, **kwargs)
  File "/home/shao/anaconda3/envs/torch21-py38/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1148, in init_process_group
    default_pg, _ = _new_process_group_helper(
  File "/home/shao/anaconda3/envs/torch21-py38/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1264, in _new_process_group_helper
    backend_class = ProcessGroupGloo(backend_prefix_store, group_rank, group_size, timeout=timeout)
RuntimeError: Socket Timeout
```

I searched for some information, and now I guess the reason is that my machine has a single GPU, while the code needs multiple GPUs to run. Right?

In summary, thank you very much for your work.

FUIGUIMURONG commented 6 months ago


May I ask where the dataset was downloaded from and what's the dataset structure? It would be great to get your reply.

Coordi777 commented 6 months ago

> I searched for some information, and now I guess the reason is that my machine has a single GPU, while the code needs multiple GPUs to run. Right?

Our example uses four GPUs by default. You need to modify "mpiexec -n x" according to your situation. For example, if you have one GPU, use "mpiexec -n 1" in train.sh.
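
Concretely, a minimal sketch of the relevant launch line (the flag variables below are only stand-ins for whatever arguments train.sh already passes to the script, not the repository's exact flags):

```bash
# Launch exactly one training process when only one GPU is available,
# so the Gloo process group is not left waiting for peers that never start.
mpiexec -n 1 python scripts/image_train.py $MODEL_FLAGS $TRAIN_FLAGS   # flag variables are placeholders
```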

szh404 commented 6 months ago

> Our example uses four GPUs by default. You need to modify "mpiexec -n x" according to your situation. For example, if you have one GPU, use "mpiexec -n 1" in train.sh.

Thank you very much for your patience and guidance; I will try it. May you succeed in your studies and may everything go smoothly for you.

szh404 commented 5 months ago

Sorry, I still cannot fix some of the problems.

I checked the issues in your repository and in guided-diffusion, but I did not find the same problems reported.

*(screenshot of the error output attached here)*

I would appreciate it if you could give me some advice.

Best wishes

szh404 commented 5 months ago

Sorry for the noise: I have now gotten the program to run. The issues mentioned above were due to my local environment. I switched to an Autodl cloud GPU, and it is running smoothly now. I will close this issue. Thank you!