Closed szh404 closed 5 months ago
The error message may be "PermissionError: [Errno 13] Permission denied: '/dirs'"; you should replace `export OPENAI_LOGDIR=/dirs/to/log` with your own path. 😊
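Concretely, the fix just needs `OPENAI_LOGDIR` to point at a directory the current user can write to. A minimal sketch (the directory name `$HOME/sar2opt_logs` is only an example, not a path from the repository):

```shell
# Point the logger at a writable location instead of the '/dirs/to/log' placeholder.
# "$HOME/sar2opt_logs" is a hypothetical example name; use any directory you own.
export OPENAI_LOGDIR="$HOME/sar2opt_logs"
mkdir -p "$OPENAI_LOGDIR"   # creating it up front also rules out the PermissionError
```

Run this (or put the `export` line in train.sh) before launching training.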
Thank you for your guidance; your advice was very effective. I really appreciate it.
However, after running `sh train.sh`, another error appeared:
Traceback (most recent call last):
  File "scripts/image_train.py", line 86, in <module>
    main()
  File "scripts/image_train.py", line 22, in main
    dist_util.setup_dist()
  File "/home/shao/repo/Conditional-Diffusion-for-SAR-to-Optical-Image-Translation/guided_diffusion/dist_util.py", line 42, in setup_dist
    dist.init_process_group(backend=backend, init_method="env://")
  File "/home/shao/anaconda3/envs/torch21-py38/lib/python3.8/site-packages/torch/distributed/c10d_logger.py", line 74, in wrapper
    func_return = func(*args, **kwargs)
  File "/home/shao/anaconda3/envs/torch21-py38/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1148, in init_process_group
    default_pg, _ = _new_process_group_helper(
  File "/home/shao/anaconda3/envs/torch21-py38/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1264, in _new_process_group_helper
    backend_class = ProcessGroupGloo(backend_prefix_store, group_rank, group_size, timeout=timeout)
RuntimeError: Socket Timeout
From what I found searching, I suspect the cause is that my machine has a single GPU while the code expects multiple GPUs. Is that right?
In any case, thank you very much for your work.
May I ask where the dataset was downloaded from, and what its directory structure is? It would be great to get your reply.
Our example uses four GPUs by default, so you need to adjust `mpiexec -n x` to your setup. For example, if you have one GPU, use `mpiexec -n 1` in train.sh.
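For reference, a minimal sketch of what the adjusted train.sh might look like on a single-GPU machine. The log directory and data path are illustrative placeholders, not the repository's exact values; only the `-n 1` change is the point:

```shell
#!/bin/sh
# Single-GPU launch sketch: one MPI rank instead of the default four.
export OPENAI_LOGDIR="$HOME/sar2opt_logs"   # any writable directory (placeholder)
mpiexec -n 1 python scripts/image_train.py --data_dir /path/to/your/dataset
```

With `-n 1`, `dist.init_process_group` still runs, but with a single rank, so the Gloo rendezvous no longer waits for peers that never start.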
Thank you very much for your patience and guidance; I will try it. May you succeed in your studies, and may everything go smoothly for you.
Sorry, I still cannot fix some problems.
I checked your repository's issues and guided-diffusion's issues, but I didn't find the same problem in either.
I would appreciate it if you could give me some advice.
Best wishes
Sorry, I have already got the program running. The issues mentioned above were due to my computer's environment: I have just switched to an AutoDL cloud GPU and it is running smoothly now, so I will close this issue. Thank you!
You have done a great job, thank you very much. However, there are some issues: when I run `sh train.sh`, the errors below occur, and I would like to know how to avoid them.
```
Traceback (most recent call last):
  File "scripts/image_train.py", line 86, in <module>
    main()
  File "scripts/image_train.py", line 23, in main
    logger.configure()
  File "/home/shao/repo/Conditional-Diffusion-for-SAR-to-Optical-Image-Translation/guided_diffusion/logger.py", line 454, in configure
    os.makedirs(os.path.expanduser(dir), exist_ok=True)
  File "/home/shao/anaconda3/envs/torch21-py38/lib/python3.8/os.py", line 213, in makedirs
    makedirs(head, exist_ok=exist_ok)
  File "/home/shao/anaconda3/envs/torch21-py38/lib/python3.8/os.py", line 213, in makedirs
    makedirs(head, exist_ok=exist_ok)
  File "/home/shao/anaconda3/envs/torch21-py38/lib/python3.8/os.py", line 223, in makedirs
    mkdir(name, mode)
PermissionError: [Errno 13] Permission denied: '/dirs'
[E ProcessGroupGloo.cpp:138] Gloo connectFullMesh failed with Connection reset by peer
[E ProcessGroupGloo.cpp:138] Gloo connectFullMesh failed with Connection reset by peer
[E ProcessGroupGloo.cpp:138] Gloo connectFullMesh failed with Connection reset by peer
Traceback (most recent call last):
  File "scripts/image_train.py", line 86, in <module>
    main()
  File "scripts/image_train.py", line 22, in main
    dist_util.setup_dist()
  File "/home/shao/repo/Conditional-Diffusion-for-SAR-to-Optical-Image-Translation/guided_diffusion/dist_util.py", line 42, in setup_dist
    dist.init_process_group(backend=backend, init_method="env://")
  File "/home/shao/anaconda3/envs/torch21-py38/lib/python3.8/site-packages/torch/distributed/c10d_logger.py", line 74, in wrapper
    func_return = func(*args, **kwargs)
  File "/home/shao/anaconda3/envs/torch21-py38/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1148, in init_process_group
    default_pg, _ = _new_process_group_helper(
  File "/home/shao/anaconda3/envs/torch21-py38/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1264, in _new_process_group_helper
    backend_class = ProcessGroupGloo(backend_prefix_store, group_rank, group_size, timeout=timeout)
RuntimeError: Gloo connectFullMesh failed with Connection reset by peer
Traceback (most recent call last):
  File "scripts/image_train.py", line 86, in <module>
    main()
  File "scripts/image_train.py", line 22, in main
    dist_util.setup_dist()
  File "/home/shao/repo/Conditional-Diffusion-for-SAR-to-Optical-Image-Translation/guided_diffusion/dist_util.py", line 42, in setup_dist
    dist.init_process_group(backend=backend, init_method="env://")
  File "/home/shao/anaconda3/envs/torch21-py38/lib/python3.8/site-packages/torch/distributed/c10d_logger.py", line 74, in wrapper
    func_return = func(*args, **kwargs)
  File "/home/shao/anaconda3/envs/torch21-py38/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1148, in init_process_group
    default_pg, _ = _new_process_group_helper(
  File "/home/shao/anaconda3/envs/torch21-py38/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1264, in _new_process_group_helper
    backend_class = ProcessGroupGloo(backend_prefix_store, group_rank, group_size, timeout=timeout)
RuntimeError: Gloo connectFullMesh failed with Connection reset by peer
Traceback (most recent call last):
  File "scripts/image_train.py", line 86, in <module>
    main()
  File "scripts/image_train.py", line 22, in main
    dist_util.setup_dist()
  File "/home/shao/repo/Conditional-Diffusion-for-SAR-to-Optical-Image-Translation/guided_diffusion/dist_util.py", line 42, in setup_dist
    dist.init_process_group(backend=backend, init_method="env://")
  File "/home/shao/anaconda3/envs/torch21-py38/lib/python3.8/site-packages/torch/distributed/c10d_logger.py", line 74, in wrapper
    func_return = func(*args, **kwargs)
  File "/home/shao/anaconda3/envs/torch21-py38/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1148, in init_process_group
    default_pg, _ = _new_process_group_helper(
  File "/home/shao/anaconda3/envs/torch21-py38/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1264, in _new_process_group_helper
    backend_class = ProcessGroupGloo(backend_prefix_store, group_rank, group_size, timeout=timeout)
RuntimeError: Gloo connectFullMesh failed with Connection reset by peer
Primary job terminated normally, but 1 process returned a non-zero exit code. Per user-direction, the job has been aborted.
mpiexec detected that one or more processes exited with non-zero status, thus causing the job to be terminated. The first process to do so was:
  Process name: [[26054,1],0]  Exit code: 1
```