Jeff-sjtu / res-loglikelihood-regression

Code for "Human Pose Regression with Residual Log-likelihood Estimation", ICCV 2021 Oral
DDP port error #60

Closed YHaooo-4508 closed 1 year ago

YHaooo-4508 commented 1 year ago

[W socket.cpp:401] [c10d] The server socket has failed to bind to [::]:23456 (errno: 98 - Address already in use).
[W socket.cpp:401] [c10d] The server socket has failed to bind to 0.0.0.0:23456 (errno: 98 - Address already in use).
[E socket.cpp:435] [c10d] The server socket has failed to listen on any local network address.
[W socket.cpp:401] [c10d] The server socket has failed to bind to [::]:23456 (errno: 98 - Address already in use).
[W socket.cpp:401] [c10d] The server socket has failed to bind to 0.0.0.0:23456 (errno: 98 - Address already in use).
[E socket.cpp:435] [c10d] The server socket has failed to listen on any local network address.
[W socket.cpp:401] [c10d] The server socket has failed to bind to [::]:23456 (errno: 98 - Address already in use).
[W socket.cpp:401] [c10d] The server socket has failed to bind to 0.0.0.0:23456 (errno: 98 - Address already in use).
[E socket.cpp:435] [c10d] The server socket has failed to listen on any local network address.
[W socket.cpp:401] [c10d] The server socket has failed to bind to [::]:23456 (errno: 98 - Address already in use).
[W socket.cpp:401] [c10d] The server socket has failed to bind to 0.0.0.0:23456 (errno: 98 - Address already in use).
[E socket.cpp:435] [c10d] The server socket has failed to listen on any local network address.
Traceback (most recent call last):
  File "/home/ps/codes/res-loglikelihood-regression-master/scripts/train.py", line 175, in <module>
    main()
  File "/home/ps/codes/res-loglikelihood-regression-master/scripts/train.py", line 48, in main
    mp.spawn(main_worker, nprocs=ngpus_per_node, args=(opt, cfg))
  File "/home/ps/anaconda3/envs/torch/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 240, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/home/ps/anaconda3/envs/torch/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 198, in start_processes
    while not context.join():
  File "/home/ps/anaconda3/envs/torch/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 160, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 1 terminated with the following error:
Traceback (most recent call last):
  File "/home/ps/anaconda3/envs/torch/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
  File "/home/ps/codes/res-loglikelihood-regression-master/scripts/train.py", line 58, in main_worker
    init_dist(opt)
  File "/home/ps/codes/res-loglikelihood-regression-master/rlepose/utils/env.py", line 24, in init_dist
    dist.init_process_group(backend=opt.dist_backend, init_method=opt.dist_url,
  File "/home/ps/anaconda3/envs/torch/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 595, in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
  File "/home/ps/anaconda3/envs/torch/lib/python3.9/site-packages/torch/distributed/rendezvous.py", line 212, in _tcp_rendezvous_handler
    store = _create_c10d_store(result.hostname, result.port, rank, world_size, timeout)
  File "/home/ps/anaconda3/envs/torch/lib/python3.9/site-packages/torch/distributed/rendezvous.py", line 188, in _create_c10d_store
    return TCPStore(
RuntimeError: The server socket has failed to listen on any local network address. The server socket has failed to bind to [::]:23456 (errno: 98 - Address already in use). The server socket has failed to bind to 0.0.0.0:23456 (errno: 98 - Address already in use).

Process finished with exit code 1

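Running training directly produces the error above. Strangely, if I set a breakpoint at the location shown in the image below and then continue debugging from there, the rest of the code runs without this error. The error only occurs on a normal run, or when the first debug breakpoint is placed after the code shown in the image.

[image]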
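For anyone hitting the same `errno: 98` message: it means something was already bound to port 23456 when a worker tried to create the rendezvous server. If the cause is simply that the port is held by another process (for example a stale training run), one possible workaround is to check the port in the parent process and fall back to a free one. The sketch below is only an illustration under that assumption; `find_free_port`, `port_in_use`, and `make_dist_url` are hypothetical helpers, not part of this repository or of PyTorch, and this does not explain the breakpoint-dependent behaviour described above.

```python
import socket


def find_free_port(host="127.0.0.1"):
    """Ask the OS for an unused TCP port (hypothetical helper).

    Binding to port 0 lets the kernel choose a free port. The port is
    released when the socket closes, so another process could still grab
    it before training starts; this is a best-effort check only.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind((host, 0))
        return s.getsockname()[1]


def port_in_use(port, host="127.0.0.1"):
    """Return True if binding to (host, port) fails, i.e. the port is taken."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            s.bind((host, port))
        except OSError:
            return True
        return False


def make_dist_url(port=23456, host="127.0.0.1"):
    """Build a tcp:// init_method string, replacing the port if it is busy."""
    if port_in_use(port, host):
        port = find_free_port(host)
    return f"tcp://{host}:{port}"
```

If used, the URL would need to be computed once in the parent process, before `mp.spawn`, and stored somewhere all workers read it from (the traceback suggests `opt.dist_url`), so that every rank agrees on the same address. Computing it separately inside each worker would give each rank a different port.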