[W socket.cpp:401] [c10d] The server socket has failed to bind to [::]:23456 (errno: 98 - Address already in use).
[W socket.cpp:401] [c10d] The server socket has failed to bind to 0.0.0.0:23456 (errno: 98 - Address already in use).
[E socket.cpp:435] [c10d] The server socket has failed to listen on any local network address.
[W socket.cpp:401] [c10d] The server socket has failed to bind to [::]:23456 (errno: 98 - Address already in use).
[W socket.cpp:401] [c10d] The server socket has failed to bind to 0.0.0.0:23456 (errno: 98 - Address already in use).
[E socket.cpp:435] [c10d] The server socket has failed to listen on any local network address.
[W socket.cpp:401] [c10d] The server socket has failed to bind to [::]:23456 (errno: 98 - Address already in use).
[W socket.cpp:401] [c10d] The server socket has failed to bind to 0.0.0.0:23456 (errno: 98 - Address already in use).
[E socket.cpp:435] [c10d] The server socket has failed to listen on any local network address.
[W socket.cpp:401] [c10d] The server socket has failed to bind to [::]:23456 (errno: 98 - Address already in use).
[W socket.cpp:401] [c10d] The server socket has failed to bind to 0.0.0.0:23456 (errno: 98 - Address already in use).
[E socket.cpp:435] [c10d] The server socket has failed to listen on any local network address.
Traceback (most recent call last):
File "/home/ps/codes/res-loglikelihood-regression-master/scripts/train.py", line 175, in <module>
main()
File "/home/ps/codes/res-loglikelihood-regression-master/scripts/train.py", line 48, in main
mp.spawn(main_worker, nprocs=ngpus_per_node, args=(opt, cfg))
File "/home/ps/anaconda3/envs/torch/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 240, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
File "/home/ps/anaconda3/envs/torch/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 198, in start_processes
while not context.join():
File "/home/ps/anaconda3/envs/torch/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 160, in join
raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:
-- Process 1 terminated with the following error:
Traceback (most recent call last):
File "/home/ps/anaconda3/envs/torch/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
fn(i, *args)
File "/home/ps/codes/res-loglikelihood-regression-master/scripts/train.py", line 58, in main_worker
init_dist(opt)
File "/home/ps/codes/res-loglikelihood-regression-master/rlepose/utils/env.py", line 24, in init_dist
dist.init_process_group(backend=opt.dist_backend, init_method=opt.dist_url,
File "/home/ps/anaconda3/envs/torch/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 595, in init_process_group
store, rank, world_size = next(rendezvous_iterator)
File "/home/ps/anaconda3/envs/torch/lib/python3.9/site-packages/torch/distributed/rendezvous.py", line 212, in _tcp_rendezvous_handler
store = _create_c10d_store(result.hostname, result.port, rank, world_size, timeout)
File "/home/ps/anaconda3/envs/torch/lib/python3.9/site-packages/torch/distributed/rendezvous.py", line 188, in _create_c10d_store
return TCPStore(
RuntimeError: The server socket has failed to listen on any local network address. The server socket has failed to bind to [::]:23456 (errno: 98 - Address already in use). The server socket has failed to bind to 0.0.0.0:23456 (errno: 98 - Address already in use).
Process finished with exit code 1
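Running training directly produces the error above. Strangely, if I set a breakpoint at the location shown in the image below and then continue debugging from there, the rest of the code runs fine and this error does not appear. The error only occurs on a normal run, or when the first debug breakpoint is placed after the code shown in the image below.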
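
For anyone trying to reproduce this: errno 98 means something was already bound to port 23456 when the process tried to create the TCPStore server socket. Below is a minimal diagnostic sketch, not part of the repository, for checking whether a port is currently bindable and for asking the OS for a free one. The helper names `port_is_free` and `find_free_port` are hypothetical, and using the result in a `tcp://127.0.0.1:<port>` style `dist_url` is an assumption about how this project is configured.

```python
import socket


def port_is_free(port: int, host: str = "0.0.0.0") -> bool:
    """Return True if a TCP socket can currently be bound to (host, port)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            s.bind((host, port))
        except OSError:
            # Typically errno 98 (EADDRINUSE) on Linux when the port is taken.
            return False
        return True


def find_free_port() -> int:
    """Ask the OS for an unused TCP port by binding to port 0."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("127.0.0.1", 0))
        return s.getsockname()[1]


if __name__ == "__main__":
    print("port 23456 free:", port_is_free(23456))
    print("a free port:", find_free_port())
```

Running this before launching training shows whether another process (for example a leftover worker from an earlier run) is still holding the port. Note that a port returned by `find_free_port` can in principle be taken by another process before training binds it, so this is only a quick check, not a guarantee.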