Closed FarmerLiuAng closed 2 weeks ago
Marking as stale. No activity in 60 days. Remove stale label or comment or this will be closed in 7 days.
Hi. I'm not able to replicate this issue. Did you modify run_text_generation_server_345M.sh? Also, are you running inside a container? If so, which one?
Thanks for your reply. I didn't modify the script, and I ran it in the "nvcr.io/nvidia/pytorch:21.09-py3" container.
Could you please try running with a more recent container? I'd recommend 23.04 right now. Thanks
When I run run_text_generation_server_345M.sh, I get the following error. Could you help me?
[W socket.cpp:424] [c10d] The server socket cannot be initialized on [::]:6000 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:599] [c10d] The client socket cannot be initialized to connect to [localhost]:6000 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:599] [c10d] The client socket cannot be initialized to connect to [localhost]:6000 (errno: 97 - Address family not supported by protocol).
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/run.py", line 765, in <module>
    main()
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/run.py", line 761, in main
    run(args)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/run.py", line 752, in run
    elastic_launch(
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 236, in launch_agent
    result = agent.run()
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/metrics/api.py", line 125, in wrapper
    result = f(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/agent/server/api.py", line 711, in run
    result = self._invoke_run(role)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/agent/server/api.py", line 846, in _invoke_run
    self._initialize_workers(self._worker_group)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/metrics/api.py", line 125, in wrapper
    result = f(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/agent/server/api.py", line 680, in _initialize_workers
    self._rendezvous(worker_group)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/metrics/api.py", line 125, in wrapper
    result = f(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/agent/server/api.py", line 551, in _rendezvous
    master_addr, master_port = self._get_master_addr_port(store)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/agent/server/api.py", line 524, in _get_master_addr_port
    master_addr = os.environ['MASTER_ADDR']
  File "/opt/conda/lib/python3.8/os.py", line 675, in __getitem__
    raise KeyError(key) from None
KeyError: 'MASTER_ADDR'
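
The last frames of the traceback show the launcher reading MASTER_ADDR directly from os.environ, so the KeyError means that variable was not set in the environment the launcher ran in. A minimal sketch of a pre-launch check follows. The helper name missing_rendezvous_vars is hypothetical, and MASTER_PORT is included only as an assumption based on the function name _get_master_addr_port; the traceback itself confirms only MASTER_ADDR.

```python
import os

# MASTER_ADDR is taken from the traceback above.
# MASTER_PORT is an assumption inferred from the name _get_master_addr_port.
REQUIRED_VARS = ("MASTER_ADDR", "MASTER_PORT")


def missing_rendezvous_vars(env=None):
    """Return the names in REQUIRED_VARS that are absent from env.

    env defaults to the current process environment (os.environ).
    """
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if name not in env]


if __name__ == "__main__":
    missing = missing_rendezvous_vars()
    if missing:
        print("Unset before launch:", ", ".join(missing))
    else:
        print("MASTER_ADDR and MASTER_PORT are set")
```

Running this inside the same container and shell as the launch script would show whether the variables are visible there. It does not explain why they are unset, and it is not a substitute for the newer container suggested above.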