NVIDIA / Megatron-LM

Ongoing research training transformer models at scale
https://docs.nvidia.com/megatron-core/developer-guide/latest/user-guide/index.html#quick-start

why errno: 97 - Address family not supported by protocol #301

Closed FarmerLiuAng closed 2 weeks ago

FarmerLiuAng commented 1 year ago

When I run run_text_generation_server_345M.sh, the following error occurs. Could you help me?

```
[W socket.cpp:424] [c10d] The server socket cannot be initialized on [::]:6000 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:599] [c10d] The client socket cannot be initialized to connect to [localhost]:6000 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:599] [c10d] The client socket cannot be initialized to connect to [localhost]:6000 (errno: 97 - Address family not supported by protocol).
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/run.py", line 765, in <module>
    main()
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/run.py", line 761, in main
    run(args)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/run.py", line 752, in run
    elastic_launch(
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 236, in launch_agent
    result = agent.run()
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/metrics/api.py", line 125, in wrapper
    result = f(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/agent/server/api.py", line 711, in run
    result = self._invoke_run(role)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/agent/server/api.py", line 846, in _invoke_run
    self._initialize_workers(self._worker_group)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/metrics/api.py", line 125, in wrapper
    result = f(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/agent/server/api.py", line 680, in _initialize_workers
    self._rendezvous(worker_group)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/metrics/api.py", line 125, in wrapper
    result = f(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/agent/server/api.py", line 551, in _rendezvous
    master_addr, master_port = self._get_master_addr_port(store)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/agent/server/api.py", line 524, in _get_master_addr_port
    master_addr = os.environ['MASTER_ADDR']
  File "/opt/conda/lib/python3.8/os.py", line 675, in __getitem__
    raise KeyError(key) from None
KeyError: 'MASTER_ADDR'
```

github-actions[bot] commented 1 year ago

Marking as stale. No activity in 60 days. Remove stale label or comment or this will be closed in 7 days.

jon-barker commented 1 year ago

Hi. I'm not able to replicate this issue. Did you modify run_text_generation_server_345M.sh? Also, are you running inside a container? If so, which one?

FarmerLiuAng commented 1 year ago

Thanks for your reply. I didn't modify the script, and I ran inside the nvcr.io/nvidia/pytorch:21.09-py3 container.

jon-barker commented 1 year ago

Could you please try running with a more recent container? I'd recommend 23.04 right now. Thanks.

github-actions[bot] commented 12 months ago

Marking as stale. No activity in 60 days.