Official repository of OFA (ICML 2022). Paper: OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework
I want to train OFA on a slurm cluster (2 nodes, each with 8 GPUs). Following the official OFA instructions, I add `--distributed-port=12345` to my run script.
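The script is roughly like the sketch below; the dataset path, task, and architecture flags are placeholders rather than my exact configuration:

```bash
#!/usr/bin/env bash
# Sketch of my per-node run script -- dataset path, task, and arch
# below are placeholders, not my exact configuration.

# The first host in the slurm allocation acts as the rendezvous master.
MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)

python -m torch.distributed.launch \
    --nnodes=2 \
    --nproc_per_node=8 \
    --node_rank="$SLURM_NODEID" \
    --master_addr="$MASTER_ADDR" \
    --master_port=12345 \
    train.py \
    /path/to/dataset \
    --task=caption \
    --arch=ofa_large \
    --distributed-port=12345  # the flag added per the OFA instructions
```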
File "/mnt/petrelfs/zhaozhiyuan/anaconda3/envs/ofa/lib/python3.8/site-packages/torch/distributed/launch.py", line 193, in <module>
main()
File "/mnt/petrelfs/zhaozhiyuan/anaconda3/envs/ofa/lib/python3.8/site-packages/torch/distributed/launch.py", line 189, in main
launch(args)
File "/mnt/petrelfs/zhaozhiyuan/anaconda3/envs/ofa/lib/python3.8/site-packages/torch/distributed/launch.py", line 174, in launch
run(args)
File "/mnt/petrelfs/zhaozhiyuan/anaconda3/envs/ofa/lib/python3.8/site-packages/torch/distributed/run.py", line 710, in run
However, the process keeps throwing this error:

```
Traceback (most recent call last):
  File "/mnt/lustre/zhaozhiyuan/anaconda3/envs/ofa/lib/python3.8/runpy.py", line 192, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/mnt/lustre/zhaozhiyuan/anaconda3/envs/ofa/lib/python3.8/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/mnt/petrelfs/zhaozhiyuan/anaconda3/envs/ofa/lib/python3.8/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/mnt/petrelfs/zhaozhiyuan/anaconda3/envs/ofa/lib/python3.8/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/mnt/petrelfs/zhaozhiyuan/anaconda3/envs/ofa/lib/python3.8/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/mnt/petrelfs/zhaozhiyuan/anaconda3/envs/ofa/lib/python3.8/site-packages/torch/distributed/run.py", line 710, in run
    elastic_launch(
  File "/mnt/petrelfs/zhaozhiyuan/anaconda3/envs/ofa/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/mnt/petrelfs/zhaozhiyuan/anaconda3/envs/ofa/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 252, in launch_agent
    result = agent.run()
  File "/mnt/petrelfs/zhaozhiyuan/anaconda3/envs/ofa/lib/python3.8/site-packages/torch/distributed/elastic/metrics/api.py", line 125, in wrapper
    result = f(*args, **kwargs)
  File "/mnt/petrelfs/zhaozhiyuan/anaconda3/envs/ofa/lib/python3.8/site-packages/torch/distributed/elastic/agent/server/api.py", line 709, in run
    result = self._invoke_run(role)
  File "/mnt/petrelfs/zhaozhiyuan/anaconda3/envs/ofa/lib/python3.8/site-packages/torch/distributed/elastic/agent/server/api.py", line 837, in _invoke_run
    self._initialize_workers(self._worker_group)
  File "/mnt/petrelfs/zhaozhiyuan/anaconda3/envs/ofa/lib/python3.8/site-packages/torch/distributed/elastic/metrics/api.py", line 125, in wrapper
    result = f(*args, **kwargs)
  File "/mnt/petrelfs/zhaozhiyuan/anaconda3/envs/ofa/lib/python3.8/site-packages/torch/distributed/elastic/agent/server/api.py", line 678, in _initialize_workers
    self._rendezvous(worker_group)
  File "/mnt/petrelfs/zhaozhiyuan/anaconda3/envs/ofa/lib/python3.8/site-packages/torch/distributed/elastic/metrics/api.py", line 125, in wrapper
    result = f(*args, **kwargs)
  File "/mnt/petrelfs/zhaozhiyuan/anaconda3/envs/ofa/lib/python3.8/site-packages/torch/distributed/elastic/agent/server/api.py", line 538, in _rendezvous
    store, group_rank, group_world_size = spec.rdzv_handler.next_rendezvous()
  File "/mnt/petrelfs/zhaozhiyuan/anaconda3/envs/ofa/lib/python3.8/site-packages/torch/distributed/elastic/rendezvous/static_tcp_rendezvous.py", line 55, in next_rendezvous
    self._store = TCPStore(  # type: ignore[call-arg]
RuntimeError: Address already in use
```
Watching the nodes, I notice that only one of the two is actually running; the other makes no progress.
Can you kindly tell me how to fix this? Thanks!