facebookresearch / hydra

Hydra is a framework for elegantly configuring complex applications
https://hydra.cc
MIT License
8.83k stars, 637 forks

[Bug] Ray Hydra Launcher fails to connect to the GCS #2709

Open stwerner97 opened 1 year ago

stwerner97 commented 1 year ago

🐛 Bug

Description

Using the Hydra Ray launcher, I want to submit Hydra's Simple Ray Launcher Example to a remote Ray cluster running on Kubernetes. Launching the job via `python my_app.py --multirun hydra/launcher=ray hydra.launcher.ray.init.address=localhost:8265` fails, however, because Ray cannot connect to its GCS:

[2023-07-11 16:21:28,604][HYDRA] Ray Launcher is launching 1 jobs, sweep output dir: multirun/2023-07-11/16-21-28
[2023-07-11 16:21:28,604][HYDRA] Initializing ray with config: {'address': 'localhost:8265'}
2023-07-11 16:21:28,604 INFO worker.py:1452 -- Connecting to existing Ray cluster at address: <ip-address>:8265...
2023-07-11 16:21:33,615 ERROR utils.py:1390 -- Failed to connect to GCS. Please check `gcs_server.out` for more details.
2023-07-11 16:21:33,615 WARNING utils.py:1397 -- Unable to connect to GCS at <ip-address>:8265. Check that (1) Ray GCS with matching version started successfully at the specified address, and (2) there is no firewall setting preventing access.
2023-07-11 16:21:40,629 ERROR utils.py:1390 -- Failed to connect to GCS. Please check `gcs_server.out` for more details.
2023-07-11 16:21:40,630 WARNING utils.py:1397 -- Unable to connect to GCS at <ip-address>:8265. Check that (1) Ray GCS with matching version started successfully at the specified address, and (2) there is no firewall setting preventing access.

I have configured the Ray launcher plugin to point to the address of the (port-forwarded) Ray dashboard. I verified that I can submit jobs through the Ray CLI: `ray job submit --address http://localhost:8265 -- python script.py` succeeds.
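The two invocations above pass the address in different forms: `ray job submit` receives an `http://` URL, while the launcher override passes a bare `host:port`, which the log shows Ray treating as a GCS address. The sketch below labels an address by its form, following the address conventions Ray uses for `ray.init` and the Jobs API. `describe_ray_address` is a hypothetical helper, not part of Ray or Hydra, and whether this distinction is the cause of the failure is not established in this thread.

```python
from urllib.parse import urlparse


def describe_ray_address(address: str) -> str:
    """Label a Ray address string by its form.

    Hypothetical helper for illustration only; not part of Ray or Hydra.
    """
    # Special keywords accepted by ray.init instead of a network address.
    if address in ("auto", "local"):
        return "special"
    # A bare host:port has no scheme; ray.init treats it as a GCS address.
    if "://" not in address:
        return "gcs"
    scheme = urlparse(address).scheme
    # http(s) URLs are what the Jobs API (e.g. `ray job submit`) expects.
    if scheme in ("http", "https"):
        return "job-submission-http"
    # ray:// addresses select the Ray Client connection mode.
    if scheme == "ray":
        return "ray-client"
    return "unknown"


if __name__ == "__main__":
    for addr in ("localhost:8265", "http://localhost:8265", "ray://localhost:10001"):
        print(f"{addr} -> {describe_ray_address(addr)}")
```

With the addresses from this report, `localhost:8265` is labeled `gcs` and `http://localhost:8265` is labeled `job-submission-http`, even though both name the same port. The port `10001` in the last example is only an assumed value for illustration.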

Some information on the Ray Kubernetes cluster:

I found little information on this problem and am unsure whether it belongs in the Hydra or the Ray repository.

Checklist

System information

shagunsodhani commented 1 year ago

I am not very familiar with the ray-launcher, but I will try to take a stab here.

You mentioned the address `http://localhost:8265` in the standalone command `ray job submit --address http://localhost:8265 -- python script.py`, while you use `hydra.launcher.ray.init.address=localhost:6379` in the config. Shouldn't these be the same?

stwerner97 commented 1 year ago

Hi @shagunsodhani, thanks for responding! 😊

Yes, they should be the same; I made a copy-paste error when reporting the issue. The issue also occurs with `hydra.launcher.ray.init.address=localhost:8265`. I've edited the initial issue report and corrected the port number.