facebookresearch / hydra

Hydra is a framework for elegantly configuring complex applications
https://hydra.cc
MIT License
8.84k stars 640 forks source link

Ray Launcher: Additional properties are not allowed #2927

Open philkohl opened 4 months ago

philkohl commented 4 months ago

🐛 Bug

Description

Hi,

thank you for this beautiful library. Enjoying it for years! The last days I wanted to use the multirun command on an AWS cluster. And I noticed that you provide a ray launcher: https://hydra.cc/docs/plugins/ray_launcher/

I started with the "simple app" example from the documentation (https://github.com/facebookresearch/hydra/tree/main/plugins/hydra_ray_launcher/examples/simple). Running it out of the box, resulted in an error jsonschema.exceptions.ValidationError: Additional properties are not allowed ('autoscaling_mode', 'initial_workers', 'target_utilization_fraction' were unexpected). See the full output under Stack trace/error message.

So at that point I thought of creating my own custom ray aws config and gave it a try to comment out the additional properties. To follow along and reproduce my steps I created a little repo: https://github.com/philkohl/hydra-ray-aws-example

With this workaround I was able to start a ray head node. But I was not able to submit the tasks due to an import error ImportError: attempted relative import with no known parent package. For details see the second stack trace below.

Is there something wrong in my config or is there an issue in the plugin?

EDIT: I think I found a problem in my config for creating the python environment. I pushed a change to my repo. But I still face the import problem.

Therefore, I tested a workaround to replace all relative import to absolute imports via package notation in my site-packages for hydra_ray_launcher. E.g.:

from hydra_plugins.hydra_ray_launcher._launcher_util import (
    JOB_RETURN_PICKLE,
    JOB_SPEC_PICKLE,
    launch_job_on_ray,
    start_ray,
)

instead of

from ._launcher_util import (
    JOB_RETURN_PICKLE,
    JOB_SPEC_PICKLE,
    launch_job_on_ray,
    start_ray,
)

With this change it seems to work.

Checklist

To reproduce

Minimal Code/Config snippet to reproduce

  1. Follow along the documentation for the "Simple app" (https://hydra.cc/docs/plugins/ray_launcher/)
  2. I also created a repository for reproduction: https://github.com/philkohl/hydra-ray-aws-example

Stack trace/error message

[2024-07-21 10:43:47,244][HYDRA] Ray Launcher is launching 3 jobs, 
[2024-07-21 10:43:47,244][HYDRA]        #0 : task=1
[2024-07-21 10:43:47,319][HYDRA]        #1 : task=2
[2024-07-21 10:43:47,391][HYDRA]        #2 : task=3
[2024-07-21 10:43:47,469][HYDRA] Pickle for jobs: /tmp/tmp6574t01r/job_spec.pkl
Cluster: default

2024-07-21 10:43:47,480 INFO util.py:382 -- setting max workers for head node type to 0
Checking AWS environment settings
Traceback (most recent call last):
  File "/home/philipp/Repositories/hydra-aws-example/.venv/lib/python3.12/site-packages/hydra/_internal/utils.py", line 220, in run_and_report
    return func()
           ^^^^^^
  File "/home/philipp/Repositories/hydra-aws-example/.venv/lib/python3.12/site-packages/hydra/_internal/utils.py", line 466, in <lambda>
    lambda: hydra.multirun(
            ^^^^^^^^^^^^^^^
  File "/home/philipp/Repositories/hydra-aws-example/.venv/lib/python3.12/site-packages/hydra/_internal/hydra.py", line 162, in multirun
    ret = sweeper.sweep(arguments=task_overrides)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/philipp/Repositories/hydra-aws-example/.venv/lib/python3.12/site-packages/hydra/_internal/core_plugins/basic_sweeper.py", line 177, in sweep
    results = self.launcher.launch(batch, initial_job_idx=initial_job_idx)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/philipp/Repositories/hydra-aws-example/.venv/lib/python3.12/site-packages/hydra_plugins/hydra_ray_launcher/ray_aws_launcher.py", line 62, in launch
    return _core_aws.launch(
           ^^^^^^^^^^^^^^^^^
  File "/home/philipp/Repositories/hydra-aws-example/.venv/lib/python3.12/site-packages/hydra_plugins/hydra_ray_launcher/_core_aws.py", line 106, in launch
    return launch_jobs(
           ^^^^^^^^^^^^
  File "/home/philipp/Repositories/hydra-aws-example/.venv/lib/python3.12/site-packages/hydra_plugins/hydra_ray_launcher/_core_aws.py", line 117, in launch_jobs
    sdk.create_or_update_cluster(
  File "/home/philipp/Repositories/hydra-aws-example/.venv/lib64/python3.12/site-packages/ray/autoscaler/sdk/sdk.py", line 38, in create_or_update_cluster
    return commands.create_or_update_cluster(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/philipp/Repositories/hydra-aws-example/.venv/lib64/python3.12/site-packages/ray/autoscaler/_private/commands.py", line 314, in create_or_update_cluster
    config = _bootstrap_config(config, no_config_cache=no_config_cache)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/philipp/Repositories/hydra-aws-example/.venv/lib64/python3.12/site-packages/ray/autoscaler/_private/commands.py", line 408, in _bootstrap_config
    validate_config(config)
  File "/home/philipp/Repositories/hydra-aws-example/.venv/lib64/python3.12/site-packages/ray/autoscaler/_private/util.py", line 162, in validate_config
    raise e from None
  File "/home/philipp/Repositories/hydra-aws-example/.venv/lib64/python3.12/site-packages/ray/autoscaler/_private/util.py", line 160, in validate_config
    jsonschema.validate(config, schema)
  File "/home/philipp/Repositories/hydra-aws-example/.venv/lib/python3.12/site-packages/jsonschema/validators.py", line 1332, in validate
    raise error
jsonschema.exceptions.ValidationError: Additional properties are not allowed ('autoscaling_mode', 'initial_workers', 'target_utilization_fraction' were unexpected)
Traceback (most recent call last):
  File "/tmp/tmp.tTUvsfPQn6/_remote_invoke.py", line 18, in <module>
    from ._launcher_util import (
ImportError: attempted relative import with no known parent package
Shared connection to 3.79.57.96 closed.
Traceback (most recent call last):
  File "/home/philipp/Repositories/hydra-aws-example/.venv/lib/python3.12/site-packages/hydra/_internal/utils.py", line 220, in run_and_report
    return func()
           ^^^^^^
  File "/home/philipp/Repositories/hydra-aws-example/.venv/lib/python3.12/site-packages/hydra/_internal/utils.py", line 466, in <lambda>
    lambda: hydra.multirun(
            ^^^^^^^^^^^^^^^
  File "/home/philipp/Repositories/hydra-aws-example/.venv/lib/python3.12/site-packages/hydra/_internal/hydra.py", line 162, in multirun
    ret = sweeper.sweep(arguments=task_overrides)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/philipp/Repositories/hydra-aws-example/.venv/lib/python3.12/site-packages/hydra/_internal/core_plugins/basic_sweeper.py", line 177, in sweep
    results = self.launcher.launch(batch, initial_job_idx=initial_job_idx)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/philipp/Repositories/hydra-aws-example/.venv/lib/python3.12/site-packages/hydra_plugins/hydra_ray_launcher/ray_aws_launcher.py", line 62, in launch
    return _core_aws.launch(
           ^^^^^^^^^^^^^^^^^
  File "/home/philipp/Repositories/hydra-aws-example/.venv/lib/python3.12/site-packages/hydra_plugins/hydra_ray_launcher/_core_aws.py", line 106, in launch
    return launch_jobs(
           ^^^^^^^^^^^^
  File "/home/philipp/Repositories/hydra-aws-example/.venv/lib/python3.12/site-packages/hydra_plugins/hydra_ray_launcher/_core_aws.py", line 154, in launch_jobs
    sdk.run_on_cluster(
  File "/home/philipp/Repositories/hydra-aws-example/.venv/lib64/python3.12/site-packages/ray/autoscaler/sdk/sdk.py", line 109, in run_on_cluster
    return commands.exec_cluster(
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/home/philipp/Repositories/hydra-aws-example/.venv/lib64/python3.12/site-packages/ray/autoscaler/_private/commands.py", line 1167, in exec_cluster
    result = _exec(
             ^^^^^^
  File "/home/philipp/Repositories/hydra-aws-example/.venv/lib64/python3.12/site-packages/ray/autoscaler/_private/commands.py", line 1233, in _exec
    return updater.cmd_runner.run(
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/philipp/Repositories/hydra-aws-example/.venv/lib64/python3.12/site-packages/ray/autoscaler/_private/command_runner.py", line 383, in run
    return self._run_helper(final_cmd, with_output, exit_on_fail, silent=silent)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/philipp/Repositories/hydra-aws-example/.venv/lib64/python3.12/site-packages/ray/autoscaler/_private/command_runner.py", line 291, in _run_helper
    raise click.ClickException(
click.exceptions.ClickException: Command failed:

  ssh -tt -i /home/philipp/.ssh/hydra-philipp.pem -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_7aa2b466ee/7505d64a54/%C -o ControlPersist=10s -o ConnectTimeout=120s ec2-user@3.79.57.96 bash --login -c -i 'source ~/.bashrc; export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (python /tmp/tmp.tTUvsfPQn6/_remote_invoke.py /tmp/tmp.tTUvsfPQn6)'

Expected Behavior

  1. Spin up Ray cluster
  2. Submit tasks
  3. See a kind of this output like in the documentation
$ python my_app.py --multirun task=1,2,3
[HYDRA] Ray Launcher is launching 3 jobs, 
[HYDRA]        #0 : task=1
[HYDRA]        #1 : task=2
[HYDRA]        #2 : task=3
[HYDRA] Pickle for jobs: /var/folders/n_/9qzct77j68j6n9lh0lw3vjqcn96zxl/T/tmpqqg4v4i7/job_spec.pkl
Cluster: default
...
INFO services.py:1172 -- View the Ray dashboard at http://localhost:8265
(pid=3374) [__main__][INFO] - Executing task 1
(pid=3374) [__main__][INFO] - Executing task 2
(pid=3374) [__main__][INFO] - Executing task 3
...
[HYDRA] Stopping cluster now. (stop_cluster=true)
[HYDRA] Deleted the cluster (provider.cache_stopped_nodes=false)
Destroying cluster. Confirm [y/N]: y [automatic, due to --yes]
...
No nodes remaining.

System information

msminhas93 commented 1 month ago

I am facing the same issue with the ray aws example provided in the repo. @omry, could you please give any suggestions/help Thank you!

  raise error
jsonschema.exceptions.ValidationError: Additional properties are not allowed ('autoscaling_mode', 'initial_workers', 'target_utilization_fraction' were unexpected)
hydra-core                    1.3.2
hydra-ray-launcher            1.2.1
ray                           2.38.0