Hydra multi-run not supported with remote execution

cajewsa commented 1 year ago

Describe the bug

I would like to combine Hydra multi-runs with ClearML remote execution, i.e. configure a multi-run task with Hydra:

trainer:
  max_epochs: 500

hydra:
  mode: MULTIRUN
  sweeper:
    params:
      model.width: 256,1024
      model.depth: 1,3
      model.dropout: 0.0,0.5
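
For scale: Hydra's default (basic) sweeper expands comma-separated values into a Cartesian grid, so this config should yield 2 x 2 x 2 = 8 jobs. A small stand-alone sketch that enumerates them is below. It is plain Python rather than Hydra's own code, and `expand_grid` is a hypothetical helper used only for illustration.

```python
from itertools import product

# Sweep values copied from the config above.
sweep = {
    "model.width": [256, 1024],
    "model.depth": [1, 3],
    "model.dropout": [0.0, 0.5],
}


def expand_grid(params):
    """Return one override dict per point of the Cartesian product."""
    keys = list(params)
    return [dict(zip(keys, values)) for values in product(*params.values())]


jobs = expand_grid(sweep)
for index, job in enumerate(jobs):
    # Print each job in the same key=value form used for Hydra overrides.
    print(index, " ".join(f"{key}={value}" for key, value in job.items()))
```

With max_epochs: 500 per job, running all of these one after another on a single machine is what makes the sweep slow.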

This can create long-running sweeps that, by default, are executed sequentially on my local machine, but I would like to benefit from the parallelization of our agent/worker setup on ClearML. Therefore, I have added the following to my script:

task.execute_remotely("default")

My problem is that with execute_remotely and exit_process=True (the default), the entire multi-run is killed at the first run.
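
The effect can be illustrated without either library. sys.exit raises SystemExit, which propagates out of any loop that does not catch it explicitly. The toy loop below only stands in for a sequential sweep launcher (it is not Hydra's code, and `run_sweep` and `job_that_exits` are hypothetical names), under the assumption that the jobs of a local multi-run share one Python process.

```python
import sys

started = []


def run_sweep(jobs, job_fn):
    """Toy sequential launcher: call job_fn once per job, in order."""
    for job in jobs:
        job_fn(job)  # a SystemExit raised in here unwinds the whole loop


def job_that_exits(job):
    started.append(job)
    # Stands in for a call that ends the process via sys.exit().
    sys.exit(0)


try:
    run_sweep(["job0", "job1", "job2"], job_that_exits)
except SystemExit:
    # Caught only so this demo can report what happened; a real script
    # would simply terminate at this point.
    pass

print("jobs started:", started)
```

Only the first job is ever started; the remaining ones are never reached.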

One workaround could be to call execute_remotely("default", clone=True, exit_process=False) and then terminate the local run manually. To me, this seems like a bad fix for what should be supported behaviour.
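
A minimal sketch of that workaround, written against a duck-typed `task` so it runs without ClearML installed. The names `run_one_job`, `train_fn` and the `running_locally` flag are hypothetical. In a real script `task` would come from `Task.init(...)` and the flag presumably from `Task.running_locally()`. "Manually terminate" is interpreted here as returning early from the job function instead of calling sys.exit.

```python
def run_one_job(task, train_fn, running_locally, queue_name="default"):
    """Enqueue a clone when launched locally; train only when run by an agent.

    task:            object with a ClearML-style execute_remotely() method
    train_fn:        hypothetical training entry point, called with no arguments
    running_locally: True when started by hand, False when started by an agent
    """
    if running_locally:
        # clone=True enqueues a copy of the task; exit_process=False keeps
        # this process alive so the sweep can move on to its next job.
        task.execute_remotely(queue_name=queue_name, clone=True, exit_process=False)
        return "enqueued"  # skip local training for this job
    train_fn()
    return "trained"
```

Whether each sweep job ends up with its own ClearML task depends on how the task is created and closed per job, which this sketch does not cover.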

Ideally, exit_process would not use sys.exit, which kills the whole process, but something that terminates only the single Hydra job. I have started a discussion on the Hydra GitHub about what that signal could be.