allegroai / clearml

ClearML - Auto-Magical CI/CD to streamline your AI workload. Experiment Management, Data Management, Pipeline, Orchestration, Scheduling & Serving in one MLOps/LLMOps solution
https://clear.ml/docs
Apache License 2.0
5.58k stars 644 forks source link

Hydra multi-run not supported with remote execution #905

Open cajewsa opened 1 year ago

cajewsa commented 1 year ago

Describe the bug

I would like to combine hydra multi-runs with ClearML remote execution. I.e. configuring a multi-run task with hydra:

trainer:
  max_epochs: 500

hydra:
  mode: MULTIRUN
  sweeper:
    params:
      model.width: 256,1024
      model.depth: 1,3
      model.dropout: 0.0,0.5

This can create long-running tasks, that by default is being executed on my local machine sequentially, but I would like to benefit from parrallellization of our agent/worker setup on ClearML. Therefore, in my script I have added:

task.execute_remotely("default")

My problem is now that with execute_remotely and exit_process=True (default), the multi-run is being killed entirely at the first instance.

One workaround could be to execute_remotely("default", clone=True, exit_process=False) and then manually terminate execution. To me, this seems like a bad fix to what should be supported behaviour.

Ideally, exit_process would not use sys.exit, which kills entirely, but something that simply terminates the single hydra task. I have initiated a discussion on the Hydra Github on what signal that could be.

Environment

Related Discussion

https://clearml.slack.com/archives/CTK20V944/p1675415221867419

ainoam commented 1 year ago

Thanks for reporting @cajewsa, we'll update here as we progress towards a fix.

natephysics commented 1 year ago

I'd like to add my support for this feature.

d13g0 commented 1 year ago

+1 👍🏼