allegroai / clearml

ClearML - Auto-Magical CI/CD to streamline your AI workload. Experiment Management, Data Management, Pipeline, Orchestration, Scheduling & Serving in one MLOps/LLMOps solution
https://clear.ml/docs
Apache License 2.0
5.57k stars, 644 forks

Remote execution of HPO example breaks with "Invalid task id: id=" #1151

Open vaskokj opened 10 months ago

vaskokj commented 10 months ago

Describe the bug

Using the HPO example

git clone https://github.com/allegroai/clearml.git
cd clearml/examples/optimization/hyper-parameter-optimization

On my local system (which is configured to talk to a local ClearML instance), I ran

python ./hyper_parameter_optimizer.py

The only edit I made was changing execution_queue = '1xGPU' to execution_queue = 'my_queue_name'. Run this way, the code works fine.

I then tried to run the same script through clearml-task:

clearml-task --project="Hyper-Parameter Optimization" \
--name="Automatic Hyper-Parameter Optimization" \
--repo https://<mylocalrepo>/clearml-examples.git \
--script clearml-hpo/hyper_parameter_optimizer.py \
--queue my_queue_name

and got the following error(s). It's acting like it can't find the 'Keras HP optimization base' task.

Apologies, it seems you do not have 'optuna' or 'hpbandster' installed, we will be using RandomSearch strategy instead
2023-11-13 22:13:46,090 - clearml.util - WARNING - 10 task found when searching for `{'project_name': 'examples', 'task_name': 'Keras HP optimization base', 'include_archived': True}`
2023-11-13 22:13:46,090 - clearml.util - WARNING - Selected task `Keras HP optimization base` (id=6f7a15e1c4024d1e876660fa792dda3c)
2023-11-13 22:13:46,123 - clearml - WARNING - Could not retrieve remote configuration named 'General'
Using default configuration: {'parameter_optimization_space': [{'type': 'UniformIntegerParameterRange', 'name': 'General/layer_1', 'min_value': 128, 'max_value': 512, 'step_size': 128, 'include_max': True}, {'type': 'UniformIntegerParameterRange', 'name': 'General/layer_2', 'min_value': 128, 'max_value': 512, 'step_size': 128, 'include_max': True}, {'type': 'DiscreteParameterRange', 'name': 'General/batch_size', 'values': [96, 128, 160]}, {'type': 'DiscreteParameterRange', 'name': 'General/epochs', 'values': [30]}]}
2023-11-13 22:13:46,669 - clearml.util - WARNING - 128 task found when searching for `{'include_archived': True}`
2023-11-13 22:13:46,670 - clearml.util - WARNING - Selected task `Automatic Hyper-Parameter Optimization` (id=769a29b9af0a4558b73eb45570f8f2fa)
2023-11-13 22:13:46,681 - clearml.automation.optimization - WARNING - Could not find requested hyper-parameters ['General/layer_1', 'General/layer_2', 'General/batch_size', 'General/epochs'] on base task 
2023-11-13 22:13:46,693 - clearml.automation.optimization - WARNING - Could not find requested metric ('epoch_accuracy', 'epoch_accuracy') report on base task 
Progress report #0 completed, sleeping for 0.25 minutes
2023-11-13 22:13:46,829 - clearml.util - WARNING - 128 task found when searching for `{'include_archived': True}`
2023-11-13 22:13:46,829 - clearml.util - WARNING - Selected task `Automatic Hyper-Parameter Optimization` (id=769a29b9af0a4558b73eb45570f8f2fa)
Exception in thread Thread-1:
Traceback (most recent call last):
  File "/usr/lib/python3.8/threading.py", line 932, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.8/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/root/.clearml/venvs-builds/3.8/lib/python3.8/site-packages/clearml/automation/optimization.py", line 1703, in _daemon
    self.optimizer.start()
  File "/root/.clearml/venvs-builds/3.8/lib/python3.8/site-packages/clearml/automation/optimization.py", line 353, in start
    if not self.process_step():
  File "/root/.clearml/venvs-builds/3.8/lib/python3.8/site-packages/clearml/automation/optimization.py", line 407, in process_step
    new_job = self.create_job()
  File "/root/.clearml/venvs-builds/3.8/lib/python3.8/site-packages/clearml/automation/optimization.py", line 1078, in create_job
    return self.helper_create_job(base_task_id=self._base_task_id, parameter_override=parameters)
  File "/root/.clearml/venvs-builds/3.8/lib/python3.8/site-packages/clearml/automation/optimization.py", line 733, in helper_create_job
    new_job = self._job_class(
  File "/root/.clearml/venvs-builds/3.8/lib/python3.8/site-packages/clearml/automation/job.py", line 638, in __init__
    self.task = Task.clone(
  File "/root/.clearml/venvs-builds/3.8/lib/python3.8/site-packages/clearml/task.py", line 1300, in clone
    cloned_task_id = cls._clone_task(cloned_task_id=task_id, name=name, comment=comment,
  File "/root/.clearml/venvs-builds/3.8/lib/python3.8/site-packages/clearml/backend_interface/task/task.py", line 2727, in _clone_task
    res = cls._send(
  File "/root/.clearml/venvs-builds/3.8/lib/python3.8/site-packages/clearml/backend_interface/base.py", line 107, in _send
    raise SendError(res, error_msg)
clearml.backend_interface.session.SendError: Action failed <400/101: tasks.clone/v1.0 (Invalid task id: id=)> (task=, new_task_name=Automatic Hyper-Parameter Optimization: General/batch_size=128 General/epochs=30 General/layer_1=384 General/layer_2=256, new_task_comment=General/batch_size=128
General/epochs=30
General/layer_1=384
General/layer_2=256, new_task_parent=769a29b9af0a4558b73eb45570f8f2fa, new_task_project=8cbdc880c25747a88bc41e12cc74378c)
[]
We are done, good bye
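
The traceback above ends with tasks.clone being called with an empty id (`task=`), which suggests the base task id was empty by the time the optimizer tried to clone it. As a possible workaround while the cause is investigated (a minimal sketch, not the maintainers' fix), a copy of the example script could fail fast before the optimizer is constructed. `resolve_base_task_id` and its `lookup` argument are hypothetical names introduced here; in the example, `lookup` would presumably wrap the project/name search that appears in the log.

```python
from typing import Callable, Optional


def resolve_base_task_id(
    configured_id: Optional[str],
    lookup: Callable[[], Optional[str]],
) -> str:
    """Return a non-empty base task id, or raise a descriptive error.

    configured_id: the id read from the script arguments (may be None or '').
    lookup: hypothetical fallback that searches for the base task and returns
            its id, e.g. a wrapper around
            Task.get_task(project_name='examples',
                          task_name='Keras HP optimization base').id
    """
    candidate = (configured_id or "").strip()
    if not candidate:
        # Nothing usable was configured, so fall back to the search.
        candidate = (lookup() or "").strip()
    if not candidate:
        # Stop here instead of letting tasks.clone fail with "Invalid task id: id=".
        raise ValueError(
            "Base task id is empty; refusing to start the optimizer "
            "because cloning the base task would fail."
        )
    return candidate
```

A guard like this only turns the server-side `Invalid task id: id=` error into an earlier, local message. It does not explain why the id ends up empty in the remote run.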

For comparison, these are my results when I run it with `python ./hyper_parameter_optimizer.py`.

2023-11-13 20:53:42,398 - clearml.util - WARNING - 10 task found when searching for `{'project_name': 'examples', 'task_name': 'Keras HP optimization base', 'include_archived': True}`
2023-11-13 20:53:42,399 - clearml.util - WARNING - Selected task `Keras HP optimization base` (id=6f7a15e1c4024d1e876660fa792dda3c)
Progress report #0 completed, sleeping for 0.25 minutes
2023-11-13 20:53:42,771 - clearml.automation.optimization - INFO - Creating new Task: {'General/layer_1': 384, 'General/layer_2': 256, 'General/batch_size': 128, 'General/epochs': 30}
2023-11-13 20:53:42,978 - clearml.automation.optimization - INFO - Creating new Task: {'General/layer_1': 384, 'General/layer_2': 384, 'General/batch_size': 128, 'General/epochs': 30}
1699908823827 ip-10-52-156-47 info ClearML Monitor: GPU monitoring failed getting GPU reading, switching off GPU monitoring
1699908837668 ip-10-52-156-47 info Progress report #1 completed, sleeping for 0.2 minutes
1699908849722 ip-10-52-156-47 info Progress report #2 completed, sleeping for 0.2 minutes
1699908861776 ip-10-52-156-47 info Progress report #3 completed, sleeping for 0.2 minutes
1699908873837 ip-10-52-156-47 info Progress report #4 completed, sleeping for 0.2 minutes
1699908885892 ip-10-52-156-47 info Progress report #5 completed, sleeping for 0.2 minutes
1699908897948 ip-10-52-156-47 info Progress report #6 completed, sleeping for 0.2 minutes
1699908910004 ip-10-52-156-47 info Progress report #7 completed, sleeping for 0.2 minutes
1699908922059 ip-10-52-156-47 info Progress report #8 completed, sleeping for 0.2 minutes
1699908934111 ip-10-52-156-47 info Progress report #9 completed, sleeping for 0.2 minutes
1699908944210 ip-10-52-156-47 info 2023-11-13 20:55:44,210 - clearml.automation.optimization - INFO - Creating new Task: {'General/layer_1': 384, 'General/layer_2': 128, 'General/batch_size': 128, 'General/epochs': 30}
1699908946257 ip-10-52-156-47 info Job completed! 028b607db22b43e49f5f1a54d1ff8edc 1.0 29 {'status': 'completed', 'General/layer_1': 384, 'General/layer_2': 256, 'General/batch_size': 128, 'General/epochs': 30}
WOOT WOOT we broke the record! Objective reached 1.0
Updating job performance summary plot/table
...
...
...

To reproduce

  1. Clone the repo.
  2. Change the queue name (I don't think this is related).
  3. Run base_template_keras_simple.py, either with python base_template_keras_simple.py or with clearml-task:
    clearml-task --project "examples" \
    --name "Keras HP optimization base" \
    --repo https://my_repo/clearml-examples.git \
    --script clearml-hpo/base_template_keras_simple.py \
    --queue my_queue
  4. Run the hyper_parameter_optimizer.py script:
    clearml-task --project="Hyper-Parameter Optimization" \
    --name="Automatic Hyper-Parameter Optimization" \
    --repo https://<mylocalrepo>/clearml-examples.git \
    --script clearml-hpo/hyper_parameter_optimizer.py \
    --queue my_queue_name
  5. It "fails" to run because it can't find the base_template_keras_simple.py run id, even though it's there.
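
To make the remote submission repeatable while debugging, the clearml-task call for the optimizer script can be assembled in a small script. This is a rough sketch that uses only the flags shown in this report; `build_clearml_task_command` is a hypothetical helper, and the repo URL is the placeholder from the report, not a real address.

```python
import shlex
from typing import List


def build_clearml_task_command(
    project: str, name: str, repo: str, script: str, queue: str
) -> List[str]:
    """Assemble the clearml-task argv using only the flags shown in the report."""
    return [
        "clearml-task",
        "--project", project,
        "--name", name,
        "--repo", repo,
        "--script", script,
        "--queue", queue,
    ]


hpo_command = build_clearml_task_command(
    project="Hyper-Parameter Optimization",
    name="Automatic Hyper-Parameter Optimization",
    repo="https://<mylocalrepo>/clearml-examples.git",  # placeholder from the report
    script="clearml-hpo/hyper_parameter_optimizer.py",
    queue="my_queue_name",
)

# Print a shell-quoted command line; pass hpo_command to subprocess.run to execute it.
print(shlex.join(hpo_command))
```

Keeping the arguments as a list avoids shell-quoting mistakes with the project and task names, which contain spaces.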

Expected behaviour

The remote run should have executed the same way as running python hyper_parameter_optimizer.py locally.

This issue seems the most closely related, but its fix is already implemented in the HPO code and I'm using a version that includes it: https://github.com/allegroai/clearml/issues/274

Environment

ainoam commented 10 months ago

Thanks for reporting @vaskokj.

Something indeed seems to be funky in this scenario. We'll investigate and update when we have a fix.

WolodjaZ commented 1 month ago

I have a similar problem. Is there a solution yet?

ainoam commented 1 month ago

Still pending @WolodjaZ - Hope to have a fix soon.