allegroai / clearml

ClearML - Auto-Magical CI/CD to streamline your AI workload. Experiment Management, Data Management, Pipeline, Orchestration, Scheduling & Serving in one MLOps/LLMOps solution
https://clear.ml/docs
Apache License 2.0

`reuse_task=True` does not work correctly in `TaskScheduler().add_task()` #1075

Open jday1 opened 1 year ago

jday1 commented 1 year ago

Describe the bug

The reuse_task=True option of TaskScheduler().add_task() does not work as expected. The docstring for the class in clearml.automation.scheduler states:

:param reuse_task: If True, re-enqueue the same Task (i.e. do not clone it) every time, default False.

However, if the task is in a completed state, the scheduler does not re-enqueue it. It logs a warning and clones the task instead:

Launching job: ScheduleJob(name='alert_queue_wait_time', base_task_id='9d42b40d20c54d26bbfde771a3a0fc45', base_function=None, queue='aws-cpu-c5a-8x-ondemand', target_project=None, single_instance=False, task_parameters=None, task_overrides=None, clone_task=False, _executed_instances=['...'], execution_limit_hours=None, recurring=True, starting_time=datetime.datetime(2023, 7, 17, 18, 29, 0, 580538), minute=5, hour=None, day=None, weekdays=None, month=None, year=None, _next_run=datetime.datetime(2023, 7, 18, 8, 28, 15, 320712), _execution_timeout=None, _last_executed=datetime.datetime(2023, 7, 18, 8, 23, 15, 320712), _schedule_counter=165)
2023-07-18 08:28:16,088 - clearml.automation.job - WARNING - Task cloning disabled but requested Task [9d42b40d20c54d26bbfde771a3a0fc45] status=completed. Reverting to clone Task
2023-07-18 09:28:22
Scheduling Job alert_queue_wait_time, Task 3c24afaa4c914b3ca25d46af0c245ed5 on queue aws-cpu-c5a-8x-ondemand.
Waiting for next run [UTC 2023-07-18 08:33:18.381599], sleeping for 4.97 minutes, until next sync.

The warning comes from the following code in clearml.automation.job, which falls back to cloning whenever the task status is anything other than created:

        if disable_clone_task:
            self.task = base_temp_task
            task_status = self.task.status
            if task_status != Task.TaskStatusEnum.created:
                logger.warning('Task cloning disabled but requested Task [{}] status={}. '
                               'Reverting to clone Task'.format(base_task_id, task_status))
                disable_clone_task = False
                self.task = None

This code block needs to be modified so that, instead of reverting to a clone, it resets the task when it is in a state other than Task.TaskStatusEnum.created.
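A minimal sketch of the proposed behaviour follows. It is not the library's code: StubTask and select_task_for_run are hypothetical names used only for illustration, and the stub stands in for clearml.Task so the logic can be read and run in isolation. The assumption is that resetting a task returns it to the created state, after which it can be enqueued again without cloning.

```python
import logging

logger = logging.getLogger("reuse_task_sketch")


class StubTask:
    """Hypothetical stand-in for clearml.Task with only the members the sketch needs."""

    def __init__(self, task_id, status):
        self.id = task_id
        self.status = status
        self.reset_calls = 0

    def reset(self, force=False):
        # Assumption: a reset returns the task to the "created" state.
        self.status = "created"
        self.reset_calls += 1


def select_task_for_run(base_task, disable_clone_task):
    """Return (task_to_run, must_clone) for one scheduled launch.

    When cloning is disabled, the base task is reset (if needed) and reused
    instead of falling back to a clone.
    """
    if not disable_clone_task:
        # Cloning requested: the caller clones base_task as before.
        return None, True
    if base_task.status != "created":
        logger.info(
            "Task cloning disabled and Task [%s] status=%s. Resetting Task",
            base_task.id, base_task.status,
        )
        base_task.reset(force=True)
    return base_task, False
```

With this shape, a completed base task comes back as the same object in the created state, so every scheduled run would reuse one task ID rather than producing a new clone.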

To reproduce

To reproduce, first create a simple task whose ID can be passed as schedule_task_id. Then create a script:

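from clearml.automation import TaskScheduler

scheduler = TaskScheduler()

# Replace the "<TASK_ID>" and "<QUEUE>" placeholders with real values.
scheduler.add_task(
    name="my_task_name",
    schedule_task_id="<TASK_ID>",
    queue="<QUEUE>",
    minute=5,
    recurring=True,
    reuse_task=True,  # expected: re-enqueue the same task, no clone
)

# Run the scheduler itself as a remote task on the given queue.
scheduler.start_remotely("<QUEUE>")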

Either use a remote queue or run two daemons locally (one to run the scheduler, one to pick up the tasks). In the ClearML console, you will then see a series of separate completed tasks, one per scheduled run, rather than a single reused task.

Expected behaviour

There should be a single task: it completes, then on each scheduled run the scheduler resets it and enqueues it again.
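The reset-then-enqueue cycle described above can be sketched by hand as follows. This is an illustration under stated assumptions, not something the scheduler does today: reset_and_enqueue is a hypothetical helper name, and the task_cls parameter exists only so the helper can be exercised with a stub instead of a live ClearML server. It relies on Task.get_task, Task.reset and Task.enqueue from the clearml package.

```python
def reset_and_enqueue(task_id, queue_name, task_cls=None):
    """Reset an existing task and put it back on a queue without cloning it.

    task_cls defaults to clearml.Task; passing a stub class with the same
    get_task / enqueue methods allows running this without a server.
    """
    if task_cls is None:
        # Imported lazily so the module loads even where clearml is not installed.
        from clearml import Task as task_cls

    task = task_cls.get_task(task_id=task_id)
    if task.status != "created":
        # Assumption: force=True is acceptable here; it clears the task's
        # previous outputs so the same task can run again.
        task.reset(force=True)
    task_cls.enqueue(task, queue_name=queue_name)
    return task
```

Calling reset_and_enqueue("<TASK_ID>", "<QUEUE>") on each scheduled tick would keep a single task ID in the project, which is the outcome reuse_task=True is documented to give.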

Environment