allegroai / clearml

ClearML - Auto-Magical CI/CD to streamline your AI workload. Experiment Management, Data Management, Pipeline, Orchestration, Scheduling & Serving in one MLOps/LLMOps solution
https://clear.ml/docs
Apache License 2.0
5.72k stars, 657 forks

Using `target_project` with `TaskScheduler.add_task()` corrupts project? #1137

Open olucafont6 opened 1 year ago

olucafont6 commented 1 year ago

Describe the bug

When I use the TaskScheduler.add_task() function to schedule a task, and I specify a target_project, I get an error from the running pipeline that it can't find the specified project, and the project becomes hidden in the UI.

To reproduce

Basically I ran a Python script with this code:

from clearml.automation import TaskScheduler

task_scheduler = TaskScheduler(sync_frequency_minutes=15)

task_scheduler.add_task(minute=1, schedule_task_id="903b023a200c4541a485d45125f6e17d",
                        queue="pioneer", target_project="Chat Testing", recurring=False, execute_immediately=True)

task_scheduler.start()

(The task ID was the ID of a previous run of the pipeline.)

This started the task, but it failed when it reached the add_step call:

from clearml.automation import PipelineController

PROJECT_NAME = "Chat Testing"

pipeline_controller = PipelineController(
    docker="worker",
    name="chat-testing-pipeline",
    project=PROJECT_NAME,
    version="0.0.1",
    add_pipeline_tags=True,
)

pipeline_controller.set_default_execution_queue("pioneer")

pipeline_controller.add_step(
    name="chat-testing-make-requests-step",
    base_task_project=PROJECT_NAME,
    base_task_name="chat-testing-make-requests-task",
    cache_executed_step=False,
)

pipeline_controller.start()

  File "/root/.clearml/venvs-builds/3.11/lib/python3.11/site-packages/clearml/backend_interface/util.py", line 172, in get_single_result
    raise ValueError('No {entity}s found when searching for `{query}`'.format(**locals()))
ValueError: No projects found when searching for `Chat Testing`

The "Chat Testing" project then went partially missing from the UI (you can see it if you select Show Hidden Projects):

[Screenshot: project list with Show Hidden Projects selected]

If I try to run the pipeline / task a different way, I get the same error about not being able to find the project.

It seems like the project somehow got corrupted or something, but I'm not sure how to restore it so it acts normally.

I tried doing the same thing with another project and had the same problem.

Expected behaviour

Using the target_project parameter of TaskScheduler.add_task() would place the cloned Task in the specified project, and not corrupt the project (or whatever is happening).

Environment

eugen-ajechiloae-clearml commented 1 year ago

Hi @olucafont6! This seems to be an issue related to the pipeline directory structure. First, to unhide a project, you can do this:

from clearml.backend_api.session.client import APIClient
client = APIClient()
project = client.projects.get_all(name="^Chat Testing$", search_hidden=True, _allow_extra_fields_=True)[0]
system_tags = project.system_tags
system_tags.remove("hidden")
# system_tags.remove("pipeline")   # might also want to remove this tag
client.projects.update(project=project.id, system_tags=system_tags)
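A caveat on the snippet above: `list.remove` raises `ValueError` when the tag is not in the list, so running it against a project that is not actually hidden would fail. A minimal sketch of a more tolerant variant follows. `strip_system_tags` is a hypothetical helper introduced here for illustration, not part of ClearML.

```python
def strip_system_tags(system_tags, unwanted=("hidden",)):
    """Return a new list of tags without the unwanted ones.

    Hypothetical helper (not a ClearML API). Unlike list.remove(), it does
    not raise when a tag is missing, and it tolerates system_tags being None.
    """
    unwanted = set(unwanted)
    return [tag for tag in (system_tags or []) if tag not in unwanted]
```

The returned list could then be passed as `system_tags=` to the `client.projects.update(...)` call shown above, with `unwanted=("hidden", "pipeline")` if the `pipeline` tag should go as well.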

Then, to schedule a pipeline, change the target_project to <pipeline_project>/.pipelines/<pipeline_name>:

target_project='Chat Testing/.pipelines/chat-testing-pipeline'
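The naming convention above can be wrapped in a small helper so the path is built the same way everywhere. This is only a sketch: `pipeline_target_project` is a hypothetical name, and the `<pipeline_project>/.pipelines/<pipeline_name>` layout is taken from this comment, not from a documented API guarantee.

```python
def pipeline_target_project(pipeline_project, pipeline_name):
    """Build '<pipeline_project>/.pipelines/<pipeline_name>'.

    Hypothetical helper; the layout is assumed from the comment above.
    Leading and trailing slashes on the project name are dropped so the
    result does not contain an empty path segment.
    """
    return "{}/.pipelines/{}".format(pipeline_project.strip("/"), pipeline_name)


target_project = pipeline_target_project("Chat Testing", "chat-testing-pipeline")
```

The resulting string is what would be passed as `target_project=` to `TaskScheduler.add_task()`.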

The current behaviour is not ideal. We will make some changes to the scheduler to properly handle pipelines.

olucafont6 commented 1 year ago

@eugen-ajechiloae-clearml Awesome, thanks for the quick response!

I tried out the code you shared and it worked for un-hiding the project - after that I was able to run the pipeline fine (I didn't get the no-projects-found error).

> Then, to schedule a pipeline, change the target_project to <pipeline_project>/.pipelines/<pipeline_name>:
>
> target_project='Chat Testing/.pipelines/chat-testing-pipeline'

Interesting - does this mean we can schedule a pipeline without the task ID of a previous run of the pipeline? I tried messing around with the arguments, but it seemed like I could only change what project / pipeline the pipeline would run in with the target_project parameter - e.g., this doesn't work:

task_scheduler.add_task(minute=1, target_project="Chat Testing/.pipelines/chat-testing-pipeline",
                        queue="pioneer", recurring=False, execute_immediately=True)

(It gives this error:)

Traceback (most recent call last):
  File "/home/jovyan/work/{redacted}/{redacted}/scheduler/scheduler.py", line 5, in <module>
    task_scheduler.add_task(minute=1, schedule_function=None, target_project="Chat Testing/.pipelines/chat-testing-pipeline",
  File "/opt/conda/lib/python3.11/site-packages/clearml/automation/scheduler.py", line 618, in add_task
    mutually_exclusive(schedule_function=schedule_function, schedule_task_id=schedule_task_id)
  File "/opt/conda/lib/python3.11/site-packages/clearml/backend_interface/util.py", line 215, in mutually_exclusive
    at_least_one(_exception_cls=_exception_cls, _check_none=_check_none, **kwargs)
  File "/opt/conda/lib/python3.11/site-packages/clearml/backend_interface/util.py", line 208, in at_least_one
    raise _exception_cls('At least one of (%s) is required' % ', '.join(kwargs.keys()))
Exception: At least one of (schedule_function, schedule_task_id) is required

Not sure if that's what you meant though.

> The current behaviour is not ideal. We will make some changes to the scheduler to properly handle pipelines.

Sounds good - thanks!

eugen-ajechiloae-clearml commented 1 year ago

@olucafont6 You need to specify a schedule_task_id as well.
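The traceback earlier in the thread suggests that `add_task()` needs either `schedule_task_id` or `schedule_function`. A pre-flight check along these lines could surface the mistake before the scheduler is involved. This is a sketch, not ClearML's own validation: `check_schedule_source` is a hypothetical helper, and the "exactly one" rule is an assumption inferred from the `mutually_exclusive` call in that traceback.

```python
def check_schedule_source(schedule_task_id=None, schedule_function=None):
    """Return the name of the single source argument that was provided.

    Hypothetical helper (not a ClearML API). Assumes exactly one of the
    two arguments must be set, as the traceback in this thread suggests.
    """
    candidates = {
        "schedule_task_id": schedule_task_id,
        "schedule_function": schedule_function,
    }
    provided = [name for name, value in candidates.items() if value is not None]
    if len(provided) != 1:
        raise ValueError(
            "Exactly one of ({}) is required, got: {}".format(
                ", ".join(candidates), ", ".join(provided) or "none"
            )
        )
    return provided[0]
```

Calling it with only `schedule_task_id="903b023a200c4541a485d45125f6e17d"` returns `"schedule_task_id"`, while calling it with neither argument (as in the failing example) raises `ValueError`.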

olucafont6 commented 1 year ago

@eugen-ajechiloae-clearml Okay, yeah that's more what the documentation sounded like.

I tried this out and was able to change which pipeline the cloned task / pipeline ran in. Since I just wanted to run the task in the pipeline it normally belongs to, though, leaving out target_project worked fine.

olucafont6 commented 1 year ago

Re my previous question about scheduling a run of the pipeline without an ID of a previous run (https://github.com/allegroai/clearml/issues/1137#issuecomment-1764935198), I was able to get this working with PipelineController.get():

from clearml.automation import TaskScheduler, PipelineController

chat_testing_pipeline = PipelineController.get(
    pipeline_project="Chat Testing", pipeline_name="chat-testing-pipeline")

task_scheduler = TaskScheduler(
    force_create_task_project="Scheduler", force_create_task_name="Scheduling Service")

task_scheduler.add_task(day=1, name="Daily Chat Testing", schedule_task_id=chat_testing_pipeline.id,
                        queue="pioneer", recurring=True, execute_immediately=True)

task_scheduler.start_remotely()

One of the examples makes it look like you can get this to work with Task.get_task(), but that didn't seem to work for me: https://github.com/allegroai/clearml/blob/8a834af777d7c4d1541573158d627c9d39f5c7c5/examples/scheduler/cron_example.py#L15-L24

pollfly commented 10 months ago

Hey @olucafont6! Just letting you know that this issue has been resolved in the recently released v1.14.0.