allegroai / clearml

ClearML - Auto-Magical CI/CD to streamline your AI workload. Experiment Management, Data Management, Pipeline, Orchestration, Scheduling & Serving in one MLOps/LLMOps solution
https://clear.ml/docs
Apache License 2.0
5.69k stars 655 forks

Pipeline from Decorators: pipeline and components in different files :: unable to re-run #983

Closed grimmJ04 closed 1 year ago

grimmJ04 commented 1 year ago

Describe the bug

Based on the online documentation for Pipelines from Decorators, one can build a minimal example like the one presented in pipeline_from_decorator.py.
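
For reference, that documented single-file pattern boils down to something like the following (a condensed, illustrative sketch, not the exact docs file):

# single-file variant: component and pipeline defined together -- this re-runs fine
from clearml.automation.controller import PipelineDecorator

@PipelineDecorator.component(return_values=['data'], cache=True)
def step_one(seed: int = 42):
    import numpy as np
    np.random.seed(seed)
    return np.random.rand(512, 9)

@PipelineDecorator.pipeline(name='single-file demo', project='examples', version='0.1')
def executing_pipeline(seed=42):
    data = step_one(seed=seed)
    print('generated data: {}'.format(data))

if __name__ == '__main__':
    PipelineDecorator.set_default_execution_queue('default')
    executing_pipeline()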

However, if I try to place the functions decorated with @PipelineDecorator.component inside another Python file in a different directory, I get the following error when trying to re-run the experiment from the web UI, using the + NEW RUN button.

Traceback (most recent call last):
  File "/root/.clearml/venvs-builds/task_repository/${MY_GIT_REPO_NAME}.git/pipelines/pipeline_demo.py", line 42, in <module>
    executing_pipeline()
  File "/usr/local/lib/python3.11/site-packages/clearml/automation/controller.py", line 4163, in internal_decorator
    a_pipeline._start(wait=False)
  File "/usr/local/lib/python3.11/site-packages/clearml/automation/controller.py", line 1393, in _start
    self._prepare_pipeline(step_task_completed_callback, step_task_created_callback)
  File "/usr/local/lib/python3.11/site-packages/clearml/automation/controller.py", line 1427, in _prepare_pipeline
    if not self._verify():
           ^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/clearml/automation/controller.py", line 1636, in _verify
    self._verify_node(node)
  File "/usr/local/lib/python3.11/site-packages/clearml/automation/controller.py", line 1652, in _verify_node
    raise ValueError("Node '{}', base_task_id is empty".format(node.name))
ValueError: Node 'step_one', base_task_id is empty

Note that when run from the console or an IDE, the pipeline runs just fine the first time. One can wait for the pipeline execution to complete, clone it, make a draft, etc. However, I am not able to re-run this experiment using different, or even the same, parameters due to the error above.

My goal is for my pipeline components to be reusable. However, this bug(?) prevents me from doing so.

To reproduce

Folder structure

project-folder
+-- pipelines
|   +-- components
|   |   +-- _demo
|   |   |   +-- comp.py
|   +-- pipeline_demo.py
|   +-- _config.py

_config.py

from types import MappingProxyType

ENV_CONFIG = MappingProxyType(dict(
    docker='my-docker-image',  # for this experiment, any image with clearml, scikit-learn, and numpy should be fine
    docker_args='--env GIT_SSL_NO_VERIFY=1 '
                '--env CLEARML_AGENT_SKIP_PIP_VENV_INSTALL=1 '
                '--env CLEARML_AGENT_SKIP_PYTHON_ENV_INSTALL=1 '
                '--env LOCAL_PYTHON=python ',
    repo='my-repository',
    repo_branch='branch-in-repo (main)'
))

pipeline_demo.py

from clearml.automation.controller import PipelineDecorator

try:
    from pipelines._config import ENV_CONFIG
except ImportError:
    import sys
    from pathlib import Path

    print('Trying to append parent path to sys.path ...')
    sys.path.append(str(Path(__file__).parent.parent))

    print('Path is now:')
    print(sys.path)

    from pipelines._config import ENV_CONFIG

env_config = ENV_CONFIG

@PipelineDecorator.pipeline(name='pipeline demo', project='examples', version='0.1.5', **env_config)
def executing_pipeline(seed=42, random_state=42, test_size=0.2):
    print('pipeline args:', dict(seed=seed, random_state=random_state, test_size=test_size))

    print('launch step one')
    from pipelines.components._demo.comp import step_one
    data = step_one(seed=seed)

    print('launch step two')
    from pipelines.components._demo.comp import step_two
    src_train, src_test, tgt_train, tgt_test = step_two(data, test_size=test_size, random_state=random_state)

    print('launch step three')
    from pipelines.components._demo.comp import step_three
    model = step_three(src_train, tgt_train)

    print('returned model: {}'.format(model))

if __name__ == '__main__':
    PipelineDecorator.set_default_execution_queue('my default execution queue')
    executing_pipeline()
    print('process completed')

comp.py

from clearml.automation.controller import PipelineDecorator
from clearml import TaskTypes

from pipelines._config import ENV_CONFIG

env_config = ENV_CONFIG

@PipelineDecorator.component(
    return_values=['data'], cache=True, task_type=TaskTypes.data_processing, **env_config)
def step_one(seed: int = 42):
    print('step_one')

    import numpy as np

    np.random.seed(seed)
    return np.random.rand(512, 9)

@PipelineDecorator.component(
    return_values=['src_train', 'src_test', 'tgt_train', 'tgt_test'],
    cache=True,
    task_type=TaskTypes.data_processing,
    **env_config
)
def step_two(data, test_size=0.2, random_state=42):
    print('step_two')

    from sklearn.model_selection import train_test_split

    x = data[..., :8]
    y = data[..., 8]
    src_train, src_test, tgt_train, tgt_test = train_test_split(
        x, y, test_size=test_size, random_state=random_state
    )

    return src_train, src_test, tgt_train, tgt_test

@PipelineDecorator.component(return_values=['model'], cache=True, task_type=TaskTypes.training, **env_config)
def step_three(x_train, y_train):
    print('step_three')

    from sklearn.linear_model import LinearRegression

    model = LinearRegression()
    model.fit(x_train, y_train)
    return model

Steps to reproduce:

  1. Run the experiment from the console or an IDE.
  2. Wait for the experiment to complete (completed status in the web UI).
  3. Click + NEW RUN, or right-click the pipeline and click Run.
  4. Fails with the error above (base_task_id is empty).

Expected behaviour

I would expect the experiment to run the same way the second time, without errors. Note that this is the case when the pipeline and its components are in the same file, but not when they are in separate files.

Environment

jkhenning commented 1 year ago

Hi @grimmJ04, thanks for the detailed description - we're trying to reproduce.

eugen-ajechiloae-clearml commented 1 year ago

Hi @grimmJ04! We were able to replicate this issue. It happens because of internals. Notice the way you need to decorate your steps: PipelineDecorator.component(). component is actually a function that does more than transform your function into a task; it also registers various information with the controller. That information is mostly relevant when running remotely. Because you import your steps here (in the controller, after the controller is created):

@PipelineDecorator.pipeline(name='pipeline demo', project='examples', version='0.1.5', **env_config)
def executing_pipeline(seed=42, random_state=42, test_size=0.2):
    print('pipeline args:', dict(seed=seed, random_state=random_state, test_size=test_size))

    print('launch step one')
    from pipelines.components._demo.comp import step_one

that information is lost :(.

Anyway, the point is: try importing your steps as the first import in the file, such that component() is called before pipeline():

# here
from pipelines.components._demo.comp import step_one, step_two, step_three

from clearml.automation.controller import PipelineDecorator
try:
    from pipelines._config import ENV_CONFIG
except ImportError:
    import sys
    from pathlib import Path
    # other stuff
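
Putting it together, the top of pipeline_demo.py would look something like this (a sketch; it assumes the project root is on sys.path so the top-level import succeeds - otherwise keep the try/except fallback above the component import):

# import the steps FIRST, so component() registers them with the controller
from pipelines.components._demo.comp import step_one, step_two, step_three

from clearml.automation.controller import PipelineDecorator
from pipelines._config import ENV_CONFIG

@PipelineDecorator.pipeline(name='pipeline demo', project='examples', version='0.1.5', **ENV_CONFIG)
def executing_pipeline(seed=42, random_state=42, test_size=0.2):
    print('pipeline args:', dict(seed=seed, random_state=random_state, test_size=test_size))
    data = step_one(seed=seed)
    src_train, src_test, tgt_train, tgt_test = step_two(data, test_size=test_size, random_state=random_state)
    model = step_three(src_train, tgt_train)
    print('returned model: {}'.format(model))

if __name__ == '__main__':
    PipelineDecorator.set_default_execution_queue('my default execution queue')
    executing_pipeline()
    print('process completed')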
grimmJ04 commented 1 year ago

Thank you! That really solved my problem. I'm closing this issue.