allegroai / clearml

ClearML - Auto-Magical CI/CD to streamline your AI workload. Experiment Management, Data Management, Pipeline, Orchestration, Scheduling & Serving in one MLOps/LLMOps solution
https://clear.ml/docs
Apache License 2.0
5.7k stars, 657 forks

Can't pass data from one pipeline function step to another - "Could not retrieve a local copy of artifact" #1346

Closed kiranzo closed 2 weeks ago

kiranzo commented 2 weeks ago

Describe the bug

I am building my own pipeline from this example. The first step should retrieve a list of all the valid data ids, and the second step should normalize that data. However, I cannot pass the list directly between the steps. I get ValueError: Could not retrieve a local copy of artifact uids, failed downloading http://server-address/.../uids.pkl, along with this message in the debug console: clearml.storage - ERROR - Could not download http://server-address/.../artifacts/uids/uids.pkl , err: [Errno 13] Permission denied: '/path/to/clearml/.clearml/cache/storage_manager/global/0a4e72a05fa23ad670cfa0593c461777.uids.pkl_1730990556.4492965.partially'

This is the clearml-generated Task code that throws this error:

    for k, v in params.items():
        if not v or not k.startswith('kwargs_artifacts/'):
            continue
        k = k.replace('kwargs_artifacts/', '', 1)
        task_id, artifact_name = v.split('.', 1)
        parent_task = Task.get_task(task_id=task_id)
        if artifact_name in parent_task.artifacts:
            kwargs[k] = parent_task.artifacts[artifact_name].get(deserialization_function=None)

I tried to execute parent_task.artifacts[artifact_name].get(deserialization_function=None) in the debug console and got the same error. However, I can open the link http://server-address/.../artifacts/uids/uids.pkl in a browser and download the file manually just fine. I can also see the pickled list among the artifacts of the 1st step in the Web UI.
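
The URL opening in a browser while .get() fails is consistent with the [Errno 13] in the log: the failing operation is a local file write, not the HTTP request. The sketch below illustrates this under an assumption that is suggested by the ".partially" suffix in the error path but not confirmed from the ClearML source: a download is first written to a temporary file in the cache directory and then renamed. save_to_cache is a hypothetical helper for illustration, not part of the ClearML API.

```python
import os
import tempfile
import time
from pathlib import Path


def save_to_cache(data: bytes, cache_dir: Path, name: str) -> Path:
    """Hypothetical helper: write to '<name>_<timestamp>.partially', then rename."""
    final_path = cache_dir / name
    partial_path = cache_dir / f"{name}_{time.time()}.partially"
    # open() raises PermissionError (Errno 13) here if cache_dir is not
    # writable, even when the bytes were fetched from the server successfully.
    with open(partial_path, "wb") as f:
        f.write(data)
    # Move the finished file into place under its final name.
    os.replace(partial_path, final_path)
    return final_path


if __name__ == "__main__":
    # Demonstrate the success path in a throwaway directory.
    with tempfile.TemporaryDirectory() as tmp:
        saved = save_to_cache(b"payload", Path(tmp), "uids.pkl")
        print(saved.name, saved.read_bytes())
```

If the cache directory in this sketch were read-only for the current user, the open() call would fail before any rename, which would also explain why no file with 'uids' in its name ever appears in the cache.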

To reproduce

1st step:

def get_protocol_data(
    # insert params here
):
    # insert logic here
    return list(uids)

2nd step:

from typing import List

def normalize(
    # insert other params here
    uids: List[str]
):
    success = True
    # use uids for filtering data here
    return success

pipeline code:

if __name__ == "__main__":
    pipe = PipelineController(
        project=PROJECT_NAME,
        name=PIPELINE_NAME,
        version=VERSION,
        add_pipeline_tags=True
    )
    pipe.set_default_execution_queue(STEP_QUEUE)
    config = pipe.connect_configuration(
        configuration="configs/pipeline_config.yaml", name="Config"
    )

    pipe.add_function_step(
        name="get_protocol_data",
        function=get_protocol_data,
        function_kwargs={
            # insert kwargs
        },
        function_return=["uids"],
        cache_executed_step=False,
        repo=REPOSITORY,
        repo_branch=BRANCH,
        # project_name=PROJECT_STEPS,
        pre_execute_callback=step_created_callback,
    )

    pipe.add_function_step(
        name="normalize",
        function=normalize,
        function_kwargs={
            # insert other kwargs
            "uids": "${get_protocol_data.uids}"
        },
        function_return=["success"],
        cache_executed_step=False,
        repo=REPOSITORY,
        repo_branch=BRANCH,
        # project_name=PROJECT_STEPS,
        pre_execute_callback=step_created_callback,
    )

    pipe.start_locally(run_pipeline_steps_locally=True)

I also tried changing the add_function_step function_kwargs to "uids": dict(uids="${get_protocol_data.uids}"), as in your example, and returning uids as a pandas DataFrame instead of a list, but got the same error.

Expected behaviour

Passing output of one step as input of another without any errors.

Environment

eugen-ajechiloae-clearml commented 2 weeks ago

Hi @kiranzo ! This looks like an OS error related to writing to /path/to/clearml/.clearml/cache/storage_manager/global/0a4e72a05fa23ad670cfa0593c461777.uids.pkl_1730990556.4492965.partially. It may be that the cache directory is read-only or something similar. You could try the following code to check if you can write to that file:

from pathlib import Path
path = Path("/path/to/clearml/.clearml/cache/storage_manager/global/0a4e72a05fa23ad670cfa0593c461777.uids.pkl_1730990556.4492965.partially")
path.parent.mkdir(parents=True, exist_ok=True)
with open(path.as_posix(), "wb") as f:
    f.write(b"hello world")
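
If the write above fails, it can help to see which directory is actually refusing it. The following is a minimal sketch using only the standard library; nearest_existing_dir and describe are hypothetical helper names, and the path in the usage section assumes the default ~/.clearml/cache/storage_manager/global location mentioned in this thread.

```python
import os
import stat
from pathlib import Path


def nearest_existing_dir(target: Path) -> Path:
    """Return the closest ancestor directory of `target` that already exists."""
    current = target.parent
    while not current.exists():
        current = current.parent  # the filesystem root always exists
    return current


def describe(directory: Path) -> str:
    """Summarize mode bits, owner uid, and writability for the current user."""
    st = directory.stat()
    # Creating a file in a directory needs both write and search permission.
    writable = os.access(directory, os.W_OK | os.X_OK)
    return (
        f"{directory} mode={stat.filemode(st.st_mode)} "
        f"uid={st.st_uid} writable={writable}"
    )


if __name__ == "__main__":
    # Assumed default cache location; adjust to the path from the error message.
    probe = Path.home() / ".clearml/cache/storage_manager/global/probe.partially"
    print(describe(nearest_existing_dir(probe)))
```

A result with writable=False, or an owner uid that differs from the user running the pipeline, points to a permissions or ownership problem on that directory rather than to anything on the server.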

kiranzo commented 2 weeks ago

@eugen-ajechiloae-clearml thanks for the answer, but now I'm confused about which host the pickled files were supposed to go to when I'm running the pipeline locally. I have ~/.clearml/cache/storage_manager/global on this computer, but there aren't any files with any mention of 'uids' in them, and nothing ends with 'partially', it's mostly 'numbers-letters.processed_data.pkl' and 'numbers-letters.data_frame.csv.gz' there. This computer also has clearml agent running on it, so those might be artifacts from the other pipelines and tasks we were launching via agent.

I connected to the clearml server host, but it doesn't have ~/.clearml directory at all.

Anyway, I tried to run your code on the host I'm testing my pipeline on, but it raised PermissionError: [Errno 13] Permission denied: '/path/to/clearml/.clearml/cache/storage_manager/global/0a4e72a05fa23ad670cfa0593c461777.uids.pkl_1730994367.1966016.partially'

UPD: I tried a workaround: registering the list of uids as an artifact in the first step, passing its task id to the second step, and retrieving it via the clearml.Task API, but I just got the same error. I'm running out of options here 😱

eugen-ajechiloae-clearml commented 2 weeks ago

@kiranzo pickled files are pulled to ~/.clearml/cache/storage_manager/global when running locally. So the problem is that, for some reason, you can't write files to /path/to/clearml/.clearml/cache/storage_manager/global. You might want to run something like sudo chmod +w /path/to/clearml/.clearml/cache/storage_manager/global to change the permissions.
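
For reference, a rough Python counterpart of that permission change could look like the sketch below. It is not identical to chmod +w: it adds only the owner's write bit, and add_owner_write is a hypothetical helper name. It also assumes the current user owns the directory; if another account owns it (for example, one used by an agent on the same machine), the call needs elevated privileges, or the ownership may need changing instead.

```python
import os
import stat
import tempfile
from pathlib import Path


def add_owner_write(directory: Path) -> int:
    """Add the owner write bit to `directory` and return the new permission bits."""
    current = stat.S_IMODE(directory.stat().st_mode)
    # Keep the existing bits and OR in write permission for the owner.
    os.chmod(directory, current | stat.S_IWUSR)
    return stat.S_IMODE(directory.stat().st_mode)


if __name__ == "__main__":
    # Demonstrate on a throwaway read-only directory, not the real cache.
    with tempfile.TemporaryDirectory() as tmp:
        demo = Path(tmp) / "global"
        demo.mkdir()
        os.chmod(demo, 0o555)
        print(oct(add_owner_write(demo)))
```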

kiranzo commented 2 weeks ago

@eugen-ajechiloae-clearml is it normal that I can't see any pickled files with the word 'uids' in their names, though?

kiranzo commented 2 weeks ago

@eugen-ajechiloae-clearml It works with write permissions, thanks for your help!