allegroai / clearml

ClearML - Auto-Magical CI/CD to streamline your AI workload. Experiment Management, Data Management, Pipeline, Orchestration, Scheduling & Serving in one MLOps/LLMOps solution
https://clear.ml/docs
Apache License 2.0

pipelines stuck on remote #1328

Open · lolpa1n opened 1 month ago

lolpa1n commented 1 month ago

Hello,

I deployed a ClearML server on my machine and wanted to build pipelines.

My code:

from clearml import PipelineController

def read_data(csv_path):
    # import inside the step so the agent can package it as a standalone task
    import pandas as pd

    df = pd.read_csv(csv_path)
    return df

def log_data(df):
    from clearml import Logger

    logger = Logger.current_logger()
    logger.report_table(title='Dataframe FULL', series='pandas DataFrame', iteration=1, table_plot=df)
    print("Logged data successfully")
    return True

if __name__ == '__main__':
    pipe = PipelineController(
        project='ASMBT', 
        name='test_pipe',
        version='1.0',
        add_pipeline_tags=True,
        # working_dir='.'
    )
    pipe.add_parameter(
        name='csv_path',
        description='path to csv file', 
        default='../data/data.csv'
    )
    pipe.add_function_step(
        name='read_data',
        function=read_data,
        function_kwargs=dict(csv_path='${pipeline.csv_path}'),
        function_return=['data_frame'],
        cache_executed_step=False,
        execution_queue='test_gpu'
    )
    pipe.add_function_step(
        name='log_data',
        function=log_data,
        function_kwargs=dict(df='${read_data.data_frame}'),
        cache_executed_step=False,
        execution_queue='test_gpu'
    )
    # pipe.start_locally(run_pipeline_steps_locally=True)
    pipe.start(queue='test_gpu')
    print('pipeline completed')

If I execute pipe.start_locally(run_pipeline_steps_locally=True), everything works. But if I change it to pipe.start(queue='test_gpu') and then run clearml-agent daemon --detached --queue test_gpu --gpus 0,

nothing happens and the status stays green QUEUED. Logs:

Launching the next 1 steps
Launching step [read_data]
2024-09-12 17:16:01
Launching step: read_data
Parameters:
{'kwargs/csv_path': '${pipeline.csv_path}'}
Configurations:
{}
Overrides:
{} 

Please tell me how to do this correctly, e.g. if I want to select a specific GPU for the launch.

ainoam commented 1 month ago

@lolpa1n As you'll see in this example, the queue specified in PipelineController.start is the one through which the controller itself will be executed. The queue through which the pipeline steps will be executed is controlled through PipelineController.set_default_execution_queue.
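
A minimal sketch of that split (the 'services' queue name here is only an assumption; use whatever queue a second, dedicated agent is listening on):

pipe.set_default_execution_queue('test_gpu')  # pipeline steps are enqueued here
pipe.start(queue='services')                  # the controller task itself runs here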

lolpa1n commented 1 month ago

I added pipe.set_default_execution_queue(default_execution_queue='test_gpu'), but nothing happens (screenshot attached).

Maybe this is because I'm running on the same machine?

suparshukov commented 1 month ago

Hello, I have a similar problem. I run pipelines in remote execution mode. The pipeline gets into the queue and the agent starts the container. Sometimes the first stage works but the second stage doesn't start (stuck at "Launching step..."). And sometimes the first stage doesn't launch either: the "Launching step [stage name]" message just hangs without errors. When run_locally() is executed, the pipeline works completely.

ainoam commented 1 month ago

@lolpa1n Sounds like the issue is more that you are using the same agent (you can deploy multiple agents on the same machine): it can't take care of the steps since it's busy handling the pipeline controller, which in turn is waiting for the steps to complete.
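
For example, two agents on the same machine, each serving its own queue (the queue names are assumptions; match them to wherever the controller and the steps are actually enqueued):

clearml-agent daemon --detached --queue services --cpu-only
clearml-agent daemon --detached --queue test_gpu --gpus 0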

@suparshukov Not sure the same applies to your use case? I think you'll need to take a look at the logs inside the container that appears as if it's not doing anything, to better isolate the issue.
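
For example, with the standard Docker CLI (assuming the agent runs its tasks in docker mode):

docker ps                      # find the container the agent started
docker logs -f <container-id>  # follow its output for errors or stalls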

kiranzo commented 3 weeks ago

Experimenting with steps from functions and the draft=True option, I'm getting the same result: the first step of my pipeline just hangs indefinitely. I have 2 ClearML agents, each with its own queue: one is --cpu-only and handles the pipeline controller, and the other uses the GPU and has its queue set as the default.

Also, I'm using a freshly built Docker image that I didn't push into our Artifactory. I set it in the ClearML UI, and during execution it said:

Error response from daemon: pull access denied for clearml_worker_etl_test, repository does not exist or may require 'docker login': denied: requested access to the resource is denied

If it tried to pull the image from the Artifactory, couldn't find it, and then went ahead with the execution anyway, where the heck does it execute the pipeline?

jkhenning commented 3 days ago

Hi @kiranzo, can you include the complete log?