allegroai / clearml

ClearML - Auto-Magical CI/CD to streamline your AI workload. Experiment Management, Data Management, Pipeline, Orchestration, Scheduling & Serving in one MLOps/LLMOps solution
https://clear.ml/docs
Apache License 2.0

Numpy Arrays and Lists in Pandas DataFrames get converted to str when returned from pipeline component #1159

Open jokokojote opened 12 months ago

jokokojote commented 12 months ago

Describe the bug

When returning a pandas DataFrame from a pipeline component, columns of type list or numpy.ndarray change their type to str. This occurs when running the pipeline with PipelineDecorator.run_locally(), but not when using PipelineDecorator.debug_pipeline().

To reproduce

See this minimal example code:

import pandas as pd
from clearml import PipelineDecorator

@PipelineDecorator.component(cache=False, return_values=['df'])
def get_df_with_dummy_vectors():
    import pandas as pd
    import numpy as np

    vectors = np.random.rand(5, 10) # 5x10 example embedding

    data = {
        'id': [1, 2, 3, 4, 5],
        'vector_as_np_array': list(vectors), # storing vectors as numpy arrays
        'vector_as_list': vectors.tolist() # storing vectors as lists
    }

    df = pd.DataFrame(data)

    # check types - looks good
    print(type(df['vector_as_np_array'][0])) 
    print(type(df['vector_as_list'][0]))    

    return df

@PipelineDecorator.pipeline(name="test-pipeline", project="Test")
def test_pipeline():
    df = get_df_with_dummy_vectors()

    # check types again:
    # looks good when running with PipelineDecorator.debug_pipeline(),
    # wrong types when running with PipelineDecorator.run_locally()
    print(type(df['vector_as_np_array'][0]))
    print(type(df['vector_as_list'][0]))

# PipelineDecorator.debug_pipeline()
# test_pipeline()
# # outputs:
# # <class 'numpy.ndarray'>   - correct!
# # <class 'list'>            - correct!
# # <class 'numpy.ndarray'>   - correct!
# # <class 'list'>            - correct!

PipelineDecorator.run_locally()
test_pipeline()
# outputs:
# <class 'numpy.ndarray'>   - correct!
# <class 'list'>            - correct!
# <class 'str'>             - wrong!
# <class 'str'>             - wrong!

Further observations

Interestingly, the problem disappears when an additional return value is added to the pipeline component function and this value is NOT listed in return_values in the decorator:

import pandas as pd
from clearml import PipelineDecorator

@PipelineDecorator.component(cache=False, return_values=['df'])
def get_df_with_dummy_vectors():
    import pandas as pd
    import numpy as np

    vectors = np.random.rand(5, 10) # 5x10 example embedding

    data = {
        'id': [1, 2, 3, 4, 5],
        'vector_as_np_array': list(vectors), # storing vectors as numpy arrays
        'vector_as_list': vectors.tolist() # storing vectors as lists
    }

    df = pd.DataFrame(data)

    # check types - looks good
    print(type(df['vector_as_np_array'][0])) 
    print(type(df['vector_as_list'][0]))    

    return df, 1 # Add some dummy return value here, which is not part of 'return_values'

@PipelineDecorator.pipeline(name="test-pipeline", project="Test")
def test_pipeline():
    df, v = get_df_with_dummy_vectors()

    # check types again:
    # now the types are correct!?
    print(type(df['vector_as_np_array'][0]))
    print(type(df['vector_as_list'][0]))

PipelineDecorator.run_locally()
test_pipeline()
# outputs:
# <class 'numpy.ndarray'>   - correct!
# <class 'list'>            - correct!
# <class 'numpy.ndarray'>   - now correct as well!?
# <class 'list'>            - now correct as well!?

Expected behaviour

Columns of returned dataframes should keep their type between pipeline steps.

Environment

eugen-ajechiloae-clearml commented 12 months ago

Hi @jokokojote ! This happens because we serialize the Pandas DataFrames to CSV, so we lose type info. We will try to find a solution for it
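The type loss can be reproduced without ClearML at all: round-tripping a DataFrame through CSV stringifies list and array columns, since CSV has no notion of nested types. A minimal demonstration:

```python
# Round-tripping a DataFrame through CSV loses nested column types:
# lists and numpy arrays are written as their string repr and read back as str.
import io

import numpy as np
import pandas as pd

vectors = np.random.rand(5, 10)
df = pd.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'vector_as_np_array': list(vectors),
    'vector_as_list': vectors.tolist(),
})

buf = io.StringIO()
df.to_csv(buf, index=False)
buf.seek(0)
df_roundtrip = pd.read_csv(buf)

print(type(df['vector_as_list'][0]))            # <class 'list'>
print(type(df_roundtrip['vector_as_list'][0]))  # <class 'str'>
```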

jokokojote commented 12 months ago

Hi @eugen-ajechiloae-clearml, thank you for your fast reply. I suspected this reason as well, but I was puzzled that it worked as expected when adding a second dummy return value (as explained in the Further observations section). So it seems serialization is not always run, is it?

Btw: Can you think of any disadvantages of using this "hack" in my projects for now, until you tackle this topic?

jokokojote commented 12 months ago

PS: I would suggest adding a note about this to the pipeline docs, because it is not made clear how exactly PipelineDecorator.run_locally behaves differently from PipelineDecorator.debug_pipeline, e.g. with respect to the serialization of data frames.

eugen-ajechiloae-clearml commented 12 months ago

@jokokojote

> So it seems serialization is not always run, is it?

We use different serialization techniques based on the data type. So when you returned a tuple, we no longer used the CSV serialization.

> Btw: Could you imagine any disadvantages of using this "hack" for now in my projects until you tackle this topic?

No, it should not have any disadvantages when it comes to functionality.
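For anyone who would rather not rely on the tuple-return behaviour: a downstream step can parse stringified list columns back, since a Python list's CSV repr is valid literal syntax. This is a sketch under that assumption; restore_list_column is a hypothetical helper, not a ClearML API, and it only covers list columns (a numpy array's space-separated repr would need different parsing):

```python
# Sketch of a repair step for a list column that arrived as its string repr
# after CSV serialization. `restore_list_column` is a hypothetical helper.
import ast

import pandas as pd


def restore_list_column(df: pd.DataFrame, column: str) -> pd.DataFrame:
    """Parse a stringified list column back into Python lists in place."""
    if df[column].dtype == object and isinstance(df[column].iloc[0], str):
        df[column] = df[column].apply(ast.literal_eval)
    return df


# Simulate what the downstream step receives after CSV serialization:
df = pd.DataFrame({'vector_as_list': ['[0.1, 0.2]', '[0.3, 0.4]']})
df = restore_list_column(df, 'vector_as_list')
print(type(df['vector_as_list'][0]))  # <class 'list'>
```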