Azure / MachineLearningNotebooks

Python notebooks with ML and deep learning examples with Azure Machine Learning Python SDK | Microsoft
https://docs.microsoft.com/azure/machine-learning/service/
MIT License

Register pandas dataframe memory issues #1585

Open byronverz opened 3 years ago

byronverz commented 3 years ago

When using the register_pandas_dataframe() method as suggested by the tip here, I get a system memory error:

Error Code: ScriptExecution.ReadDataFrame.Unexpected 
Failed Step: 1d7c20d4-6b70-4075-8b68-74f51264bf5b 
Error Message: ScriptExecutionException was caused by ReadDataFrameException. Unexpected exception during ReadDataframeFromSocket. 
Failed to read DataFrame from host. Exception of type 'System.OutOfMemoryException' was thrown.

When I run the same method with the same dataframe from a Jupyter notebook, it works as expected. I have ensured there is enough memory on my system when running the script (a 1.42 MiB dataframe with 6 GB of RAM free), so I don't think that's the issue. I know this method is experimental, so for now I am using the older method, which works fine.
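
For reference, here is a minimal sketch of the two paths I mean (the dataset name, paths, and source file are made up; this assumes a workspace config file is present locally):

    import pandas as pd
    from azureml.core import Workspace, Dataset

    ws = Workspace.from_config()
    datastore = ws.get_default_datastore()
    df = pd.read_csv("my_data.csv")  # hypothetical source file

    # Newer, experimental path that raises the error above when run as a script:
    dataset = Dataset.Tabular.register_pandas_dataframe(
        dataframe=df, target=datastore, name="my_dataset"
    )

    # Older path: write the frame to a file, upload it, then register a tabular dataset.
    df.to_csv("my_data_clean.csv", index=False)
    datastore.upload_files(files=["my_data_clean.csv"], target_path="data/", overwrite=True)
    tabular = Dataset.Tabular.from_delimited_files(path=(datastore, "data/my_data_clean.csv"))
    tabular.register(workspace=ws, name="my_dataset")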

Another issue that sometimes occurs (when it is not the system memory error) is a StreamAccess validation error:

azureml.dataprep.api.errorhandlers.ExecutionError: 
Error Code: ScriptExecution.ReadDataFrame.StreamAccess.Validation
Validation Error Code: Invalid
Validation Target: PreppyFile
Failed Step: bfd9a1d5-c01f-485a-8761-99cd0a41d0c3
Error Message: ScriptExecutionException was caused by ReadDataFrameException.
  Failed to read Pandas DataFrame form Python host. Make sure Dataflow is created directly from the source Pandas DataFrame.
    StreamAccessException was caused by ValidationException.
      Trying to read an invalid file. Missing sentinel value in the beginning
| session_id=710cc9c3-4478-4be9-998c-0e4a009800f5

Again, this does not happen when using the method from a Jupyter notebook, only when running a script on my local machine.


kicha0 commented 3 years ago

Hi byronverz,

We would like to reproduce the two issues you hit on our end. Could you share some details about the dataframe involved (for example, the output of df.info()) so we can match it?

For the 2nd issue, with the error "Failed to read Pandas DataFrame form Python host. Make sure Dataflow is created directly from the source Pandas DataFrame. StreamAccessException was caused by ValidationException.":

Is the pandas dataframe by any chance created by concat or related API calls that merge two dataframes into one? One possible cause of this error is a datatype in the data not matching the declared schema in pandas.
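
As an illustration (made-up data, not your frames), concatenating frames whose shared column has different dtypes leaves a single object column holding mixed Python types:

    import numpy as np
    import pandas as pd

    # Col6 is float64 in one frame and plain strings in the other
    a = pd.DataFrame({"Col6": [1.0, 2.0, np.nan]})
    b = pd.DataFrame({"Col6": ["x", "y", "z"]})

    merged = pd.concat([a, b], ignore_index=True)
    print(merged["Col6"].dtype)                     # object
    print(merged["Col6"].map(type).value_counts())  # float and str mixed in one column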

byronverz commented 3 years ago

Since the data is private, here is the info you requested using df.info():

RangeIndex: 18577 entries, 0 to 18576
Data columns (total 10 columns):
Col1      18577 non-null object
Col2      18577 non-null object
Col3      18577 non-null object
Col4      3905 non-null object
Col5      18577 non-null datetime64[ns]
Col6      18577 non-null object
Col7      18577 non-null Int64
Col8      18577 non-null Int64
Col9      18577 non-null object
Col10    18577 non-null object
dtypes: Int64(2), datetime64[ns](1), object(7)
memory usage: 1.5+ MB

For the second issue: yes, I create the dataframe by concatenating multiple subframes read from a list of csv files, and then clean up the result. I use the same method in a Jupyter notebook and it works, but the error arises when I run the process as a script.

kicha0 commented 3 years ago

For the first issue: we need to know the actual Python type stored in each pandas object column - can you run the following and share the output?

1) for c in df.columns: print(type(df[c].iloc[df[c].first_valid_index()]))
2) df.info(memory_usage='deep')
3) import sys; sys.getsizeof(df)
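
Put together (assuming df is the dataframe you are registering), these checks are roughly:

    import sys

    # Python type of the first non-null value in each column
    for c in df.columns:
        print(c, type(df[c].iloc[df[c].first_valid_index()]))

    # Memory footprint including the contents of object columns
    df.info(memory_usage='deep')
    print(sys.getsizeof(df))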

For the second issue - can you

1) Do a df.info() for the dataframe produced from each of the csv files before concat?
2) Show the csv reading code. Do you specify a schema?
3) Please also show the cleanup code.
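
On point 2, one way to pin the schema at read time (a sketch with made-up column and file names, not your code) is to pass dtype and parse_dates to read_csv so every frame enters the concat with the same types:

    import pandas as pd

    df = pd.read_csv(
        "astro_db.csv",                               # hypothetical file
        dtype={"Col1": "string", "Col6": "Int64"},    # declare types up front
        parse_dates=["Col5"],                         # parse the timestamp column
    )

Declaring the types per file keeps the later concat from silently producing object columns that mix Python types.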

byronverz commented 3 years ago

For the first issue:

  1. Sorry I see now it didn't define all the objects as strings.
    Column Col1 type: <class 'str'>
    Column Col2 type: <class 'str'>
    Column Col3 type: <class 'str'>
    Column Col4  type: <class 'str'>
    Column Col5 type: <class 'pandas._libs.tslibs.timestamps.Timestamp'>
    Column Col6 type: <class 'str'>
    Column Col7 type: <class 'numpy.int64'>
    Column Col8 type: <class 'numpy.int64'>
    Column Col9 type: <class 'str'>
    Column Col10 type: <class 'str'>
  2. <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 18577 entries, 0 to 18576
    Data columns (total 10 columns):
    Col1      18577 non-null object
    Col2      18577 non-null object
    Col3      18577 non-null object
    Col4      3905 non-null object
    Col5      18577 non-null datetime64[ns]
    Col6      18577 non-null object
    Col7      18577 non-null Int64
    Col8      18577 non-null Int64
    Col9      18577 non-null object
    Col10     18577 non-null object
    dtypes: Int64(2), datetime64[ns](1), object(7)
    memory usage: 8.5 MB
    None
  3. sys.getsizeof(df) returned 8942497

For the second issue:

  1. I am currently only reading and concatenating 2 csv files. Here's the info for the frame made from the first one:
    RangeIndex: 10000 entries, 0 to 9999
    Data columns (total 6 columns):
    Col1           10000 non-null object <class 'str'>
    Col2           10000 non-null object <class 'str'>
    Col3           10000 non-null object <class 'str'>
    Col4           4356 non-null object <class 'str'>
    Col5           10000 non-null object <class 'str'>
    Col6           4100 non-null float64 <class 'numpy.float64'>
    dtypes: float64(1), object(5)
    memory usage: 3.0 MB

    And the second one:

    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 10000 entries, 0 to 9999
    Data columns (total 7 columns):
    Col1           10000 non-null object <class 'str'>
    Col2           10000 non-null object <class 'str'>
    Col3           9742 non-null object <class 'str'>
    Col4           186 non-null object <class 'str'>
    Col5           9747 non-null object <class 'str'>
    Col6           9370 non-null object <class 'str'>
    Col7           9755 non-null object <class 'str'>
    dtypes: object(7)
    memory usage: 4.2 MB

    Note in the code below where Col6 differs (str vs float64). I use the content of Col6 in the first imported frame to create and populate Col7 prior to concatenation; once the frames are concatenated I then drop Col6 from the entire frame. (A sketch for aligning Col6's dtype before the concat follows after this list.)

  2. def get_frame(database):
         # Assumes the usual imports: import json, numpy as np, pandas as pd
         target_col_list = ['Col1', 'Col2', 'Col3', 'Col4', 'Col5', 'Col6', 'Col7']
         # Load the per-database column name definitions
         with open("db_columns.json") as db_defs:
             db_def_json = json.load(db_defs)
         if type(database) is not list:
             database = [database]
         frame_list = []
         for d in database:
             # Map this database's column names onto the shared target names
             frame_col_map = {b: t for t, b in zip(target_col_list, db_def_json[d])}
             csv_path = '{}.csv'.format(d)
             temp_df = pd.read_csv(csv_path)
             temp_df = temp_df[db_def_json[d]]
             temp_df = temp_df.rename(columns=frame_col_map)
             if d == 'astro_db':
                 # Cast the ID column to nullable Int64 and flag rows that have an ID
                 temp_df.loc[:, 'Col6'] = temp_df.loc[:, 'Col6'].astype(dtype='Int64')
                 id_list = ~temp_df['Col6'].isna()
                 id_list = id_list.replace([True, False], ['HAS_ID', np.nan])
                 temp_df['Col7'] = id_list
             frame_list.append(temp_df)

         frame = pd.concat(frame_list)
         frame = frame.drop(columns=['Col7'])
         frame = frame.reset_index(drop=True)
         return frame
  3. In terms of the cleanup code, there is a fair bit of it, so I will sift through the steps to ensure a type mismatch isn't happening and get back to you. Essentially it looks as follows:
    
     # imports *

     def import_from_sources(df):
         # import operations
         return df

     def cleanup1(df):
         # cleanup operations
         return df

     def cleanup2(df):
         # cleanup operations
         return df
     ...
     def cleanupn(df):
         # cleanup operations
         return df

     def main(workspace):
         df = import_from_sources(sources_list)
         df = cleanup1(df)
         df = cleanup2(df)
         ...
         df = cleanupn(df)
         dataset = Dataset.Tabular.register_pandas_dataframe(dataframe=df)

     if __name__ == '__main__':
         ws = Workspace.from_config()
         main(ws)
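
For what it's worth, one option I could try is normalizing Col6's dtype in every frame before pd.concat, so the merged column never mixes float and str values. A rough sketch with toy data, not my actual code:

    import numpy as np
    import pandas as pd

    def concat_aligned(frame_list):
        # Cast Col6 to one dtype in every frame before concatenating, so the
        # merged column does not end up holding a mix of float and str values.
        for f in frame_list:
            if 'Col6' in f.columns:
                f['Col6'] = f['Col6'].astype('string')
        return pd.concat(frame_list, ignore_index=True)

    # Toy frames mirroring the two csv files described above
    a = pd.DataFrame({'Col6': [1.0, np.nan, 3.0]})
    b = pd.DataFrame({'Col6': ['x', 'y', None]})
    print(concat_aligned([a, b])['Col6'].dtype)  # string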


I hope this is sufficient; if not, please let me know and I'll try my best to provide more details. I didn't know about type mismatching within a dataframe, so I will go through the processing steps to ensure this is not the case.

kicha0 commented 3 years ago

Hi byronverz,

For the first issue: I used the following script to generate a random dataset matching the schema you provided: pd_repro_oom.zip

However, I was not able to reproduce the OOM issue, nor did I observe any abnormal memory usage during execution. As a next step, could you run the repro script and see whether it reproduces the issue on your machine?

For your Jupyter and non-Jupyter environments, could you make sure they are on the same versions of azureml-core and azureml-dataset-runtime? You can run pip install azureml-sdk --upgrade to upgrade all of the related packages.

For the 2nd issue: have you been able to check the datatypes of the frames being concatenated together? Specifically, it might be helpful to run df.info() on each frame in frame_list and verify the pandas data types there.

If there's no mismatch there, I will provide a code snippet to check the actual data content type for each individual record in the merged dataframe.
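
For reference, such a check would look roughly like this (assuming the merged frame is named df):

    # Count the Python types present in each object column of the merged frame;
    # more than one type in a column points to the concat/schema mismatch above.
    for c in df.select_dtypes(include='object').columns:
        print(c, df[c].map(type).value_counts().to_dict())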