byronverz opened this issue 3 years ago
Hi byronverz,
We would like to reproduce, on our end, the two issues you encountered. Two options here:
For the 2nd issue with the following error: Failed to read Pandas DataFrame from Python host. Make sure Dataflow is created directly from the source Pandas DataFrame. StreamAccessException was caused by ValidationException.
Is the pandas DataFrame by any chance created by concat or related API calls that merge two DataFrames into one? One possible reason for this error is that a datatype in the data does not match the declared schema in pandas.
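To illustrate the failure mode being described here (a minimal sketch with made-up data, not the reporter's actual frames), concatenating two frames whose shared column carries different dtypes silently widens the result to object, which can then fail downstream schema validation:

```python
import pandas as pd

# Hypothetical example: the same column name with two different dtypes.
a = pd.DataFrame({"Col6": [1.0, 2.0]})    # float64 column
b = pd.DataFrame({"Col6": ["x", "y"]})    # str values -> object column

merged = pd.concat([a, b], ignore_index=True)
print(merged["Col6"].dtype)  # object -- no longer float64
```

The concat succeeds without warning; the mismatch only surfaces later, when something validates the declared schema against the actual cell contents.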
Since the data is private, here is the info you requested using df.info():
RangeIndex: 18577 entries, 0 to 18576
Data columns (total 10 columns):
Col1 18577 non-null object
Col2 18577 non-null object
Col3 18577 non-null object
Col4 3905 non-null object
Col5 18577 non-null datetime64[ns]
Col6 18577 non-null object
Col7 18577 non-null Int64
Col8 18577 non-null Int64
Col9 18577 non-null object
Col10 18577 non-null object
dtypes: Int64(2), datetime64[ns](1), object(7)
memory usage: 1.5+ MB
For the second issue: yes, I create the dataframe from a list of CSV files, reading each into a subframe which I then concatenate and clean up. I use the same method in a Jupyter notebook and it works, but the error arises when I run the process as a script.
For the first issue: we need to know the actual Python type behind each pandas object column. Can you run the following and show the output?
1) for c in df.columns: print(type(df[c].iloc[df[c].first_valid_index()]))
2) df.info(memory_usage='deep')
3) import sys; sys.getsizeof(df)
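For convenience, the three checks above can be wrapped in one helper. This is a sketch (the function name is ours); note that first_valid_index() returns a label, so .loc is used here, which is equivalent to .iloc on a default RangeIndex:

```python
import sys
import pandas as pd

def diagnose(df):
    """Report the concrete Python type behind each column, plus memory usage."""
    types = {}
    for c in df.columns:
        idx = df[c].first_valid_index()  # label of the first non-null value
        types[c] = type(df[c].loc[idx]).__name__ if idx is not None else None
    df.info(memory_usage='deep')         # 'deep' counts the object-column strings
    return types, sys.getsizeof(df)

types, size = diagnose(pd.DataFrame({"a": ["x", None], "b": [1, 2]}))
print(types)  # {'a': 'str', 'b': 'int64'}
```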
For the second issue, can you:
1) Run df.info() on the dataframe produced from each CSV file, before the concat?
2) Show the CSV reading code. Do you specify a schema?
3) Show the cleanup code as well.
For the first issue:
Column Col1 type: <class 'str'>
Column Col2 type: <class 'str'>
Column Col3 type: <class 'str'>
Column Col4 type: <class 'str'>
Column Col5 type: <class 'pandas._libs.tslibs.timestamps.Timestamp'>
Column Col6 type: <class 'str'>
Column Col7 type: <class 'numpy.int64'>
Column Col8 type: <class 'numpy.int64'>
Column Col9 type: <class 'str'>
Column Col10 type: <class 'str'>
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18577 entries, 0 to 18576
Data columns (total 10 columns):
Col1 18577 non-null object
Col2 18577 non-null object
Col3 18577 non-null object
Col4 3905 non-null object
Col5 18577 non-null datetime64[ns]
Col6 18577 non-null object
Col7 18577 non-null Int64
Col8 18577 non-null Int64
Col9 18577 non-null object
Col10 18577 non-null object
dtypes: Int64(2), datetime64[ns](1), object(7)
memory usage: 8.5 MB
None
sys.getsizeof(df) >>> Out[2]: 8942497
For the second issue:
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 6 columns):
Col1 10000 non-null object <class 'str'>
Col2 10000 non-null object <class 'str'>
Col3 10000 non-null object <class 'str'>
Col4 4356 non-null object <class 'str'>
Col5 10000 non-null object <class 'str'>
Col6 4100 non-null float64 <class 'numpy.float64'>
dtypes: float64(1), object(5)
memory usage: 3.0 MB
And the second one:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 7 columns):
Col1 10000 non-null object <class 'str'>
Col2 10000 non-null object <class 'str'>
Col3 9742 non-null object <class 'str'>
Col4 186 non-null object <class 'str'>
Col5 9747 non-null object <class 'str'>
Col6 9370 non-null object <class 'str'>
Col7 9755 non-null object <class 'str'>
dtypes: object(7)
memory usage: 4.2 MB
Note in the code below that Col6 differs between the two frames (str vs. float64). I use the content of Col6 in the first imported frame to populate Col7 prior to concatenation; once the frames are concatenated, I then drop Col6 from the entire frame.
import json

import numpy as np
import pandas as pd


def get_frame(database):
    target_col_list = ['Col1', 'Col2', 'Col3', 'Col4', 'Col5', 'Col6', 'Col7']
    with open("db_columns.json") as db_defs:
        db_def_json = json.load(db_defs)
    if type(database) is not list:
        database = [database]
    frame_list = []
    for d in database:
        # Map each database's native column names onto the target names
        frame_col_map = {b: t for t, b in zip(target_col_list, db_def_json[d])}
        csv_path = '{}.csv'.format(d)
        temp_df = pd.read_csv(csv_path)
        temp_df = temp_df[db_def_json[d]]
        temp_df = temp_df.rename(columns=frame_col_map)
        if d == 'astro_db':
            temp_df.loc[:, 'Col6'] = temp_df.loc[:, 'Col6'].astype(dtype='Int64')
            id_list = ~temp_df['Col6'].isna()
            id_list = id_list.replace([True, False], ['HAS_ID', np.nan])
            temp_df['Col7'] = id_list
        frame_list.append(temp_df)
    frame = pd.concat(frame_list)
    frame = frame.drop(columns=['Col7'])
    frame = frame.reset_index(drop=True)
    return frame
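Before the pd.concat above, it may help to confirm that every frame in frame_list agrees on column dtypes. A sketch of such a check (the function name and toy data are ours, not part of the original script):

```python
import pandas as pd

def dtype_mismatches(frame_list):
    """Return {column: [dtypes seen]} for columns whose dtype varies across frames."""
    seen = {}
    for df in frame_list:
        for col, dt in df.dtypes.items():
            seen.setdefault(col, set()).add(str(dt))
    return {col: sorted(dts) for col, dts in seen.items() if len(dts) > 1}

# Toy example with a deliberately inconsistent Col6:
frames = [
    pd.DataFrame({"Col1": ["a"], "Col6": [1.5]}),    # Col6 is float64
    pd.DataFrame({"Col1": ["b"], "Col6": ["id1"]}),  # Col6 is object
]
print(dtype_mismatches(frames))  # {'Col6': ['float64', 'object']}
```

An empty result means the frames are dtype-consistent and the concat will not widen any column to object.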
imports *

def import_from_sources(df):
    import operations
    return df

def cleanup1(df):
    cleanup operations
    return df

def cleanup2(df):
    cleanup operations
    return df
...
def cleanupn(df):
    cleanup operations
    return df

def main(workspace):
    df = import_from_sources(sources_list)
    df = cleanup1(df)
    df = cleanup2(df)
    ...
    df = cleanupn(df)
    dataset = Dataset.Tabular.register_pandas_dataframe(dataframe=df)

if __name__ == '__main__':
    ws = Workspace.from_config()
    main(ws)
I hope this is sufficient; if not, please let me know and I'll try my best to provide more details. I didn't know about type mismatching within a dataframe, so I will go through the processing steps to ensure this is not the case.
Hi byronverz,
For the first issue: I used the following script to generate a random dataset matching the schema you provided: pd_repro_oom.zip
However, I was not able to reproduce the OOM issue, nor did I observe any abnormality in memory usage during execution. As a next step, could you run the repro script and see whether it reproduces your issue?
For your Jupyter and non-Jupyter environments, could you make sure they are on the same versions of azureml-core and azureml-dataset-runtime? You may run pip install azureml-sdk --upgrade to upgrade all of the related packages.
For the 2nd issue: have you been able to check the datatypes of the frames being concatenated? Specifically, it might be helpful to run df.info() on each frame in frame_list and verify the pandas data types there.
If there's no mismatch there, I will provide a code snippet to check the actual data content type for each individual record in the merged dataframe.
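In the meantime, a record-level check along those lines might look like this (a sketch under the assumption that the merged dataframe fits in memory; the function name is ours):

```python
import pandas as pd

def cell_type_report(df):
    """For each column, list the distinct Python types among the non-null values."""
    return {
        col: sorted({type(v).__name__ for v in df[col].dropna()})
        for col in df.columns
    }

# A column holding both str and float values would show up immediately:
mixed = pd.DataFrame({"Col6": ["id1", 2.5, None]})
print(cell_type_report(mixed))  # {'Col6': ['float', 'str']}
```

Any column that reports more than one type is a candidate for the validation failure, even if df.dtypes shows a single (object) dtype for it.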
When using the register_pandas_dataframe() method as suggested by the tip here, I get a system memory error. When I run the same method with the same dataframe from a Jupyter notebook, it works as expected. I have ensured I have enough memory on my system when running the script (a 1.42 MiB dataframe and 6 GB of RAM free), so I don't think that's the issue. I know this method is experimental, so I am using the older method and it's working fine.
Another issue that sometimes happens (if it's not the system memory issue) is a StreamAccessValidation error. Again, this does not happen when using the method from a Jupyter notebook, only when running a script on my local machine.