Azure / MachineLearningNotebooks

Python notebooks with ML and deep learning examples with Azure Machine Learning Python SDK | Microsoft
https://docs.microsoft.com/azure/machine-learning/service/
MIT License

Azure wrongly reads Parquet #1713

Open maciejskorski opened 2 years ago

maciejskorski commented 2 years ago

Setup

Python=3.8 + azureml-core=1.36.0 + azureml-dataprep=2.26.0 + pyarrow=7.0.0 + pandas=1.4

Summary

to_pandas_dataframe reads certain Parquet datasets incorrectly: the data in some columns appears to be internally shuffled. This was already reported, but that issue was closed without a fix because the data could not be shared publicly. I share a reproducible example below.

How to reproduce

from azureml.core import Workspace, Dataset
import tempfile
import pandas as pd

# prepare data: list of sha-values with some None values
df = pd.read_csv('error_data.csv')

# configure Azure storage
ws = Workspace.from_config()
dstore = ws.datastores.get('your datastore')
dstore_path = 'relative datastore path'
target = (dstore,dstore_path)

# write to Azure storage
with tempfile.TemporaryDirectory() as tmpdir:
    df.to_parquet(f'{tmpdir}/df.parquet')
    ds=Dataset.File.upload_directory(tmpdir,target,overwrite=True)

# read by two ways: download and open in pandas or use the Azure connector
with tempfile.TemporaryDirectory() as tmpdir:
    ds=Dataset.File.from_files(target)
    ds.download(tmpdir)
    df1 = pd.read_parquet(tmpdir)
    ds = Dataset.Tabular.from_parquet_files(target)
    df2 = ds.to_pandas_dataframe()

# comparison fails, the data seems displaced :-(
pd.testing.assert_frame_equal(df1,df2)

error_data.csv
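When assert_frame_equal fails, it can help to know whether a column was merely reordered or actually altered. The following is a minimal sketch for that, not part of the original report: the helper name describe_displacement is hypothetical, it works on plain column-to-list mappings, and it assumes hashable cell values (such as the sha strings here).

```python
import math
from collections import Counter


def _is_missing(value):
    """True for None and float NaN, the two ways a missing value may show up."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def _key(value):
    """Hashable key that puts None and NaN into one 'missing' bucket."""
    return ("missing",) if _is_missing(value) else ("value", value)


def describe_displacement(expected, actual):
    """Hypothetical helper: for each column, count mismatching rows and
    report whether the values are only reordered (same multiset)."""
    report = {}
    for name, exp_col in expected.items():
        act_col = actual[name]
        # Row-by-row comparison; two missing values count as equal.
        mismatches = sum(
            1 for a, b in zip(exp_col, act_col) if _key(a) != _key(b)
        )
        # Rows present on only one side also count as mismatches.
        mismatches += abs(len(exp_col) - len(act_col))
        reordered_only = (
            Counter(map(_key, exp_col)) == Counter(map(_key, act_col))
        )
        report[name] = {
            "mismatches": mismatches,
            "reordered_only": reordered_only,
        }
    return report
```

With the dataframes from the example above, it could be called as describe_displacement(df1.to_dict('list'), df2.to_dict('list')). A column with mismatches but reordered_only set to True would be consistent with values being displaced rather than corrupted.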

Li0425 commented 2 years ago

I'm encountering the same issue :((( I registered multiple Parquet files as a Dataset in the workspace. When they are loaded as a dataframe using to_pandas_dataframe(), the values are displaced.

maciejskorski commented 2 years ago

@Li0425 do you also have an example to reproduce?

Li0425 commented 2 years ago

@maciejskorski I attempted to run your code with error_data.csv in AML studio, but I could not reproduce the issue (the two dataframes being compared are the same). The version of azureml.core that I'm using is 1.21.0.

I created a dataset using AML's to_pandas_dataframe() in Oct 2021, and that dataset has displaced values. When I tried again today, the output was correct. Maybe changing the version of the SDK that you are using could resolve the issue.

maciejskorski commented 2 years ago

@Li0425 then you have a different configuration, and an older one than mine (1.21 vs 1.36). You could try to reproduce the issue in a tailored virtual env within a compute instance, or share your own example along with a precise description of your Azure library versions.

anliakho2 commented 2 years ago

Thanks @Li0425 and @maciejskorski. This is a known issue in the arrow-rs crate that we depend on for Parquet reading. The good news is that it was recently fixed, and we have integrated the fix in azureml-dataprep==3.0.0. You can go ahead and upgrade this package in your current environment, but until azureml-core==1.40.0 is released, a warning about incompatibility will be printed. As long as your azureml-core version is upgraded to 1.39.*, the warning can be safely ignored. An azureml-core release with a compatible version range for azureml-dataprep==3.0.0 should be out next week.
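To check whether a given environment already has the azureml-dataprep version named above, something like the following could be used. This is only a sketch: the 3.0.0 threshold comes from the comment above, the function names are hypothetical, and the version parsing is deliberately simplistic (it reads only the leading numeric components and ignores pre-release suffixes).

```python
from importlib.metadata import PackageNotFoundError, version


def release_tuple(version_string):
    """Leading numeric components of a version string, padded to length 3.

    Example: '2.26.0' -> (2, 26, 0). Simplification: parsing stops at the
    first non-numeric component, so suffixes such as 'rc1' are ignored.
    """
    parts = []
    for piece in version_string.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    parts += [0] * (3 - len(parts))
    return tuple(parts)


def has_parquet_fix(dataprep_version):
    """Hypothetical check: True if the version is at least 3.0.0, the
    release said in this thread to contain the arrow-rs fix."""
    return release_tuple(dataprep_version) >= (3, 0, 0)


def installed_dataprep_has_fix():
    """Apply the check to the installed azureml-dataprep, if there is one.

    Returns None when the package is not installed.
    """
    try:
        return has_parquet_fix(version("azureml-dataprep"))
    except PackageNotFoundError:
        return None
```

For the setup reported at the top of this issue (azureml-dataprep=2.26.0), has_parquet_fix("2.26.0") is False, so that environment would still need the upgrade.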

maciejskorski commented 2 years ago

@anliakho2 good news indeed. Could you point us to the technical discussion of the root cause of this bug, such as the relevant arrow-rs docs and issues? In case of doubt, I think it would be good to see or run more precise tests before adopting the fix or claiming the bug is resolved.