Open maciejskorski opened 2 years ago
I'm encountering the same issue :((( I registered multiple Parquet files as a Dataset in the workspace. When they are loaded as a dataframe using to_pandas_dataframe(), the values are displaced.
@Li0425 do you also have an example to reproduce?
@maciejskorski I attempted to run your code with error_data.csv in AML studio, but the issue cannot be reproduced (the two dataframes being compared are the same). The version of azureml.core that I'm using is 1.21.0.
I created a dataset using AML's to_pandas_dataframe() in Oct 2021 - this dataset has displaced values. When I attempted again today, the output was actually correct. Maybe updating the version of the SDK that you are using could resolve the issue.
@Li0425 then you have a different config, and an older one than mine (1.21 vs 1.36). You could try to reproduce in a tailored virtual env within a compute instance. Or share your own example along with a precise description of your azure libraries?
Thanks @Li0425 and @maciejskorski. This is a known issue in the arrow-rs crate we depend on for Parquet reading. The good news is that it was recently fixed, and we have integrated the fix in azureml-dataprep==3.0.0. You can proceed with upgrading this package in your current environment, but until azureml-core==1.40.0 is released, a warning about incompatibility will be printed. As long as your azureml-core version is upgraded to 1.39.*, the warning can be safely ignored. An azureml-core release with a compatible range of versions for azureml-dataprep==3.0.0 should be out next week.
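For anyone unsure which versions their environment actually has, here is a minimal sketch that prints the installed azureml-core and azureml-dataprep versions and checks the latter against the 3.0.0 threshold mentioned above. It uses only the standard library; the helper names (version_tuple, has_parquet_fix, installed_version) are made up for illustration and are not part of any Azure ML API, and the version parsing is deliberately naive.

```python
from importlib.metadata import PackageNotFoundError, version
from itertools import takewhile


def version_tuple(text):
    """Turn '3.0.0' into (3, 0, 0), keeping only the leading digits of each part."""
    parts = []
    for piece in text.split("."):
        digits = "".join(takewhile(str.isdigit, piece))
        if not digits:
            break  # stop at the first part that does not start with a digit
        parts.append(int(digits))
    return tuple(parts)


def has_parquet_fix(dataprep_version, fixed_in="3.0.0"):
    """True if the given azureml-dataprep version is at least `fixed_in`."""
    return version_tuple(dataprep_version) >= version_tuple(fixed_in)


def installed_version(package):
    """Return the installed version string, or None if the package is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    for name in ("azureml-core", "azureml-dataprep"):
        print(name, installed_version(name))
    dataprep = installed_version("azureml-dataprep")
    if dataprep is not None:
        print("includes the Parquet fix:", has_parquet_fix(dataprep))
```

Running this inside the same kernel or virtual env that calls to_pandas_dataframe() avoids checking the wrong environment on a compute instance.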
@anliakho2 good news indeed. May I ask you to provide us with a reference to the technical discussion around the root cause of this bug, e.g. pointers to the arrow-rs docs and issues? In case of doubt, I think it would be good to be able to see or run more precise tests before adopting or claiming the fix.
Setup
Python=3.8 + azureml-core=1.36.0 + azureml-dataprep=2.26.0 + pyarrow=7.0.0 + pandas=1.4
Summary
to_pandas_dataframe wrongly reads certain Parquet datasets. Data of some columns appears to be internally shuffled. This was already reported but closed without a fix, due to issues with sharing data publicly. I share the reproducible example below.
How to reproduce
error_data.csv
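
To make "internally shuffled" checkable, one possible sketch is to compare a reference copy of the data with what to_pandas_dataframe() returned, column by column, and flag columns that hold the same values in a different row order. This is not the original reproduction script; find_shuffled_columns is a hypothetical helper, it works on plain dicts of lists (with pandas, df.to_dict("list") gives that shape), and it assumes hashable cell values without NaN.

```python
from collections import Counter


def find_shuffled_columns(expected, actual):
    """Return names of columns whose values match as a multiset but not row by row.

    `expected` and `actual` map column names to lists of cell values.
    """
    shuffled = []
    for name, want in expected.items():
        got = actual.get(name)
        if got is None or list(got) == list(want):
            continue  # column missing or identical: not a shuffle
        if Counter(got) == Counter(want):
            shuffled.append(name)  # same values, different row order
    return shuffled


# Toy data standing in for the reference frame and the frame read back.
reference = {"id": [1, 2, 3], "value": ["a", "b", "c"]}
read_back = {"id": [1, 2, 3], "value": ["b", "c", "a"]}
print(find_shuffled_columns(reference, read_back))
```

Columns that differ in their actual values (not just their order) are deliberately not reported, so an empty result does not by itself mean the two frames are equal.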