apache / arrow

Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing
https://arrow.apache.org/
Apache License 2.0

[Python] Data of struct fields are out-of-order in parquet files created by the write_table() method #27241

Open asfimport opened 3 years ago

asfimport commented 3 years ago

Hi,

We recently found an out-of-order issue with the 'struct' data type and would like to know if you can help find the root cause.


import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.read_csv('./test_struct.csv')
print(df.dtypes)
# Combine the two flat columns into one dict per row (becomes the struct column)
df['full_name'] = df.apply(lambda x: {"package": x['file_package'], "name": x["file_name"]}, axis=1)
my_df = df.drop(['file_package', 'file_name'], axis=1)

# Explicit schema: a struct of two strings plus a plain string column
file_fields = [('package', pa.string()), ('name', pa.string())]
my_schema = pa.schema([pa.field('full_name', pa.struct(file_fields)),
                       pa.field('fruit_name', pa.string())])
my_table = pa.Table.from_pandas(my_df, schema=my_schema)
print('Table schema:')
print(my_table.schema)

pq.write_table(my_table, './test_struct_200.parquet')

The above code (attached as test_struct_200.py) was run with the following Python package versions:


Pandas Version = 1.1.3
PyArrow Version = 2.0.0

I then used parquet-tools (1.11.1) to read the file, but got the following output:


$ java -jar parquet-tools-1.11.1.jar head -n 2181 test_struct_200.parquet
...
full_name:
.package = fruit.zip
.name = apple.csv
fruit_name = strawberry

full_name:
.package = fruit.zip
.name = apple.csv
fruit_name = strawberry

full_name:
.package = fruit.zip
.name = apple.csv
fruit_name = strawberry

(BTW, you can also view the parquet file with http://parquet-viewer-online.com/)

The output is supposed to be (refer to test_struct.csv):


$ java -jar parquet-tools-1.11.1.jar head -n 2181 test_struct_200.parquet
...
full_name:
.package = fruit.zip
.name = strawberry.csv
fruit_name = strawberry

full_name:
.package = fruit.zip
.name = strawberry.csv
fruit_name = strawberry

full_name:
.package = fruit.zip
.name = strawberry.csv
fruit_name = strawberry

As a comparison, the following code (attached as test_struct_200_flat.py) generates a parquet file whose data matches test_struct.csv:


import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.read_csv('./test_struct.csv')
print(df.dtypes)
my_schema = pa.schema([pa.field('file_package', pa.string()),
                       pa.field('file_name', pa.string()),
                       pa.field('fruit_name', pa.string())])
my_table = pa.Table.from_pandas(df, schema=my_schema)
print('Table schema:')
print(my_table.schema)

pq.write_table(my_table, './test_struct_200_flat.parquet')

I have also attached the two parquet files for your reference.

Reporter: Chen Ming

Original Issue Attachments:

Note: This issue was originally created as ARROW-11344. Please see the migration documentation for further details.

asfimport commented 3 years ago

Weston Pace / @westonpace: Thank you for creating such a detailed test case.  I have run your test against pyarrow 2.0.0 and I can confirm I get the same results that you do.  Luckily, when I ran your test against the latest code I did not see this error and I confirmed that the full_name.name column aligned with the fruit_name column.  We have recently fixed issues related to structs such as ARROW-10493 and my assumption is that you encountered one of those.
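For anyone who wants to repeat the alignment check described above, here is a minimal sketch. It assumes, based only on the sample rows shown in the report, that full_name.name should equal fruit_name plus ".csv"; the helper name misaligned_rows and the sample rows are hypothetical and are not taken from the attached files.

```python
def misaligned_rows(rows, suffix=".csv"):
    """Return the indices of rows whose struct 'name' does not match 'fruit_name'.

    Assumption (from the sample output in the report): name == fruit_name + suffix.
    """
    bad = []
    for i, row in enumerate(rows):
        expected = f"{row['fruit_name']}{suffix}"
        if row["full_name"]["name"] != expected:
            bad.append(i)
    return bad


# Hypothetical rows shaped like the parquet-tools output above.
rows = [
    {"full_name": {"package": "fruit.zip", "name": "strawberry.csv"}, "fruit_name": "strawberry"},
    {"full_name": {"package": "fruit.zip", "name": "apple.csv"}, "fruit_name": "strawberry"},
]
print(misaligned_rows(rows))  # prints [1]
```

With a recent PyArrow, rows in this shape can probably be obtained with pq.read_table('./test_struct_200.parquet').to_pylist(); an empty result would indicate that the struct column lines up with fruit_name.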

We are on the verge of releasing 3.0.0.  There is an RC available at (https://bintray.com/apache/arrow/python-rc/3.0.0-rc2#files/python-rc/3.0.0-rc2) if you would like to test this behavior out yourself sooner.

asfimport commented 3 years ago

Chen Ming: @westonpace  Thank you for the information. I am very happy to see that 3.0.0 was released to PyPI this morning. From my quick test with the example data, the issue is fixed in PyArrow 3.0.0.

We want to do more testing (with our production data), so I would like to keep this Jira issue open for a few more days.

asfimport commented 3 years ago

Joris Van den Bossche / @jorisvandenbossche: I think it would be good to still extract a test case from your example to add to the test suite, if possible.