apache / arrow


[Python] Cannot mix struct and non-struct, non-null values error when saving nested types with PyArrow #30648

Open · asfimport opened 2 years ago

asfimport commented 2 years ago

When trying to save a Pandas DataFrame with a nested type (a list within a list, or a list within a dict) using the pyarrow engine, the following error is encountered:

ArrowInvalid: ('cannot mix list and non-list, non-null values', 'Conversion failed for column A with type object')

 

Repro:


import pandas as pd

# Column "A" has a single row whose value mixes scalars and a list.
x = pd.DataFrame({"A": [[24, 27, [1, 1]]]})
x.to_parquet('/tmp/a.pqt', engine="pyarrow")  # raises ArrowInvalid

A bit of googling suggests this is a known Arrow shortcoming. However, this is a commonly encountered data structure, and fastparquet handles it seamlessly. Is there a proposed timeline/plan for fixing this?

Reporter: Karthik

Note: This issue was originally created as ARROW-15142. Please see the migration documentation for further details.

khoatrandata commented 1 year ago

Hi, is there any update on this issue, please?

westonpace commented 1 year ago

Sorry, I'm not entirely sure what type you are looking for. Currently you are providing:

values
24
27
[1, 1]

Columns must be homogeneous within Arrow / Parquet. If you want list-within-list, you should provide:

values
[24]
[27]
[1, 1]

You can achieve this with:

x = pd.DataFrame({"A": [[[24], [27], [1, 1]]]})
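For completeness, a minimal sketch (reusing the illustrative file path from the repro) confirming that the homogeneous list-of-lists version saves without error:

import pandas as pd

# Every element of the row is now a list, so Arrow can infer a single
# list<int64> element type for column "A".
x = pd.DataFrame({"A": [[[24], [27], [1, 1]]]})
x.to_parquet('/tmp/a.pqt', engine="pyarrow")  # succeeds
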
khoatrandata commented 1 year ago

Hi @westonpace, please see this for a more realistic example.

westonpace commented 1 year ago

@khoatrandata if I understand that issue correctly, the user is trying to load a column (with type=jsonb) into Arrow. There is no equivalent Arrow data type (and, as far as I can tell, no one has ever asked for one before). I think a variable-length binary column should be sufficient for many purposes.

It looks like the current approach is to first load the column into Python objects (this gives you a heterogeneous list of Python objects). This list is then passed to pa.array. However, there is no guarantee you will be able to turn that into an Arrow array, and no knowing what the result will be: if all the values are numbers you'll get an int64 array; if all the values are strings you'll get a string array; if the values are mixed you'll get the reported exception.
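A quick sketch of that inference behavior (pa.ArrowInvalid is the exception class raised on failure):

import pyarrow as pa

# Homogeneous inputs infer a single Arrow type:
pa.array([24, 27])    # -> int64 array
pa.array(["a", "b"])  # -> string array

# Mixed scalars and lists cannot be unified, so inference fails:
try:
    pa.array([24, 27, [1, 1]])
except pa.ArrowInvalid as exc:
    print(exc)  # cannot mix list and non-list, non-null values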

If the goal is to go to Parquet and back, then the safest thing to do would be to load the column as binary and save it in Parquet as binary (with your own custom metadata to indicate it is a JSONB field).
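One possible sketch of that approach, assuming the values are JSON-serializable; the metadata key, column name, and file path are all illustrative:

import json
import pyarrow as pa
import pyarrow.parquet as pq

values = [24, 27, [1, 1]]  # heterogeneous Python values

# Encode every value as JSON bytes so the column is homogeneous binary.
data = pa.array([json.dumps(v).encode("utf-8") for v in values], type=pa.binary())

# Custom field metadata records that the bytes are JSON-encoded.
field = pa.field("A", pa.binary(), metadata={"encoding": "json"})
table = pa.Table.from_arrays([data], schema=pa.schema([field]))
pq.write_table(table, "/tmp/a_binary.pqt")

# Round trip: decode the binary values back into Python objects.
restored = [json.loads(v.as_py())
            for v in pq.read_table("/tmp/a_binary.pqt").column("A")]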

You could also create a JSONB extension type based on the variable-length binary data type.
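A minimal sketch of such an extension type using pyarrow's ExtensionType API; the class and the extension name "example.jsonb" are hypothetical:

import pyarrow as pa

class JsonbType(pa.ExtensionType):
    # Wraps variable-length binary storage under a named extension type.
    def __init__(self):
        super().__init__(pa.binary(), "example.jsonb")

    def __arrow_ext_serialize__(self):
        return b""  # no type parameters to persist

    @classmethod
    def __arrow_ext_deserialize__(cls, storage_type, serialized):
        return cls()

# Registration lets Arrow reattach the type when the data is read back.
pa.register_extension_type(JsonbType())

# Wrap an ordinary binary array in the extension type:
storage = pa.array([b'{"a": 1}', b"[1, 2]"], type=pa.binary())
jsonb_array = pa.ExtensionArray.from_storage(JsonbType(), storage)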