apache / arrow

Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing
https://arrow.apache.org/
Apache License 2.0
14.37k stars 3.49k forks

[Python] Dictionary values are not round-tripping properly from and to pandas #38489

Open fjetter opened 11 months ago

fjetter commented 11 months ago

Describe the bug, including details regarding any error messages, version, and platform.

Round-tripping dictionary values that are part of a pandas DataFrame causes the output dictionaries to be normalized to a common schema, so keys show up that are not in the original data.

import pyarrow as pa
import pandas as pd
df = pd.DataFrame({"dicts": [
    {"foo": "bar"},
    {"bar": "baz"},
    {"another": "key"}
]})


pa.Table.from_pandas(df).to_pandas()


This is surprising to users, and it can also grow out of control: every row ends up carrying every key that appears anywhere in the column.

Component(s)

Python

jorisvandenbossche commented 8 months ago

Sorry for the slow reply! See my answer here for a similar issue that was reported in pandas: https://github.com/pandas-dev/pandas/issues/56842#issuecomment-1896071745

The bottom line is that in the Python-to-Arrow conversion, there are two options for Python dicts: struct type or map type. While a map type is the obvious choice in this example, the default is to convert to a struct, which has a fixed set of keys and thus fills all missing keys with null values.

So if you 1) specify a schema when converting to Arrow, and 2) ask for dicts on the conversion back to Python, you get a full round trip:

>>> schema = pa.schema([("dicts", pa.map_(pa.string(), pa.string()))])
>>> pa.Table.from_pandas(df, schema=schema).to_pandas()
              dicts
0      [(foo, bar)]
1      [(bar, baz)]
2  [(another, key)]
>>> pa.Table.from_pandas(df, schema=schema).to_pandas(maps_as_pydicts="strict")
                dicts
0      {'foo': 'bar'}
1      {'bar': 'baz'}
2  {'another': 'key'}