[Python] Dictionary values are not round-tripping properly from and to pandas

apache / arrow

Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing

Apache License 2.0

Sorry for the slow reply! See my answer here for a similar issue that was reported in pandas: https://github.com/pandas-dev/pandas/issues/56842#issuecomment-1896071745

The bottom line is that the Python-to-Arrow conversion has two options for Python dicts: struct type or map type. A map type is the obvious choice in this example, but the default is to convert dicts to a struct, which has a fixed set of fields, so every key missing from a given row is filled with a null value.

So if you 1) specify a schema with a map type when converting to Arrow, and 2) pass maps_as_pydicts when converting back to pandas so that you get dicts, you get a full round trip:

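>>> import pandas as pd
>>> import pyarrow as pa
>>> # df is reconstructed here from the output shown below
>>> df = pd.DataFrame({"dicts": [{"foo": "bar"}, {"bar": "baz"}, {"another": "key"}]})
>>> schema = pa.schema([("dicts", pa.map_(pa.string(), pa.string()))])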
>>> pa.Table.from_pandas(df, schema=schema).to_pandas()
              dicts
0      [(foo, bar)]
1      [(bar, baz)]
2  [(another, key)]
>>> pa.Table.from_pandas(df, schema=schema).to_pandas(maps_as_pydicts="strict")
                dicts
0      {'foo': 'bar'}
1      {'bar': 'baz'}
2  {'another': 'key'}

[Python] Dictionary values are not round-tripping properly from and to pandas #38489
