fjetter opened 11 months ago
Sorry for the slow reply! See my answer here for a similar issue that was reported in pandas: https://github.com/pandas-dev/pandas/issues/56842#issuecomment-1896071745
Bottom line is that in the Python-to-Arrow conversion, there are two options for representing Python dicts: the struct type or the map type. While in this example a map type is the obvious choice, the default is to convert to struct, which has a fixed set of fields and therefore fills all missing keys with null values.
So if you 1) specify a schema when converting to Arrow, and 2) request dicts on the conversion back to Python, you get a full roundtrip:
```python
>>> schema = pa.schema([("dicts", pa.map_(pa.string(), pa.string()))])
>>> pa.Table.from_pandas(df, schema=schema).to_pandas()
              dicts
0      [(foo, bar)]
1      [(bar, baz)]
2  [(another, key)]
>>> pa.Table.from_pandas(df, schema=schema).to_pandas(maps_as_pydicts="strict")
              dicts
0    {'foo': 'bar'}
1    {'bar': 'baz'}
2  {'another': 'key'}
```
Describe the bug, including details regarding any error messages, version, and platform.
Round-tripping dictionary values as part of a pandas DataFrame normalizes the output dictionaries to a common schema, so keys show up that are not present in the original data.
This is surprising to users, and the set of spurious keys grows with every distinct key that appears anywhere in the column.
Component(s)
Python