[Python] Schema inference reorders fields in nested structs

apache / arrow

>>> import pyarrow
>>> pyarrow.RecordBatch.from_pylist([{"start": 0, "end": 1, "tag": "foo"}]).schema
start: int64
end: int64
tag: string
>>> pyarrow.RecordBatch.from_pylist([{"spans": [{"start": 0, "end": 1, "tag": "foo"}]}]).schema
spans: list<item: struct<end: int64, start: int64, tag: string>>
  child 0, item: struct<end: int64, start: int64, tag: string>
      child 0, end: int64
      child 1, start: int64
      child 2, tag: string

>>> pyarrow.RecordBatch.from_pylist([{"spans": [{"start": 0, "end": 1, "tag": "foo"}, {"new": 42}]}]).schema
spans: list<item: struct<end: int64, new: int64, start: int64, tag: string>>
  child 0, item: struct<end: int64, new: int64, start: int64, tag: string>
      child 0, end: int64
      child 1, new: int64
      child 2, start: int64
      child 3, tag: string

For struct fields, if you don't specify the desired type manually, pyarrow does look at all of the data to infer every possible struct key, rather than inferring the keys from just the first dict. We can see this in a slightly simpler example that creates just the array:

>>> import pyarrow as pa
>>> pa.array([{"start": 0, "end": 1, "tag": "foo"}, {"new": 42}]).type
StructType(struct<end: int64, new: int64, start: int64, tag: string>)

This behaviour of "unioning" all possible keys is intentional, looking at the code:

https://github.com/apache/arrow/blob/1d74483fa0659aebc0cb1dfb771ba38800166bf2/python/pyarrow/src/arrow/python/inference.cc#L642-L644

I think there is certainly something to be said for inferring only from the first dictionary, but both options are valid, and since this is longstanding behaviour, I don't think it's something we want to change. If you want more control over the resulting data type, you can create the data type manually and pass it explicitly, instead of relying on inference.

Now, that covers the fact that we union all observed keys. But that alone doesn't make it necessary to also sort them. I don't know if this was intentional, but I assume it is because we use a std::map in C++ to gather all observed keys, and that is a sorted container (ordered by key, not by insertion order). So the sorting seems to be a side effect of the implementation.
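To make the difference concrete, here is a small pure-Python analogy. This is not the actual C++ inference code, just an illustration of how a sorted container and an insertion-ordered one produce different field orders for the same union of keys (both function names are made up):

```python
def union_keys_sorted(rows):
    """Union of all keys in sorted order, mimicking a sorted map."""
    seen = set()
    for row in rows:
        seen.update(row)
    return sorted(seen)


def union_keys_insertion_order(rows):
    """Union of all keys in first-seen order (Python dicts keep insertion order)."""
    seen = {}
    for row in rows:
        for key in row:
            seen.setdefault(key, None)
    return list(seen)


rows = [{"start": 0, "end": 1, "tag": "foo"}, {"new": 42}]
print(union_keys_sorted(rows))           # ['end', 'new', 'start', 'tag']
print(union_keys_insertion_order(rows))  # ['start', 'end', 'tag', 'new']
```

The first result matches the field order seen in the inferred struct type above; the second is what an insertion-ordered union would give.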

The reason you don't see this sorting behaviour for the top-level fields of a RecordBatch (or Table) is that building one doesn't actually create a StructArray (although StructArray and RecordBatch are certainly similar conceptually), so it doesn't go through the general inference code but takes a different code path.

And it actually also seems to have different behaviour regarding fields in later rows:

>>> pa.RecordBatch.from_pylist([{"start": 0, "end": 1, "tag": "foo"}, {"new": 42}]).schema
start: int64
end: int64
tag: string

Here "new" got ignored (silently, which is maybe also not great), but the from_pylist documentation explicitly states that if no schema is specified, the columns are inferred from the first row.

[Python] Schema inference reorders fields in nested structs #34250
