apache / arrow

Apache Arrow is the universal columnar format and multi-language toolbox for fast data interchange and in-memory analytics
https://arrow.apache.org/
Apache License 2.0
14.59k stars 3.54k forks source link

[Python] Schema inference reorders fields in nested structs #34250

Open jhostetler opened 1 year ago

jhostetler commented 1 year ago

Describe the bug, including details regarding any error messages, version, and platform.

Code example:

>>> import pyarrow
>>> pyarrow.RecordBatch.from_pylist([{"start": 0, "end": 1, "tag": "foo"}]).schema
start: int64
end: int64
tag: string
>>> pyarrow.RecordBatch.from_pylist([{"spans": [{"start": 0, "end": 1, "tag": "foo"}]}]).schema
spans: list<item: struct<end: int64, start: int64, tag: string>>
  child 0, item: struct<end: int64, start: int64, tag: string>
      child 0, end: int64
      child 1, start: int64
      child 2, tag: string

In the 1st schema, the fields of the struct are in the same order as the keys in the input dictionary. In the 2nd schema, where the struct is nested inside a list, the fields of the struct have been sorted by name. I would expect the ordering to always be the same order as in the input (like in the 1st schema). The more general principle would be that the input -- or at least the first row of input that's used for schema inference -- should validate against the inferred schema.

I suspect this behavior is related to from_pylist() accepting lists where the elements are dictionaries with different key sets, such as:

>>> pyarrow.RecordBatch.from_pylist([{"spans": [{"start": 0, "end": 1, "tag": "foo"}, {"new": 42}]}]).schema
spans: list<item: struct<end: int64, new: int64, start: int64, tag: string>>
  child 0, item: struct<end: int64, new: int64, start: int64, tag: string>
      child 0, end: int64
      child 1, new: int64
      child 2, start: int64
      child 3, tag: string

In this case I'm guessing it sorts the fields because it needs to come up with a canonical ordering. It seems to me that it should just fail to infer a schema here, since again the inputs are not valid according to the inferred schema.

Component(s)

Python

jorisvandenbossche commented 1 year ago

For struct fields, if you don't specify the desired type manually, pyarrow indeed will look at all data to infer all possible struct keys, and not infer this from just the first dict. We can see this in a slightly simpler example creating just the array as well:

>>> pa.array([{"start": 0, "end": 1, "tag": "foo"}, {"new": 42}]).type
StructType(struct<end: int64, new: int64, start: int64, tag: string>)

This behaviour of "unioning" all possible keys is intentional, looking at the code:

https://github.com/apache/arrow/blob/1d74483fa0659aebc0cb1dfb771ba38800166bf2/python/pyarrow/src/arrow/python/inference.cc#L642-L644

I think there is certainly something to be said about inferring only from the first dictionary, but I think both options are valid, and since this is longstanding behaviour, I don't think it's something we want to change. If you want more control over the resulting data type, you can create the data type manually and specify it, instead of relying on inference.

Now, that's about the fact that we union all observed keys. But it doesn't seem necessary for that reason to also sort them. I don't know if this was intentional, but I assume this is due to the fact that we use a std::map in C++ to gather all observed keys, and this is a sorted container (but sorted by keys, and not by insertion order). So this sorting seems to be a side effect of the implementation.


The reason that you don't see this sorting behaviour for the top-level fields of a RecordBatch (or Table) is because this actually doesn't create a StructArray (although StructArray and RecordBatch are certainly similar conceptually), so doesn't go through the general inference code, but takes a different code path.

And it actually also seems to have different behaviour regarding fields in later rows:

>>> pa.RecordBatch.from_pylist([{"start": 0, "end": 1, "tag": "foo"}, {"new": 42}]).schema
start: int64
end: int64
tag: string

Here "new" got ignored (silently, which is maybe also not great), but this is explicitly documented in the from_pylist documentation that if no schema is specified, the columns will be inferred from the first row.