Open jhostetler opened 1 year ago
For struct fields, if you don't specify the desired type manually, pyarrow indeed will look at all data to infer all possible struct keys, and not infer this from just the first dict. We can see this in a slightly simpler example creating just the array as well:
>>> pa.array([{"start": 0, "end": 1, "tag": "foo"}, {"new": 42}]).type
StructType(struct<end: int64, new: int64, start: int64, tag: string>)
This behaviour of "unioning" all possible keys is intentional, looking at the code:
I think there is certainly something to be said about inferring only from the first dictionary, but I think both options are valid, and since this is longstanding behaviour, I don't think it's something we want to change. If you want more control over the resulting data type, you can create the data type manually and specify it, instead of relying on inference.
Now, that's about the fact that we union all observed keys. But it doesn't seem necessary for that reason to also sort them. I don't know if this was intentional, but I assume this is due to the fact that we use a std::map
in C++ to gather all observed keys, and this is a sorted container (but sorted by keys, and not by insertion order). So this sorting seems to be a side effect of the implementation.
The reason that you don't see this sorting behaviour for the top-level fields of a RecordBatch (or Table) is because this actually doesn't create a StructArray (although StructArray and RecordBatch are certainly similar conceptually), so doesn't go through the general inference code, but takes a different code path.
And it actually also seems to have different behaviour regarding fields in later rows:
>>> pa.RecordBatch.from_pylist([{"start": 0, "end": 1, "tag": "foo"}, {"new": 42}]).schema
start: int64
end: int64
tag: string
Here "new" got ignored (silently, which is maybe also not great), but this is explicitly documented in the from_pylist
documentation that if no schema
is specified, the columns will be inferred from the first row.
Describe the bug, including details regarding any error messages, version, and platform.
Code example:
In the 1st schema, the fields of the
struct
are in the same order as the keys in the input dictionary. In the 2nd schema, where thestruct
is nested inside alist
, the fields of thestruct
have been sorted by name. I would expect the ordering to always be the same order as in the input (like in the 1st schema). The more general principle would be that the input -- or at least the first row of input that's used for schema inference -- should validate against the inferred schema.I suspect this behavior is related to
from_pylist()
accepting lists where the elements are dictionaries with different key sets, such as:In this case I'm guessing it sorts the fields because it needs to come up with a canonical ordering. It seems to me that it should just fail to infer a schema here, since again the inputs are not valid according to the inferred schema.
Component(s)
Python