Open aersam opened 8 months ago
Btw, Polars does this "correctly":
import polars
print(polars.from_dicts([{"a": 1, "b": 2}, {"a": 3, "b": 4, "c": 5}]))
gives:
│ a   ┆ b   ┆ c    │
│ --- ┆ --- ┆ ---  │
│ i64 ┆ i64 ┆ i64  │
│ 1   ┆ 2   ┆ null │
│ 3   ┆ 4   ┆ 5    │
The current behaviour is somewhat documented:
schema : Schema, default None
If not passed, will be inferred from the first row of the
mapping values.
But I agree this might be surprising, or in any case that it can be useful to change that behaviour.
A short-term workaround is to use pa.array to convert the list of dicts to a StructArray, which has the desired behaviour; this array can then be converted zero-copy to a batch or table:
In [4]: arr = pa.array([{"a": 1, "b": 2}, {"a": 3, "b": 4, "c": 5}])
In [5]: batch = pa.RecordBatch.from_struct_array(arr)
In [6]: batch
Out[6]:
pyarrow.RecordBatch
a: int64
b: int64
c: int64
----
a: [1,3]
b: [2,4]
c: [null,5]
In [7]: batch.to_pandas()
Out[7]:
   a  b    c
0  1  2  NaN
1  3  4  5.0
I suppose for larger data, this should actually also be faster, and so we should maybe consider using that under the hood as well.
Describe the enhancement requested
This:
results in
I think it's kind of okay, but also a bit surprising. I think there should at least be a parameter so that the first result would also include column "c":
There are APIs that do not return properties when they are null; if you call from_pylist on the output of such an API, you will lose data. In my case it was the Power BI API.
One can of course pass the schema to work around this, but that's not always a good option (e.g. you'll lose any additional new properties).
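If neither a fixed schema nor the pa.array route fits, another option is to normalise the rows before handing them to from_pylist. Here is a minimal pure-Python sketch; `normalize_rows` is a hypothetical helper name, not part of pyarrow. It relies on the documented behaviour that column names are taken from the first row, so it makes every row carry the union of all keys:

```python
def normalize_rows(rows):
    """Return copies of `rows` in which every dict has the union of all keys.

    Keys missing from a row are filled with None. Column order follows the
    order in which keys are first seen across the rows.
    """
    # A dict preserves insertion order, so it works as an ordered set of keys.
    columns = {}
    for row in rows:
        for key in row:
            columns.setdefault(key, None)
    return [{key: row.get(key) for key in columns} for row in rows]


rows = [{"a": 1, "b": 2}, {"a": 3, "b": 4, "c": 5}]
normalized = normalize_rows(rows)
print(normalized)
```

The normalised list could then be passed to from_pylist without an explicit schema, so properties that newly appear in later rows are kept. This costs an extra pass over the data, and I am assuming the column types are still inferred correctly when the first value of a column is None.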
Component(s)
Python