apache / arrow

Apache Arrow is the universal columnar format and multi-language toolbox for fast data interchange and in-memory analytics
https://arrow.apache.org/
Apache License 2.0

[Python] from_pylist should allow a parameter to scan more records for columns #40124

Open aersam opened 8 months ago

aersam commented 8 months ago

Describe the enhancement requested

This:

import pyarrow

print(pyarrow.Table.from_pylist([{"a": 1, "b": 2}, {"a": 3, "b": 4, "c": 5}]))

print(pyarrow.Table.from_pylist([{"a": 1, "b": 2, "c": 5}, {"a": 3, "b": 4}]))

results in

first print:
a: int64     
b: int64
----
a: [[1,3]]
b: [[2,4]]

second print:
a: int64     
b: int64
c: int64
----
a: [[1,3]]
b: [[2,4]]
c: [[5,null]]

I think it's kind of okay, but also a bit surprising. I think there should at least be a parameter so that the first result would also include column "c":

a: int64
b: int64
c: int64
----
a: [[1,3]]
b: [[2,4]]
c: [[null,5]]

There are APIs that do not return a property when its value is null; if you call from_pylist on the output of such an API, you will lose data. In my case the API was the Power BI API.

One can of course pass the schema to work around this, but that's not always a good option (e.g., you'll lose additional new properties).

Component(s)

Python

aersam commented 8 months ago

Btw, Polars does this "correctly":

import polars

print(polars.from_dicts([{"a": 1, "b": 2}, {"a": 3, "b": 4, "c": 5}]))

gives:
│ a   ┆ b   ┆ c    │
│ --- ┆ --- ┆ ---  │
│ i64 ┆ i64 ┆ i64  │
│ 1   ┆ 2   ┆ null │
│ 3   ┆ 4   ┆ 5    │
jorisvandenbossche commented 8 months ago

The current behaviour is somewhat documented:

    schema : Schema, default None
        If not passed, will be inferred from the first row of the
        mapping values.

But I agree this might be surprising, or in any case that it can be useful to change that behaviour.

A short-term workaround is to use pa.array to infer the list of dicts as a StructArray, which has the desired behaviour; that array can then be converted zero-copy to a batch or table:

In [4]: arr = pa.array([{"a": 1, "b": 2}, {"a": 3, "b": 4, "c": 5}])

In [5]: batch = pa.RecordBatch.from_struct_array(arr)

In [6]: batch
Out[6]: 
pyarrow.RecordBatch
a: int64
b: int64
c: int64
----
a: [1,3]
b: [2,4]
c: [null,5]

In [7]: batch.to_pandas()
Out[7]: 
   a  b    c
0  1  2  NaN
1  3  4  5.0

I suppose for larger data this should actually also be faster, so we should maybe consider using that under the hood as well.