apache / arrow

Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing
https://arrow.apache.org/
Apache License 2.0

[Python] `pa.Table.from_pylist` support list of tuples? #43435

Open alanhdu opened 1 month ago

alanhdu commented 1 month ago

Describe the enhancement requested

I have a function that returns an iterator of tuples and would like to turn that into a PyArrow table. I have the column names separately and would like to use PyArrow's type inference for the actual types.

I can sort of get what I want with something like:

import pandas as pd
import pyarrow as pa

pa.Table.from_pandas(
    pd.DataFrame.from_records(tuples, columns=column_names)
)

But this doesn't quite work, since pandas casts integer columns that contain nulls to floats. I can obviously also do this "manually" (e.g. via pa.Table.from_pylist([dict(zip(column_names, row)) for row in tuples]) or something), but I'm wondering if there's a faster way to do this.

Component(s)

Python

jorisvandenbossche commented 1 month ago

@alanhdu thanks for the issue. I think this is indeed something that from_pylist could support, and a good feature request.

In that case, the schema argument should probably be required (or it could be relaxed to just a list of column names).

The current implementation lives here:

https://github.com/apache/arrow/blob/62fd98704dbe2684018707a7b135751fa7bfbe5a/python/pyarrow/table.pxi#L6160-L6197

(it's in a Cython file, but essentially it's just pure Python in this case)

I think it should be relatively straightforward to edit that to also support tuples (or to have a variant that supports tuples).

buaazhwb commented 1 month ago

take

buaazhwb commented 2 weeks ago

For now, we can regard from_pydict as the method for creating a Table from column-oriented data, since its input is a mapping from field names to arrays. We also have from_pylist, which creates a Table from row-oriented data. But from_pylist requires the data and the field names to be bound together, because it takes a dict per row. It seems we need an API that supports creating a table from separate data and schema. I initially planned to add a new method, e.g. from_pytuple, for this, but that method would also accept list rows, not only tuples, so the name would be confusing. I have now decided to edit from_pylist to support this instead. Any suggestions? Thanks! @jorisvandenbossche