apache / arrow-nanoarrow

Helpers for Arrow C Data & Arrow C Stream interfaces
https://arrow.apache.org/nanoarrow
Apache License 2.0

Can't build array of type timestamp from iterable #478

Closed aosingh closed 2 weeks ago

aosingh commented 4 months ago

Thanks to the Arrow community for developing this lightweight wrapper.

I am planning to add Apache Arrow support to one of the projects I am working on. The aim is to use nanoarrow to export tabular data in Arrow format.

Users will have access to a function to_arrow():

import nanoarrow as na

def gen_name():
    for i in range(100):
        yield "John Doe"

def gen_age():
    for i in range(100):
        yield 34

def to_arrow():
    results = [na.c_array(gen_name(), na.string()), na.c_array(gen_age(), na.int64())]
    return results

Users of the library can optionally install pyarrow and pandas to work with the exported data. And the export works fine!

import pyarrow as pa
parray = pa.Table.from_arrays(to_arrow(), names=["name", "age"])
print(parray.to_pandas())
        name  age
0   John Doe   34
1   John Doe   34
2   John Doe   34
3   John Doe   34
4   John Doe   34
..       ...  ...
95  John Doe   34
96  John Doe   34
97  John Doe   34
98  John Doe   34
99  John Doe   34

[100 rows x 2 columns]

Adding a third field timestamp to the above list raises an error:

import datetime

def gen_timestamp():
    for i in range(100):
        yield datetime.datetime.now().timestamp()

result = [na.c_array(gen_name(), na.string()),
          na.c_array(gen_age(), na.int64()),
          na.c_array(gen_timestamp(), na.timestamp("s"))]

parray = pa.Table.from_arrays(result, names=["name", "age", "timestamp"])

print(parray.to_pandas())

Error:

Traceback (most recent call last):
  File "/Users/as/nanoarrow/simple.py", line 28, in <module>
    na.c_array(gen_timestamp(), na.timestamp("s"))]
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/as/nanoarrow/arrow-nanoarrow/python/src/nanoarrow/c_array.py", line 131, in c_array
    raise ValueError(
ValueError: An error occurred whilst converting generator to nanoarrow.c_array: 
 Can't build array of type timestamp from iterable

I understand that the error comes from the mapping maintained for each data type.

How can I add support for incrementally building arrays of more data types?

paleolimbot commented 4 months ago

I'm glad this is useful!

We're about to release the current bindings and unfortunately I don't think we can get out-of-the-box support for this before the release (there will probably be another release in early July).

Getting the details right for the full matrix of Arrow date/time/datetime vs. Python date/time/datetime objects is hard; however, if you have full control over the datetime objects you are producing, the workaround is fairly compact (below).

import datetime
import nanoarrow as na

def gen_name():
    for i in range(10):
        yield "John Doe"

def gen_age():
    for i in range(10):
        yield 34

def gen_timestamp():
    for i in range(10):
        yield datetime.datetime.now().timestamp()

def to_arrow():
    # Declare schema with the actual arrow type (timestamp)
    schema = na.struct(
        {"name": na.string(), "age": na.int64(), "timestamp": na.timestamp("ms")}
    )

    # Create column as an int64 array with storage values
    columns = [
        na.c_array(gen_name(), na.string()),
        na.c_array(gen_age(), na.int64()),
        na.c_array((int(t * 1e3) for t in gen_timestamp()), na.int64()),
    ]

    # Skip validation when creating from buffers
    return na.c_array_from_buffers(
        schema,
        length=columns[0].length,
        buffers=[None],
        children=columns,
        validation_level="none",
    )

na.Array(to_arrow())
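
To sanity-check the storage values used in the workaround above, a small stdlib-only sketch can decode an int64 value back into a datetime. The helper name storage_to_datetime is hypothetical (it is not part of nanoarrow), and it assumes the values count units since the Unix epoch in UTC. Nanoseconds are left out because Python datetime objects only resolve to microseconds.

```python
import datetime

EPOCH = datetime.datetime(1970, 1, 1, tzinfo=datetime.timezone.utc)


def storage_to_datetime(value, unit="ms"):
    """Hypothetical helper: decode an integer count of `unit` since the
    Unix epoch into a timezone-aware UTC datetime."""
    ticks_per_second = {"s": 1, "ms": 10**3, "us": 10**6}[unit]
    # Split into whole seconds and the sub-second remainder (in `unit`).
    seconds, remainder = divmod(value, ticks_per_second)
    micros = remainder * (10**6 // ticks_per_second)
    return EPOCH + datetime.timedelta(seconds=seconds, microseconds=micros)


# Same arithmetic as the workaround: float seconds -> integer milliseconds.
original = datetime.datetime(
    2024, 1, 1, 12, 30, 15, 250000, tzinfo=datetime.timezone.utc
)
stored = int(original.timestamp() * 1e3)
decoded = storage_to_datetime(stored, "ms")
```

Note that int(t * 1e3) goes through a float and truncates toward zero, so sub-millisecond detail is dropped and values very close to a millisecond boundary may land one tick off.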

If you'd like to help add support, one way would be to add a method to the ArrayFromIterableBuilder:

https://github.com/apache/arrow-nanoarrow/blob/c413d69f78eedefac378a318390e808b5a16e6b9/python/src/nanoarrow/c_array.py#L481-L484

then add a line in the mapping that you linked with a mapping from CArrowType.TIMESTAMP to the name of the method you added:

https://github.com/apache/arrow-nanoarrow/blob/c413d69f78eedefac378a318390e808b5a16e6b9/python/src/nanoarrow/c_array.py#L531-L535

Getting the details right with respect to timezones and units is hard, but the logic is essentially the reverse of the conversion in the other direction:

https://github.com/apache/arrow-nanoarrow/blob/c413d69f78eedefac378a318390e808b5a16e6b9/python/src/nanoarrow/iterator.py#L382-L403
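
As a rough illustration of what such a conversion might involve (this is not the nanoarrow implementation), here is a minimal stdlib-only sketch that turns Python datetime objects into integer storage values for a given unit. The helper name datetime_to_storage is hypothetical, and treating naive datetimes as UTC is an assumption made for the example; a real builder method would have to decide how naive and aware values relate to the timezone of the Arrow type.

```python
import datetime

# Storage ticks per second for each Arrow time unit.
TICKS_PER_SECOND = {"s": 1, "ms": 10**3, "us": 10**6, "ns": 10**9}
EPOCH = datetime.datetime(1970, 1, 1, tzinfo=datetime.timezone.utc)


def datetime_to_storage(dt, unit="us"):
    """Hypothetical helper: convert a datetime to an integer count of
    `unit` since the Unix epoch. Assumption: naive datetimes are UTC."""
    if dt.tzinfo is None:
        dt = dt.replace(tzinfo=datetime.timezone.utc)
    delta = dt - EPOCH
    # Integer arithmetic only, to avoid float rounding errors.
    micros = (delta.days * 86_400 + delta.seconds) * 10**6 + delta.microseconds
    ticks = TICKS_PER_SECOND[unit]
    if ticks >= 10**6:
        # "us" and "ns": scale up from microseconds.
        return micros * (ticks // 10**6)
    # "s" and "ms": scale down, flooring toward negative infinity.
    return micros // (10**6 // ticks)
```

The integers this produces could then be passed to na.c_array(..., na.int64()) in the same way as the workaround earlier in this comment.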

aosingh commented 4 months ago

Thank you, this is helpful.

Let me think through the details for Python datetime/date support and the test cases.

paleolimbot commented 2 weeks ago

Closed in #478!