apache / arrow-nanoarrow

Helpers for Arrow C Data & Arrow C Stream interfaces
https://arrow.apache.org/nanoarrow
Apache License 2.0

feat(python): Add column-wise buffer builder #464

Closed · paleolimbot closed this 1 month ago

paleolimbot commented 1 month ago

This PR implements building columns buffer-wise for the types where this makes sense, along with a few other supporting changes.

Functionally, this means that Array and ArrayStream now have a to_columns_pysequence() method that does something much closer to what somebody converting to columns would expect.

A quick demo:

import nanoarrow as na
import pyarrow as pa

batch = pa.record_batch({"col1": [1, 2, 3], "col2": ["a", "b", "c"]})
batch_with_nulls = pa.record_batch({"col1": [1, None, 3], "col2": ["a", "b", None]})

# Either builds a buffer or a list depending on column types
na.Array(batch).to_columns_pysequence()
#> (['col1', 'col2'],
#>  [nanoarrow.c_lib.CBuffer(int64[24 b] 1 2 3), ['a', 'b', 'c']])

# One can inject a null handler (a few experimental ones provided)
na.Array(batch_with_nulls).to_columns_pysequence(handle_nulls=na.nulls_as_sentinel())
#> (['col1', 'col2'], [array([ 1., nan,  3.]), ['a', 'b', None]])

# ...by default you have to choose how to do this or we error
na.Array(batch_with_nulls).to_columns_pysequence()
#> ValueError: Null present with null_handler=nulls_forbid()
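
Because the buffer-backed columns come back as CBuffer objects, anything that consumes the Python buffer protocol should be able to view them without a copy. A minimal sketch (assuming CBuffer exposes the buffer protocol, as the "pysequence" framing suggests; numpy is used here purely for illustration):

import numpy as np

names, columns = na.Array(batch).to_columns_pysequence()
np.asarray(columns[0])  # zero-copy view of the int64 buffer
#> array([1, 2, 3])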

This will basically get you data frame conversion:

import nanoarrow as na
import pandas as pd

url = "https://github.com/apache/arrow-experiments/raw/main/data/arrow-commits/arrow-commits.arrows"
names, data = na.ArrayStream.from_url(url).to_columns_pysequence(handle_nulls=na.nulls_as_sentinel())
pd.DataFrame({k: v for k, v in zip(names, data)})
#>                                          commit                      time  \
#> 0      49cdb0fe4e98fda19031c864a18e6156c6edbf3c 2024-03-07 02:00:52+00:00   
#> 1      1d966e98e41ce817d1f8c5159c0b9caa4de75816 2024-03-06 21:51:34+00:00   
#> 2      96f26a89bd73997f7532643cdb27d04b70971530 2024-03-06 20:29:15+00:00   
#> 3      ee1a8c39a55f3543a82fed900dadca791f6e9f88 2024-03-06 07:46:45+00:00   
#> 4      3d467ac7bfae03cf2db09807054c5672e1959aec 2024-03-05 16:13:32+00:00   
#> ...                                         ...                       ...   
#> 15482  23c4b08d154f8079806a1f0258d7e4af17bdf5fd 2016-02-17 12:39:03+00:00   
#> 15483  16e44e3d456219c48595142d0a6814c9c950d30c 2016-02-17 12:38:39+00:00   
#> 15484  fa5f0299f046c46e1b2f671e5e3b4f1956522711 2016-02-17 12:38:39+00:00   
#> 15485  cbc56bf8ac423c585c782d5eda5c517ea8df8e3c 2016-02-17 12:38:39+00:00   
#> 15486  d5aa7c46692474376a3c31704cfc4783c86338f2 2016-02-05 20:08:35+00:00   
#> 
#>        files  merge                                            message  
#> 0          2  False  GH-40370: [C++] Define ARROW_FORCE_INLINE for ...  
#> 1          1  False     GH-40386: [Python] Fix except clauses (#40387)  
#> 2          1  False  GH-40227: [R] ensure executable files in `crea...  
#> 3          1  False  GH-40366: [C++] Remove const qualifier from Bu...  
#> 4          1  False  GH-20127: [Python][CI] Remove legacy hdfs test...  
#> ...      ...    ...                                                ...  
#> 15482     73  False  ARROW-4: This provides an partial C++11 implem...  
#> 15483      8  False  ARROW-3: This patch includes a WIP draft speci...  
#> 15484    124  False                 ARROW-1: Initial Arrow Code Commit  
#> 15485      2  False             Update readme and add license in root.  
#> 15486      1  False                                     Initial Commit  
#> 
#> [15487 rows x 5 columns]
paleolimbot commented 1 month ago

I hear everybody on the naming thing! "Column" is not great because it doesn't have a precedent here (the closest thing would be pyarrow.Table.column(), which still returns Arrow arrays), and "Builder" is already used to describe conversion to an array. "Concatenate" might imply that we're returning an Array, which we could do or build using this machinery, but that is not really the desired endpoint here.

There is precedent for the term "convert" in the R bindings ( https://arrow.apache.org/nanoarrow/latest/r/reference/convert_array_stream.html ) and Arrow C++ ( https://github.com/apache/arrow/blob/2dbc5e26dcbc6826b4eb7a330fa8090836f6b727/cpp/src/arrow/util/converter.h#L40 ), and so I gave that terminology a try in the last few commits.

The crux of what these helpers are trying to do is to get a stream of arrays (possibly of indeterminate length) out of Arrow land and represented by something else. The default "something else" has to be limited to the Python standard library because of the zero-dependency constraint, which means "pybuffer or list". In the R bindings you can do things like:

convert_array_stream(stream)  # default conversion
convert_array_stream(stream, tibble::tibble())  # explicit output prototype

Here, visitable.convert() could do the same thing (although it won't in this PR because it's a can of worms, and maybe never if nobody ends up using the high-level interface).

array.convert() # default conversion
array.convert(np.int32) # ...would get you an np.array with dtype int32
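
For comparison, the closest equivalent of the R calls above that this PR can already express is the default pysequence conversion. A rough sketch, assuming ArrayStream can wrap any object exporting the Arrow C stream interface (just as Array wraps batch in the demo above):

import nanoarrow as na

# stream: any object exporting the Arrow C stream interface
# default conversion: each column becomes a pybuffer or a list
names, columns = na.ArrayStream(stream).to_columns_pysequence()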

I will also mark these as "experimental" so that it's clear we're still settling on the terminology/behaviour/scope here.