chore(python): Restructure buffer packing to support nulls and improve performance

First, this PR replaces the uninformative error that was previously raised whenever building an Array failed (closes #423). The error is now:

import nanoarrow as na
na.Array([1, 2, 3])
#> ValueError
#> ...
#> An error occurred whilst converting object of type list to nanoarrow.c_array_stream or nanoarrow.c_array: 
#> schema is required for CArray import from iterable
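
The message above has the shape of a wrapped exception: an outer sentence naming the input type, followed by the underlying cause. A minimal pure-Python sketch of that pattern is below. Both `to_c_array` and `build_array` are hypothetical stand-ins written for illustration, not nanoarrow functions, and the real implementation may differ.

```python
def to_c_array(obj, schema=None):
    # Hypothetical stand-in for a low-level converter that needs a schema.
    if schema is None:
        raise ValueError("schema is required for CArray import from iterable")
    return list(obj)


def build_array(obj, schema=None):
    # Hypothetical wrapper: re-raise with context about what was being converted.
    try:
        return to_c_array(obj, schema)
    except Exception as e:
        raise ValueError(
            f"An error occurred whilst converting object of type "
            f"{type(obj).__name__} to an array: \n {e}"
        ) from e


try:
    build_array([1, 2, 3])
except ValueError as err:
    message = str(err)       # outer context plus the original reason
    cause = err.__cause__    # the original exception is kept via "from e"
```

Chaining with `from e` keeps the original traceback available while the outer message tells the user which input triggered the failure.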

Second, this PR adds support for None in iterables. This makes it much more convenient to create arrays with nulls (closes #424).

import nanoarrow as na
na.Array([1, 2, None, 4], na.int32())
#> nanoarrow.Array<int32>[4]
#> 1
#> 2
#> None
#> 4
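
For readers unfamiliar with how nulls are stored, here is a rough standard-library sketch of packing an iterable containing None into an Arrow-style validity bitmap (one bit per element, least-significant bit first, 1 meaning valid) plus a data buffer. The function name `pack_with_nulls` and its return shape are assumptions made for illustration; this is not the code added by the PR.

```python
import struct


def pack_with_nulls(values, fmt="<i"):
    """Pack values into (validity_bitmap, data, null_count); None marks a null."""
    item = struct.Struct(fmt)
    data = bytearray()
    validity = bytearray()
    null_count = 0
    for i, value in enumerate(values):
        if i % 8 == 0:
            validity.append(0)  # start a new bitmap byte every 8 elements
        if value is None:
            null_count += 1
            data += bytes(item.size)  # zeroed placeholder slot for the null
        else:
            validity[i // 8] |= 1 << (i % 8)  # mark element i as valid
            data += item.pack(value)
    return bytes(validity), bytes(data), null_count


validity, data, null_count = pack_with_nulls([1, 2, None, 4])
```

For the input above, bits 0, 1 and 3 of the bitmap are set and the slot for the null holds a zeroed placeholder.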

Finally, this PR tweaks how an iterable is packed into a buffer to fix the poor performance of the previous implementation. The optimizations added were:

The CBufferBuilder now implements the buffer protocol (so that we can use pack_into)
The __len__ attribute is checked to preallocate where possible
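
The two optimizations above can be sketched with the standard library alone, assuming `pack_into` refers to `struct.Struct.pack_into` and using a `bytearray` in place of the CBufferBuilder. The helper name `pack_iterable` is hypothetical; this illustrates the idea rather than the PR's actual code.

```python
import struct


def pack_iterable(iterable, fmt="<d"):
    """Pack an iterable of numbers into a bytearray."""
    item = struct.Struct(fmt)
    if hasattr(iterable, "__len__"):
        # Size is known: allocate once and write each element in place.
        buf = bytearray(len(iterable) * item.size)
        for i, value in enumerate(iterable):
            item.pack_into(buf, i * item.size, value)
        return buf
    # Size unknown (e.g. an iterator): fall back to growing the buffer.
    buf = bytearray()
    for value in iterable:
        buf += item.pack(value)
    return buf


sized = pack_iterable([1.0, 2.0, 3.0])          # preallocated path
unsized = pack_iterable(iter([1.0, 2.0, 3.0]))  # growing path
```

Both paths produce the same bytes; the preallocated path avoids repeated reallocation and the intermediate `bytes` object created by each `pack` call.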

These optimizations give a ~2x improvement over the previous code in general; types that can use the array constructor see the biggest gains (5-6x).
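
If "the array constructor" means the standard library's `array.array` (an assumption; the description does not say so explicitly), the difference between the two paths looks roughly like this. The sketch only shows that both approaches yield identical bytes. Note that `array.array` cannot hold None, so it applies only to null-free input.

```python
import array
import struct

values = [0.5, 1.5, 2.5]

# Element-by-element packing: one Python-level call per value.
item = struct.Struct("=d")  # native byte order, 8-byte double, to match array("d")
packed = bytearray(len(values) * item.size)
for i, value in enumerate(values):
    item.pack_into(packed, i * item.size, value)

# Bulk conversion: the loop over the values runs inside the C implementation.
bulk = array.array("d", values)
```

Because the bulk path skips the per-element Python call, it is the cheaper of the two whenever the input is guaranteed to contain no nulls.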

An example with the biggest gain:

import numpy as np
import nanoarrow as na
import pyarrow as pa

floats = np.random.random(int(1e6))
floats_lst = list(floats)

%timeit pa.array(floats, pa.float64())
#> 1.79 µs ± 9.27 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
%timeit pa.array(floats_lst, pa.float64())
#> 13.8 ms ± 35.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit pa.array(iter(floats_lst), pa.float64())
#> 17.9 ms ± 37.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit na.c_array(floats, na.float64())
#> 5.51 µs ± 25.1 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
%timeit na.c_array(floats_lst, na.float64(nullable=False))
#> 16.5 ms ± 41.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit na.c_array(iter(floats_lst), na.float64(nullable=False))
#> 29.1 ms ± 254 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit na.c_array(floats_lst, na.float64())
#> 43.6 ms ± 484 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit na.c_array(iter(floats_lst), na.float64())
#> 43 ms ± 227 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Before this PR:

%timeit na.c_array(floats, na.float64())
#> 5.66 µs ± 44.4 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
%timeit na.c_array(floats_lst, na.float64())
#> 104 ms ± 187 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit na.c_array(iter(floats_lst), na.float64())
#> 107 ms ± 202 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

There is probably one more PR to come on top of this one to support building variable-length string/binary arrays (and possibly to move some of the building code out of c_lib.py, which is getting a little crowded).

apache / arrow-nanoarrow

chore(python): Restructure buffer packing to support nulls and improve performance #426