apache / arrow-nanoarrow

Helpers for Arrow C Data & Arrow C Stream interfaces
https://arrow.apache.org/nanoarrow
Apache License 2.0

Define the relationship between nanoarrow and C++ #599

Open bkietz opened 2 weeks ago

bkietz commented 2 weeks ago

Currently it's not explicit what level of C++ is recommended for use in the nanoarrow project. The library itself is of course a minimal API in pure C, but a C++ helper library is packaged alongside and the unit tests are all C++.

This is part of a general tension in the project between ergonomics and minimalism of the API. New helper functions are suspect for fear of bloating the API, but this leads to extremely verbose unit tests and code examples.

WRT using unit tests as documentation: although I think it's useful to have complete and self-contained examples of API usage, I think the priority for unit tests should be coverage rather than didactics.

I'd propose:

WillAyd commented 2 weeks ago

This is a great question. I've documented using the C++ layer myself in some YouTube videos on nanoarrow. I've personally been hoping to see it grow a little bit more to make it easier to iterate over some of the nanoarrow structures (for example, I think it would be nice to have an iterator for StructArrays that helps you work with dataframe data), but I also don't have a clear point of view on where exactly the line should be drawn.
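To make the iterator idea concrete, here is a minimal sketch of a range adapter over a single int32 column (a struct array would hold several such columns). `ToyInt32Column`, `Int32Range`, and `SumValid` are hypothetical names invented for illustration and are not part of nanoarrow; the only borrowed fact is the Arrow validity bitmap layout, where bit i (least-significant bit first) marks element i as non-null.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

// Hypothetical, simplified stand-in for one int32 column: a values buffer
// plus an optional validity bitmap (bit i set means element i is non-null,
// least-significant bit first, as in the Arrow columnar format).
struct ToyInt32Column {
  const int32_t* values;
  const uint8_t* validity;  // nullptr means "no nulls"
  int64_t length;
};

// Hypothetical range adapter so a column works in a range-based for loop.
class Int32Range {
 public:
  explicit Int32Range(const ToyInt32Column& column) : column_(column) {}

  class Iterator {
   public:
    Iterator(const ToyInt32Column* column, int64_t i) : column_(column), i_(i) {}

    // Yields std::nullopt for null elements, the value otherwise.
    std::optional<int32_t> operator*() const {
      const uint8_t* bits = column_->validity;
      bool valid = bits == nullptr || ((bits[i_ / 8] >> (i_ % 8)) & 1) != 0;
      if (!valid) return std::nullopt;
      return column_->values[i_];
    }
    Iterator& operator++() {
      ++i_;
      return *this;
    }
    bool operator!=(const Iterator& other) const { return i_ != other.i_; }

   private:
    const ToyInt32Column* column_;
    int64_t i_;
  };

  Iterator begin() const { return Iterator(&column_, 0); }
  Iterator end() const { return Iterator(&column_, column_.length); }

 private:
  ToyInt32Column column_;  // shallow copy: the buffers are not owned
};

// Example consumer: sums the non-null elements of a column.
int64_t SumValid(const ToyInt32Column& column) {
  int64_t total = 0;
  for (std::optional<int32_t> item : Int32Range(column)) {
    if (item.has_value()) total += *item;
  }
  return total;
}
```

The adapter never copies or owns buffers, so it stays a thin view over whatever the C structures already hold. It needs C++17 for `std::optional`.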

paleolimbot commented 2 weeks ago

Definitely a great point that this is not well defined! A few initial thoughts:

this leads to extremely verbose unit tests and code examples.

Things like IPC, ADBC, and integration testing are disproportionately affected here, since they are essentially nanoarrow "users" more so than the nanoarrow C library itself. These are places where there is significant verbosity unrelated to the function being tested (e.g., creating an array to roundtrip through IPC), and where I wish I had things like ArrayFromJSON(). We are a C library and C is verbose, but I think the key to all of this is limiting the C-verbosity to the function being tested.

find out whether the C++ layer is seriously used

I am aware of nanoarrow::UniqueXXX helper usage (Snowflake Python connector, cudf, ADBC) and array stream helpers (ADBC). I recently used the buffer-from-std-vector helpers for some GeoArrow tests also.

does strict minimalism apply to that as well or can the C++ layer grow more freely?

There could definitely be C++ bindings to the C library (or standalone header-only C++ like sparrow)...I don't think that until now there has been anybody with interest in designing and maintaining the interface. C++17 is definitely the way to go here (but there would need to be discussion on the mailing list of whether this is a good idea since it's possibly stepping on Arrow C++'s/sparrow's toes and there would need to be significant interest in contributing/reviewing).

WRT using unit tests as documentation

We definitely should have examples (the fact that we don't have them is just an issue of time) and the intention of the unit testing has never been to replace that. We are a C library, though, and the C library tests will need to have some C library calls in them and should vaguely follow our own advice (e.g., we advise C++ users to use the nanoarrow::UniqueXXX classes to manage cleanup).
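For readers unfamiliar with the cleanup pattern behind the nanoarrow::UniqueXXX helpers mentioned above, the following is a rough sketch of the general idea, not the library's actual implementation. `ToyArray`, `ToyArrayInit`, and `UniqueToyArray` are hypothetical stand-ins; the only convention borrowed from the Arrow C Data Interface is that a non-null `release` callback signals ownership and that the callback sets `release` to null once it has run.

```cpp
#include <cassert>
#include <utility>

// Hypothetical stand-in for a C struct following the Arrow C Data Interface
// convention: a non-null `release` means the struct owns resources, and the
// callback sets `release` to nullptr once it has run.
struct ToyArray {
  int* data;
  void (*release)(ToyArray*);
};

static int g_release_calls = 0;  // counts releases so the sketch can be checked

static void ToyArrayRelease(ToyArray* array) {
  delete[] array->data;
  array->data = nullptr;
  array->release = nullptr;  // mark as released
  ++g_release_calls;
}

static void ToyArrayInit(ToyArray* array, int n) {
  array->data = new int[n]();
  array->release = &ToyArrayRelease;
}

// Hypothetical move-only owner: calls `release` exactly once, if still set.
class UniqueToyArray {
 public:
  UniqueToyArray() : array_{nullptr, nullptr} {}
  UniqueToyArray(UniqueToyArray&& other) noexcept : array_(other.array_) {
    other.array_ = ToyArray{nullptr, nullptr};  // moved-from owns nothing
  }
  UniqueToyArray& operator=(UniqueToyArray&& other) noexcept {
    if (this != &other) {
      reset();
      array_ = other.array_;
      other.array_ = ToyArray{nullptr, nullptr};
    }
    return *this;
  }
  UniqueToyArray(const UniqueToyArray&) = delete;
  UniqueToyArray& operator=(const UniqueToyArray&) = delete;
  ~UniqueToyArray() { reset(); }

  ToyArray* get() { return &array_; }

  void reset() {
    if (array_.release != nullptr) array_.release(&array_);
  }

 private:
  ToyArray array_;
};

// Returns how many releases happen while the scope below is alive and torn down.
int CountReleasesInScope() {
  int before = g_release_calls;
  {
    UniqueToyArray a;
    ToyArrayInit(a.get(), 8);
    UniqueToyArray b = std::move(a);  // ownership moves; still a single release
  }
  return g_release_calls - before;
}
```

Because the wrapper only forwards to the struct's own `release` callback, it adds no new ABI surface: the C struct remains the unit of exchange, and C++ callers simply stop writing cleanup code by hand.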

lidavidm commented 2 weeks ago

WRT using unit tests as documentation: although I think it's useful to have complete and self-contained examples of API usage, I think the priority for unit tests should be coverage rather than didactics.

Just chiming in here: while it's not the goal here, this might be useful (as janky as it is): https://github.com/apache/arrow-adbc/blob/main/docs/source/ext/adbc_cookbook.py

Demo here: https://arrow.apache.org/adbc/current/cpp/quickstart.html

eddelbuettel commented 2 weeks ago

Thumbs-up for building on top of nanoarrow for use via C++.

The fact that we have a minimal C interface here is so valuable for more complex build setups, as (just about) everybody can use a C-level foreign-function interface. And placing elegant and usable C++ header-only layers on top is then pure magic. I have been told this is called the hourglass principle. Last time this came up (as well as above) @paleolimbot pointed at sparrow, so what would be a differentiator to that (alpha?) project?
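As an illustration of the hourglass shape described above, here is a small self-contained sketch: a narrow C ABI in the middle, with a header-only C++ class layered on top that turns error codes into exceptions. Every name in it (`ToyBuilder`, `ToyBuilderAppend`, `Builder`, the "negative values are rejected" rule) is made up for the example and has nothing to do with nanoarrow's real API.

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>
#include <vector>

// --- Narrow waist: a hypothetical C ABI (all names are illustrative) ---
extern "C" {
typedef struct ToyBuilder ToyBuilder;  // opaque to callers
ToyBuilder* ToyBuilderNew(void);
int ToyBuilderAppend(ToyBuilder* builder, int32_t value);  // 0 on success
int64_t ToyBuilderSize(const ToyBuilder* builder);
void ToyBuilderFree(ToyBuilder* builder);
}

// --- Implementation hidden behind the C ABI ---
struct ToyBuilder {
  std::vector<int32_t> values;
};

extern "C" ToyBuilder* ToyBuilderNew(void) { return new ToyBuilder(); }

extern "C" int ToyBuilderAppend(ToyBuilder* builder, int32_t value) {
  // Toy rule for the sketch: null builders and negative values are errors.
  if (builder == nullptr || value < 0) return 1;
  builder->values.push_back(value);
  return 0;
}

extern "C" int64_t ToyBuilderSize(const ToyBuilder* builder) {
  return builder == nullptr ? 0 : static_cast<int64_t>(builder->values.size());
}

extern "C" void ToyBuilderFree(ToyBuilder* builder) { delete builder; }

// --- Top of the hourglass: a header-only C++ layer over the C ABI ---
class Builder {
 public:
  Builder() : impl_(ToyBuilderNew()) {}
  ~Builder() { ToyBuilderFree(impl_); }
  Builder(const Builder&) = delete;
  Builder& operator=(const Builder&) = delete;

  // Translates the C error code into an exception.
  void Append(int32_t value) {
    if (ToyBuilderAppend(impl_, value) != 0) {
      throw std::invalid_argument("ToyBuilderAppend failed");
    }
  }

  int64_t size() const { return ToyBuilderSize(impl_); }

 private:
  ToyBuilder* impl_;
};
```

Only the four `extern "C"` functions cross the library boundary, so bindings for R, Python, or any other language with a C FFI can target the same waist that the C++ class uses.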

bkietz commented 2 weeks ago

IIUC sparrow is designed for consumption exclusively as a C++20 library. Although this makes their API seriously expressive, it makes the library less accessible as an ingredient in other APIs. As I understand it, nanoarrow's priority is ease of consumption of the core C API; a primary example would be its own R and Python bindings:

In light of this, I think strictly framing nanoarrow's C++ layer as just one more higher-level language binding makes the most sense. I think it's a goal to provide a usefully, predictably uniform kernel of functions across languages in the same original spirit of the C-ABI itself, almost like an extension of the C-ABI. This is in contrast to sparrow which AFAICT is intended to wrap and use the C-ABI for ergonomic and idiomatic C++20 usage.

We are a C library, though, and the C library tests will need to have some C library calls in them

I'm hesitant to agree with this since it feels that the implication is once again that the unit tests should contain mostly recognizable usages of the API. Instead I'd say test writers should have a free hand to use and extend C++ helpers as needed, so that the only C library call which appears in a given unit test is the function exercised by the test. For example, no unit test in ipc/ should need to call ArrowSchemaInit*; that's tested elsewhere and the test writers should use something less verbose to get their schemas.
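A minimal sketch of what that could look like, using invented names throughout: `ToyInt32View` and `ToySum` play the role of a C API under test, and `TestInt32Array` is a hypothetical C++ test helper that hides input construction. None of these exist in nanoarrow, and a real test would use gtest rather than a plain function returning `bool`.

```cpp
#include <cassert>
#include <cstdint>
#include <initializer_list>
#include <vector>

// Hypothetical C-style API: a non-owning view and the function under test.
extern "C" {
struct ToyInt32View {
  const int32_t* values;
  int64_t length;
};

// Function under test: writes the sum to *out and returns 0 on success.
int ToySum(const struct ToyInt32View* view, int64_t* out) {
  if (view == nullptr || out == nullptr) return 1;
  int64_t total = 0;
  for (int64_t i = 0; i < view->length; ++i) total += view->values[i];
  *out = total;
  return 0;
}
}

// Hypothetical C++ test helper: owns the storage and exposes the C view, so
// the test body does not have to build its input through C calls.
class TestInt32Array {
 public:
  TestInt32Array(std::initializer_list<int32_t> values) : storage_(values) {
    view_.values = storage_.data();
    view_.length = static_cast<int64_t>(storage_.size());
  }
  // Copying would leave view_ pointing at the other object's storage.
  TestInt32Array(const TestInt32Array&) = delete;
  TestInt32Array& operator=(const TestInt32Array&) = delete;

  const ToyInt32View* view() const { return &view_; }

 private:
  std::vector<int32_t> storage_;
  ToyInt32View view_;
};

// The only C library call in the test body is the function being exercised.
bool TestSumOfThree() {
  TestInt32Array input{1, 2, 3};
  int64_t sum = 0;
  return ToySum(input.view(), &sum) == 0 && sum == 6;
}
```

The trade-off is that the helper itself becomes code to maintain, but it is written once and keeps each test focused on a single call.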

this might be useful (as janky as it is) https://github.com/apache/arrow-adbc/blob/main/docs/source/ext/adbc_cookbook.py

I love this. I'll definitely consider borrowing it for writing nanoarrow examples.

paleolimbot commented 2 weeks ago

this might be useful (as janky as it is) https://github.com/apache/arrow-adbc/blob/main/docs/source/ext/adbc_cookbook.py

I have been looking for a reasonable way to write tutorials/examples for a long time...thank you for writing it/passing this on!

so that the only C library call which appears in a given unit test is the function exercised by the test

We are on the same page here! I do think that leaning too hard on advanced C++ and/or advanced gtest/gmock in tests limits the ability of those not familiar to contribute, but given that most of the existing tests use pretty much no features of C++/gmock/gtest, I think we can get substantial improvement without going there.

strictly framing nanoarrow's C++ layer as just one more higher-level language binding makes the most sense

Definitely on board if there is interest (and it sounds like there is)!