apache / arrow-nanoarrow

Helpers for Arrow C Data & Arrow C Stream interfaces
https://arrow.apache.org/nanoarrow
Apache License 2.0
175 stars 38 forks source link

Define the relationship between nanoarrow and C++ #599

Open bkietz opened 3 months ago

bkietz commented 3 months ago

Currently it's not explicit what level of C++ is recommended for use in the nanoarrow project. The library itself is of course a minimal API in pure C, but a C++ helper library is packaged alongside and the unit tests are all C++.

This is part of general tension in the project between ergonomics and minimalism of the API. New helper functions are suspect for fear of bloating the API, but this leads to extremely verbose unit tests and code examples.

WRT using unit tests as documentation: although I think it's useful to have complete and self-contained examples of API usage, I think the priority for unit tests should be coverage rather than didactics.

I'd propose:

WillAyd commented 2 months ago

This is a great question. I've documented using the C++ layer myself in some youtube videos on nanoarrow. I've personally been hoping to see it grow a little bit more to make it easier to iterate some of the nanoarrow structures (for example, I think it would be nice to have an iterator for StructArrays that helps you work with dataframe data), but I also don't have a clear point of view on where exactly the line should be drawn

paleolimbot commented 2 months ago

Definitely a great point that this is not well defined! A few initial thoughts:

this leads to extremely verbose unit tests and code examples.

Things like IPC, ADBC, and integration testing are disproportionately affected here, since they are essentially nanoarrow "users" more so than the nanoarrow C library itself. These are places that there is significant verbosity unrelated to the function being tested (e.g., creating an array to roundtrip through IPC), and where I wish I had things like ArrayFromJSON(). We are a C library and C is verbose, but I think the key to all of this is limiting the C-verbosity to the function being tested.

find out whether the C++ layer is seriously used

I am aware of nanoarrow::UniqueXXX helper usage (Snowflake Python connector, cudf, ADBC) and array stream helpers (ADBC). I recently used the buffer-from-std-vector helpers for some GeoArrow tests also.

does strict minimalism apply to that as well or can the C++ layer grow more freely?

There could definitely be C++ bindings to the C library (or standalone header-only C++ like sparrow)...I don't think that until now there has been anybody with interest in designing and maintaining the interface. C++17 is definitely the way to go here (but there would need to be discussion on the mailing list of whether this is a good idea since it's possibly stepping on Arrow C++'s/sparrow's toes and there would need to be significant interest in contributing/reviewing).

WRT using unit tests as documentation

We definitely should have examples (the fact that we don't have them is just an issue of time) and the intention of the unit testing has never been to replace that. We are a C library, though, and the C library tests will need to have some C library calls in them and should vaguely follow our own advice (e.g., we advise C++ users to use the nanoarrow::UniqueXXX classes to manage cleanup).

lidavidm commented 2 months ago

WRT using unit tests as documentation: although I think it's useful to have complete and self-contained examples of API usage, I think the priority for unit tests should be coverage rather than didactics.

Just chiming in here, while it's not the goal for here, this might be useful (as janky as it is) https://github.com/apache/arrow-adbc/blob/main/docs/source/ext/adbc_cookbook.py

Demo here: https://arrow.apache.org/adbc/current/cpp/quickstart.html

eddelbuettel commented 2 months ago

Thumbs-up for building on top of nanoarrow for use via C++.

The fact that we have a minimal C interface here is so valuable for more complex build setups as (just about) everybody can use a C-level foreign-function interface. And placing elegant and usable C++ header-only layers on top is then pure magic. I have been told this is called the hour-glass principle. Last time this came up (as well as above) @paleolimbot pointed at sparrow so what would be a differentiator to that (alpha ?) project?

bkietz commented 2 months ago

IIUC sparrow is designed for consumption exclusively as a C++20 library. Although this makes their API seriously expressive, it makes the library less accessible as an ingredient in other APIs. As I understand it, nanoarrow's priority is ease of consumption of the core C API; a primary example would be its own R and python bindings:

In light of this, I think strictly framing nanoarrow's C++ layer as just one more higher-level language binding makes the most sense. I think it's a goal to provide a usefully, predictably uniform kernel of functions across languages in the same original spirit of the C-ABI itself, almost like an extension of the C-ABI. This is in contrast to sparrow which AFAICT is intended to wrap and use the C-ABI for ergonomic and idiomatic C++20 usage.

We are a C library, though, and the C library tests will need to have some C library calls in them

I'm hesitant to agree with this since it feels that the implication is once again that the unit tests should contain mostly recognizable usages of the API. Instead I'd say test writers should have a free hand to use and extend C++ helpers as needed, so that the only C library call which appears in a given unit test is the function exercised by the test. For example, no unit test in ipc/ should need to call ArrowSchemaInit*; that's tested elsewhere and the test writers should use something less verbose to get their schemas.

this might be useful (as janky as it is) https://github.com/apache/arrow-adbc/blob/main/docs/source/ext/adbc_cookbook.py

I love this. I'll definitely consider borrowing it for writing nanoarrow examples

paleolimbot commented 2 months ago

this might be useful (as janky as it is) https://github.com/apache/arrow-adbc/blob/main/docs/source/ext/adbc_cookbook.py

I have been looking for a reasonable way to write tutorials/examples for a long time...thank you for writing it/passing this on!

so that the only C library call which appears in a given unit test is the function exercised by the test

We are on the same page here! I do think that leaning too hard on advanced C++ and/or advanced gtest/gmock in tests limits the ability of those not familiar to contribute, but given that most of the existing tests use pretty much no features of C++/gmock/gtest, I think we can get substantial improvement without going there.

strictly framing nanoarrow's C++ layer as just one more higher-level language binding makes the most sense

Definitely on board if there is interest (and it sounds like there is)!

paleolimbot commented 3 weeks ago

Do we want wrappers like:

class Buffer {
  public:
    void SetAllocator();
    void Append(int32_t x) { ArrowBufferAppend(obj_.get(), &x, sizeof(x)); }

    uint8_t* data() { return obj_.data; }
    int64_t size() { return obj_.size_bytes(); }

 private:
  UniqueBuffer obj_;
}

...in the C++ bindings? If we do, what do we use for error handling?

WillAyd commented 3 weeks ago

Are we definitely locked into C++17? If not, I wonder if it wouldn't be better to return a std::span instead of a data pointer.

As far as error handling is concerned, maybe we could use C++ exceptions? Having used this downstream, I've always found the NANOARROW_THROW_NOT_OK macro to be a nice convenience, so might be nice to continue using that pattern for members that can throw

bkietz commented 3 weeks ago

+1 for returning span. FTR, that class can be written in C++11 (arrow-cpp has a polyfill)

paleolimbot commented 3 weeks ago

As far as error handling is concerned, maybe we could use C++ exceptions?

I quite like exceptions although I'm vaguely aware that there's a large and vocal contingent who feel differently. There are some pretty serious performance issues that can result from using them inappropriately (e.g., one time I used them for something like StopIteration and it made the framework I'd written unusably slow), but I don't know that any of the types of errors that usually occur from a non-zero ArrowErrorCode fall into that category. If we did go that route we'd have a great workaround (just use the nanoarrow/hpp/unique.hpp header and make C library calls directly).

Are we definitely locked into C++17?

Anything that we want to use in our own tests still has to be C++11 (we still have a rather important user, DuckDB, who supports C++11, so I think it's reasonable that our tests still run), but we can do things with higher standards if they are fenced with an opt-out define. I don't think we have any users interested in helpful C++ wrappers that are still using C++11 but we do have users that are specifically C++17 that might be (cudf, ADBC), and so I think anything higher than that is maybe not all that useful (but maybe possible with an opt-in define).

I wonder if it wouldn't be better to return a std::span instead of a data pointer.

+1! It looks like that's our ArrowBufferView

paleolimbot commented 3 weeks ago

After #668 merges I'll put together a draft PR. Maybe a better place to start would be the Schema wrapper since most errors that occur there are very small allocation failures of the variety that even non-exception using C++ libraries would end up throwing std::bad_alloc.