Proposal to implement ADBC driver for Apache Cassandra

SChakravorti21 commented 1 month ago

What feature or improvement would you like to see?

There isn't an existing ADBC driver for Cassandra as far as I can tell, and it would be great to have one! I'm interested in starting this effort as I have experience getting Arrow data to/from Cassandra, and have a little experience working on an ADBC driver for a different database (comdb2). I met @zeroshade at Community Over Code recently, who inspired me to start the discussion around creating a Cassandra driver :)

Some initial thoughts:

Choice of language
- I'm personally most familiar with the Cassandra C/C++ driver as well as Arrow C++. However, if there's good reason to implement the driver in a different language, I'm open to that and happy to get up to speed.
- Matt explained that it would be better to use nanoarrow rather than Arrow C++ as the latter is a heavy dependency and can complicate building/deploying drivers. Using nanoarrow sounds like a good idea to me.

Implementation considerations
- Cassandra currently does not offer any native mechanism for fetching/ingesting data in Arrow format, so we would likely have to implement row ↔ column transposition on the client side (in the driver).
- The Cassandra Query Language (CQL) can be thought of as an extremely limited subset of SQL. This StackOverflow answer is a good overview of the general limitations. I figure this shouldn't matter as far as implementing an ADBC driver is concerned, but thought it was worth mentioning in case I'm wrong.
- Matt also mentioned that there is now an ADBC driver framework. I don't see any reason not to use this. If we find any gaps in the framework while implementing the driver, I'm happy to help fill them in.
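
To make the row ↔ column transposition point concrete, here is a minimal sketch in plain C++ with no nanoarrow or Cassandra driver calls. The names `Row`, `Columns`, `SetBit`, and `Transpose` are hypothetical, and the two-column schema (a nullable int and a nullable text column) is an assumption chosen for illustration. A real driver would presumably append into nanoarrow builders rather than hand-rolled vectors.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <optional>
#include <string>
#include <vector>

// Hypothetical row shape: what a row-oriented client might hand back.
struct Row {
  std::optional<int32_t> id;
  std::optional<std::string> name;
};

// Arrow-style buffers for the two columns: a validity bitmap per column
// (one bit per row, least-significant bit first), fixed-width values for
// the int32 column, and offsets plus concatenated bytes for the strings.
struct Columns {
  int64_t length = 0;
  std::vector<uint8_t> id_validity;
  std::vector<int32_t> id_values;
  std::vector<uint8_t> name_validity;
  std::vector<int32_t> name_offsets{0};
  std::string name_data;
};

// Record row i in an LSB-first bitmap, growing the bitmap as needed.
inline void SetBit(std::vector<uint8_t>& bitmap, int64_t i, bool valid) {
  if (static_cast<size_t>(i / 8) >= bitmap.size()) bitmap.push_back(0);
  if (valid) bitmap[i / 8] |= static_cast<uint8_t>(1u << (i % 8));
}

// Walk the rows once and append each field to its column's buffers.
Columns Transpose(const std::vector<Row>& rows) {
  Columns out;
  for (const Row& row : rows) {
    const int64_t i = out.length++;

    SetBit(out.id_validity, i, row.id.has_value());
    out.id_values.push_back(row.id.value_or(0));  // placeholder slot for NULL

    SetBit(out.name_validity, i, row.name.has_value());
    if (row.name) out.name_data += *row.name;
    // 32-bit offsets cap the string data at INT32_MAX bytes.
    assert(out.name_data.size() <=
           static_cast<size_t>(std::numeric_limits<int32_t>::max()));
    out.name_offsets.push_back(static_cast<int32_t>(out.name_data.size()));
  }
  return out;
}
```

A NULL string contributes no bytes, so its start and end offsets are equal, and the validity bit is what distinguishes it from an empty string.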

First step(s)
- Matt mentioned that, before implementing anything, it would be good to stand up a Cassandra node/cluster in CI so that others can also play around with and contribute to the driver.
- I suppose the next step would be to configure the build system to pull in the necessary dependencies (like the Cassandra C/C++ driver).
- ... Start implementing the driver along with integration tests?

I'd love to hear any other considerations for implementing this ADBC driver and/or recommendations on getting started!

paleolimbot commented 1 month ago

I'm personally most familiar with the Cassandra C/C++ driver as well as Arrow C++.

Matt also mentioned that there is now an ADBC driver framework.

I'm hoping to finish it this week, but there's a work-in-progress tutorial of how to get started building a driver in C++ using nanoarrow/the framework here! https://github.com/apache/arrow-adbc/pull/2186 . Arrow C++ presents a packaging problem (e.g., difficult/impossible to make an R driver wrapper, Python wrapper would require pinning a version of pyarrow until we sort out how to put two different Arrow C++ versions in the same process), which is why Matt probably recommended nanoarrow.

However, if there's good reason to implement the driver in a different language, I'm open to that and happy to get up to speed.

It's a bit subjective, but all our existing drivers lean on the most arrowish SDK available for the driver (e.g., Postgres has libpq for C, so we implemented that in C++; Snowflake and BigQuery have Arrow integrations in their Go connectors, so we wrote those in Go). I have no idea what Cassandra provides, but if it had a fairly complete Go or Rust client already and nothing for C++ that might be a good reason to implement it in those languages. The fact that you know C++ and you're motivated counts for a lot, though!

Matt mentioned that, before implementing anything, it would be good to stand up a Cassandra node/cluster in CI

We have some docker compose services for databases for this purpose. You could do a PR first that makes it so that we can do docker compose up apache-cassandra-test. (Since I think you would be a "first time contributor", this would also make it so that the PR where you actually implement the driver doesn't require one of us to OK the CI jobs after every push). (Apologies if I understand Cassandra too poorly and this is not a good fit!)

so we would likely have to implement row ↔ column transposition on the client side (in the driver).

The Postgres driver has an example of writing tests for this without a live connection to the database (the "copy" tests). No pressure to do it exactly like that but I found it useful to accelerate the process of adding full type support there.
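
One way to get a similar "no live connection" test for Cassandra is to keep the decoding logic a pure function of bytes, then feed it hand-written or captured buffers. The sketch below is not taken from the Postgres copy tests. It assumes the CQL native protocol cell layout as I understand it (a 4-byte big-endian length followed by that many bytes, with a negative length meaning NULL, and `int` encoded as a 4-byte big-endian two's-complement value), which should be checked against the protocol spec. `ReadInt32BE` and `DecodeIntColumn` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Read a big-endian 32-bit signed integer at `pos` and advance `pos`.
inline int32_t ReadInt32BE(const std::vector<uint8_t>& data, size_t& pos) {
  assert(pos + 4 <= data.size());
  const uint32_t v = (static_cast<uint32_t>(data[pos]) << 24) |
                     (static_cast<uint32_t>(data[pos + 1]) << 16) |
                     (static_cast<uint32_t>(data[pos + 2]) << 8) |
                     static_cast<uint32_t>(data[pos + 3]);
  pos += 4;
  return static_cast<int32_t>(v);
}

// Decode `n_rows` length-prefixed cells of a single CQL `int` column.
// No socket or session is involved, so this can run in a plain unit test.
std::vector<std::optional<int32_t>> DecodeIntColumn(
    const std::vector<uint8_t>& data, size_t n_rows) {
  std::vector<std::optional<int32_t>> out;
  out.reserve(n_rows);
  size_t pos = 0;
  for (size_t i = 0; i < n_rows; ++i) {
    const int32_t len = ReadInt32BE(data, pos);
    if (len < 0) {
      out.push_back(std::nullopt);  // negative length marks NULL
      continue;
    }
    assert(len == 4);  // assumed: an `int` cell is exactly 4 bytes
    out.push_back(ReadInt32BE(data, pos));
  }
  return out;
}
```

A test then only needs a byte vector such as `{0,0,0,4, 0,0,0,42}` as input, which makes it cheap to add one case per type while building out type support.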

I'd love to hear any other considerations for implementing this ADBC driver

Where to put it is a good thing to think about...ideally we'd (maybe just speaking for me here) like for ADBC connectors to live with the project instead of with us to spread out the maintenance load (e.g., like DuckDB), but there is also not a straightforward way to implement the validation suite outside this repository (or if there is, nobody has tried it yet!). Probably the easiest place to start is as a PR into apache/arrow-adbc and move it when we sort out those details.

Feel free to ping me early and often as you get started (probably everybody else is game too, but I'll let them volunteer themselves 🙂 ). All of this is helpful for us too since we've all had build setups for ADBC since the beginning and forget the issues encountered by those new to the project.

Start implementing the driver along with integration tests?

🚀

If I had to suggest a place to start, it would be to get a "hello world" example running where you can open and close a connection to the database. Then you could perhaps follow that up with implementing the statement's ExecuteQuery for a case of a single type (int32 or string maybe). All just suggestions (do your thing!)
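
As a rough illustration of the "single type first" approach, the result schema could start from a tiny lookup that maps a CQL type name to an Arrow C Data Interface format string and reports everything else as unsupported. The function name `ArrowFormatForCqlType` is hypothetical and not part of the ADBC driver framework. The format strings `"i"` (int32) and `"u"` (UTF-8 string) come from the Arrow C Data Interface.

```cpp
#include <optional>
#include <string>
#include <string_view>

// Map a CQL type name to an Arrow format string. Returns std::nullopt for
// types the sketch does not handle yet, so a caller can surface a
// "not implemented" error instead of guessing.
std::optional<std::string> ArrowFormatForCqlType(std::string_view cql_type) {
  if (cql_type == "int") return "i";  // 32-bit signed integer
  if (cql_type == "text" || cql_type == "varchar") return "u";  // UTF-8 string
  return std::nullopt;
}
```

Growing the driver's type support then amounts to adding one branch here plus a matching decoder and test per type.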

lidavidm commented 1 month ago

Thanks Dewey for the detailed guide!

I'd say so long as Cassandra in C++ doesn't require gRPC, and you can use nanoarrow (instead of Arrow C++ for the aforementioned reasons), then C++ would be great.

there is also not a straightforward way to implement the validation suite outside this repository (or if there is, nobody has tried it yet!)

We briefly did this when I tried to get Flight SQL into pyarrow instead of having a separate wheel. I don't think it was complicated, though we'd probably have to fix up the CMake definitions again.

SChakravorti21 commented 1 month ago

@paleolimbot @lidavidm Thank you so much for the detailed guidance and insights! My apologies for not being able to respond sooner. I was pretty busy for the past couple of weeks but my schedule has freed up a bit now :)

I'm hoping to finish it this week, but there's a work-in-progress tutorial of how to get started building a driver in C++ using nanoarrow/the framework here!

Perfect timing! I'll be sure to check it out and share any feedback on the tutorial.

Arrow C++ presents a packaging problem (e.g., difficult/impossible to make an R driver wrapper, Python wrapper would require pinning a version of pyarrow until we sort out how to put two different Arrow C++ versions in the same process), which is why Matt probably recommended nanoarrow.

All of that makes sense, thanks for clarifying. I think we've had trouble using Arrow C++ in Python extensions for similar reasons, so I'm totally onboard with using nanoarrow.

It's a bit subjective, but all our existing drivers lean on the most arrowish SDK available for the driver... I have no idea what Cassandra provides, but if it had a fairly complete Go or Rust client already and nothing for C++ that might be a good reason to implement it in those languages.

These are good points! From some quick searching around, these are the main drivers I found in each language:

- C++: DataStax C/C++ Driver. Although it doesn't seem to have much active development, we've been using it for a while in production and it is generally performant and reliable.
- Go: Cassandra GoCQL Driver. Looks like this initially started as an independent project and was recently (as of this year) donated to Cassandra. I've noticed that ScyllaDB is maintaining their own fork of this driver, which appears to be much more actively maintained.
- Rust: the major ones I'm aware of are cassandra-rs, cdrs-tokio, and ScyllaDB's Rust driver. My understanding (based on what I've learned from my colleagues) is that Scylla's driver is the most performant/reliable right now.

A couple of other random observations/thoughts:

- Cassandra's list of documented drivers seems to suggest that DataStax owns/maintains quite a lot of the major drivers.
- If the hope is to eventually donate this ADBC driver to the Cassandra project, maybe using one of the drivers they maintain (like GoCQL) would make it a less controversial suggestion.

The choice does seem like a bit of a toss-up. Like you said, I could be of more help if we pursued this in C++, so I'm still leaning towards that approach.

To address David's point about dependency management, the C/C++ driver lists the following dependencies:

- CMake v2.6.4+
- libuv 1.x
- Kerberos v5 (Heimdal or MIT) *
- OpenSSL v1.0.x or v1.1.x **
- zlib v1.x ***

Certainly no gRPC, but please let me know if any of these seem problematic.

We have some docker compose services for databases for this purpose. You could do a PR first that makes it so that we can do docker compose up apache-cassandra-test.

Sounds good to me! It should be doable, since Cassandra publishes official Docker images: https://hub.docker.com/_/cassandra.

The Postgres driver has an example of writing tests for this without a live connection to the database (the "copy" tests). No pressure to do it exactly like that but I found it useful to accelerate the process of adding full type support there.

I've been wondering how to approach this as well, so I'll definitely take a look at that. Appreciate the pointer.

Where to put it is a good thing to think about... ideally we'd (maybe just speaking for me here) like for ADBC connectors to live with the project instead of with us to spread out the maintenance load...

I agree, it would be nice if the ADBC driver lived under the Cassandra project. This was also discussed in the most recent Arrow community call. The main hurdle seems to be that Cassandra is generally used for more OLTP-style workloads, so it may not be clear to them how Arrow fits into the picture or why an ADBC driver is necessary. I think starting in this repo to prove out the idea and then approaching the Cassandra community would be a viable strategy.

Feel free to ping me early and often as you get started (probably everybody else is game too, but I'll let them volunteer themselves 🙂).

Thank you! I'll definitely reach out as necessary and plan to use draft PRs to get early feedback (if that's ok).

lidavidm commented 1 month ago

Thanks for looking into all this!

The one thing that might be questionable is a hard dependency on OpenSSL 1.x, but I suppose that migration is still ongoing and I don't think anything should pose dependency conflicts AFAIK.

paleolimbot commented 4 weeks ago

To address David's point about dependency management, the C/C++ driver lists the following dependencies:

This shouldn't be a blocker in any way, shape, or form, but those aren't dependencies I can wrap into an R package that goes on CRAN (I can probably wrap it as an R package that doesn't go on CRAN, though, and none of our Go-based drivers go on CRAN either).

ianmcook commented 6 days ago

Cassandra currently does not offer any native mechanism for fetching/ingesting data in Arrow format, so we would likely have to implement row ↔ column transposition on the client side (in the driver).

A couple of years ago @0x26res created an Arrow-based Python driver for Cassandra and gave a great talk about it. That might be something you could build on or at least take a look at.

SChakravorti21 commented 5 days ago

A couple of years ago @0x26res created an Arrow-based Python driver for Cassandra and gave a great talk about it. That might be something you could build on or at least take a look at.

@ianmcook Thank you for sharing this! I agree that decoding the raw bytes received from the database is likely to provide a noticeable performance improvement. @0x26res It's promising that you have a working example of this, and it's definitely worth building on top of. I dug around the C/C++ Cassandra driver and couldn't find an obvious way of getting those raw bytes, however, so we might need to ask the driver maintainers if they'd be open to having such functionality be exposed.

apache / arrow-adbc

Proposal to implement ADBC driver for Apache Cassandra #2245