apache / arrow-adbc

Database connectivity API standard and libraries for Apache Arrow
https://arrow.apache.org/adbc/
Apache License 2.0

format: support multiple result sets #1358

Open CurtHagenlocher opened 9 months ago

CurtHagenlocher commented 9 months ago

Other standard client APIs support multiple result sets; examples include ODBC's SQLMoreResults, JDBC's getMoreResults and ADO.NET's NextResult. It would be nice to support this for ADBC as well.
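
For reference, the ODBC version of the pattern looks roughly like the sketch below; SQLFetch, SQLMoreResults, and SQL_SUCCEEDED are the real ODBC APIs, but the function itself is just an illustration with error handling omitted:

#include <sql.h>
#include <sqlext.h>

/* Consume every result set produced by an already-executed statement.
   SQLMoreResults advances the statement to its next result set and
   returns SQL_NO_DATA once all result sets have been consumed. */
void consume_all_results(SQLHSTMT hstmt) {
  do {
    /* Drain the rows of the current result set. */
    while (SQL_SUCCEEDED(SQLFetch(hstmt))) {
      /* ... read columns with SQLGetData ... */
    }
  } while (SQLMoreResults(hstmt) != SQL_NO_DATA);
}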

lidavidm commented 9 months ago

Looks like we'll want to do another revision soon, possibly also to roll up the device-aware Arrow APIs (CC @zeroshade) and reconsider catalog/schema metadata and statistics (CC @adamkennedy).

CurtHagenlocher commented 3 weeks ago

Support for this will need to contend with three distinct consumption possibilities which correspond to AdbcStatementExecuteQuery, AdbcStatementExecutePartitions, and AdbcStatementExecuteSchema. Here's an opening bid for an API that supports all three:

#define ADBC_STATUS_COMPLETE 15

AdbcStatusCode AdbcStatementNextResult(
    struct AdbcStatement* statement,
    struct ArrowSchema* schema,         // populated for schema-only executions
    struct ArrowArrayStream* out,       // populated for stream-style results
    struct AdbcPartitions* partitions,  // populated for partitioned results
    int64_t* rows_affected,
    struct AdbcError* error);

A driver may support AdbcStatementNextResult while the previous result is still being consumed; a driver which does not must return ADBC_STATUS_INVALID_STATE while previous results are still being read. A driver must return ADBC_STATUS_COMPLETE once there are no more results to read; this indicates normal completion and must not be reported as an error.
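
To make the flow concrete, a caller draining every result set of a stream-style execution might look like the sketch below. AdbcStatementNextResult and ADBC_STATUS_COMPLETE are the proposal above, not existing API, and drain_all is a made-up name:

#include <adbc.h>

/* Sketch only: execute a query that may yield multiple result sets
   and consume them all under the proposed semantics. */
AdbcStatusCode drain_all(struct AdbcStatement* stmt, struct AdbcError* error) {
  struct ArrowArrayStream stream;
  int64_t rows_affected = -1;

  AdbcStatusCode status =
      AdbcStatementExecuteQuery(stmt, &stream, &rows_affected, error);
  while (status == ADBC_STATUS_OK) {
    /* ... consume `stream` fully, then release it ... */
    stream.release(&stream);
    /* Request the next result in the same stream style: partitions is
       NULL, and schema is assumed to be ignorable for this style. */
    status = AdbcStatementNextResult(stmt, /*schema=*/NULL, &stream,
                                     /*partitions=*/NULL, &rows_affected, error);
  }
  /* ADBC_STATUS_COMPLETE is the normal exit; anything else is a failure. */
  return status == ADBC_STATUS_COMPLETE ? ADBC_STATUS_OK : status;
}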

If the original execution was via AdbcStatementExecuteSchema, then the contents of out, partitions and rows_affected are unchanged and the schema of the next result is placed into schema.

Either partitions or out must be null to indicate which style of output the caller wants. Supplying both (or supplying either when doing a schema-only evaluation) must result in ADBC_STATUS_INVALID_ARGUMENT. If the original execution was via AdbcStatementExecuteQuery and the call to AdbcStatementNextResult supplies a valid partitions, or if the original execution was via AdbcStatementExecutePartitions and the call instead supplies out, then the driver may choose to return the data in a different style than the original result set. If it does not (or cannot), it must return ADBC_STATUS_INVALID_ARGUMENT.
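
On the driver side, these argument rules might translate into a check like the following sketch; MyDriverNextResult and was_schema_only_execution are hypothetical names, not from any existing driver:

/* Hypothetical helper: nonzero if the original call was ExecuteSchema. */
extern int was_schema_only_execution(struct AdbcStatement* statement);

static AdbcStatusCode MyDriverNextResult(struct AdbcStatement* statement,
                                         struct ArrowSchema* schema,
                                         struct ArrowArrayStream* out,
                                         struct AdbcPartitions* partitions,
                                         int64_t* rows_affected,
                                         struct AdbcError* error) {
  if (out != NULL && partitions != NULL) {
    return ADBC_STATUS_INVALID_ARGUMENT;  /* both output styles supplied */
  }
  if (was_schema_only_execution(statement) &&
      (out != NULL || partitions != NULL)) {
    return ADBC_STATUS_INVALID_ARGUMENT;  /* schema-only: no data output */
  }
  /* ... produce the next result in the requested style ... */
  return ADBC_STATUS_OK;
}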

lidavidm commented 3 weeks ago

That sounds reasonable to me.

The one nit I'd have is that we usually indicate end-of-stream by setting the output's release to NULL rather than having a special status code.

CurtHagenlocher commented 3 weeks ago

Okay, so instead of returning a different status, we'd return ADBC_STATUS_OK and set the release on schema and out (if supplied) to NULL? That works.
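
For illustration, the exit check in the earlier caller sketch would then become something like this (still hypothetical API):

/* Revised sketch: ADBC_STATUS_OK with stream.release == NULL now
   signals "no more results" instead of ADBC_STATUS_COMPLETE. */
status = AdbcStatementNextResult(stmt, /*schema=*/NULL, &stream,
                                 /*partitions=*/NULL, &rows_affected, error);
if (status != ADBC_STATUS_OK) {
  /* ... a real error occurred ... */
} else if (stream.release == NULL) {
  /* End of results: nothing to consume or release. */
}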

I gather from another issue that we're gearing up for a release soon, and I'm a bit embarrassed to admit that the release schedule is completely opaque to me. Is it the case that both the core Arrow project and the ADBC subproject are on a bimonthly release cadence, but on alternating months?

Regardless, it seems that there's almost certainly not enough time to get this change in for the very next release, but I'd like to try to drive this item and a bunch of other API changes on the backlog and combine them into a 1.2 version of the spec for the subsequent release.

lidavidm commented 3 weeks ago

> I gather from another issue that we're gearing up for a release soon, and I'm a bit embarrassed to admit that the release schedule is completely opaque to me. Is it the case that both the core Arrow project and the ADBC subproject are on a bimonthly release cadence, but on alternating months?

We're independent of the core project. I've been trying to aim for a release every 6-8 weeks (mostly depends on my schedule and if I remember to look at the calendar...)

> Regardless, it seems that there's almost certainly not enough time to get this change in for the very next release, but I'd like to try to drive this item and a bunch of other API changes on the backlog and combine them into a 1.2 version of the spec for the subsequent release.

I think for 1.1, I created a branch and merged everything there over the course of a few releases, then when it was ready did the big merge into the main branch for a release. That might be better since (1) I'd rather batch up changes as much as possible to limit how often we change ABI and (2) that'll give everyone more time to work through changes.