apache / arrow-adbc

Database connectivity API standard and libraries for Apache Arrow
https://arrow.apache.org/adbc/
Apache License 2.0
362 stars 89 forks source link

Placeholder syntax in parametrized queries #1397

Open nbenn opened 9 months ago

nbenn commented 9 months ago

Currently there is no way to determine what types of placeholders a backend supports for parametrized queries.

SQLite for example is quite flexible, supporting all of "?", "$i", "$name", ":name" (where "i" represents a parameter position and "name" a parameter name), while Postgres (I believe) only supports "?" and "$i" and cannot handle named parameters.

In what way this information could be communicated and whether it is worthwhile to do so is currently unclear to me. I'm hoping that this could be a place to have such a discussion.

Maybe I can start out explaining a bit why this became relevant tom me and then we see where this goes. It might be that we can sort out my specific issues without introducing such infrastructure at all. Of course this might be interesting to others nonetheless.

For context: I'm writing a high level wrapper in R that aligns the DBI interface with functionality exposed by adbcdrivermanager. As example, consider the following query:

library(adbcdrivermanager)

db <- adbc_database_init(adbcsqlite::adbcsqlite(), uri = ":memory:")
con <- adbc_connection_init(db)

write_adbc(datasets::swiss, con, "swiss", temporary = TRUE)

stmt <- adbc_statement_init(con)
adbc_statement_set_sql_query(
  stmt,
  "SELECT * from swiss WHERE Agriculture < $agr AND Education > $edu"
)
adbc_statement_prepare(stmt)

We have a "parameter schema" with parameter names

names(adbc_statement_get_parameter_schema(stmt)$children)
#> [1] "$agr" "$edu"

At the same time, I want my users to be able to pass parameter values by referring to more natural names such as "agr" and "edu". Something like

execute_query <- function(...) {

  params <- data.frame(...)
  adbc_statement_bind_stream(stmt, params)

  res <- nanoarrow::nanoarrow_allocate_array_stream()
  adbc_statement_execute_query(stmt, res)

  as.data.frame(res)
}

nrow(execute_query(agr = 30, edu = 10))
#> [1] 9

As things stand, it seems that for named parameters, names are not considered and only positions are relevant. This makes the current implementation of parameter binding (at least for SQLite) quite lenient.

nrow(execute_query(agr = 30, typo = 10))
#> [1] 9
nrow(execute_query(edu = 10, agr = 30))
#> [1] 1
nrow(execute_query(agr = 30, edu = "typo"))
#> [1] 0

If I want to properly support named parameters, I have to sort out ordering on my end, as well as potentially raising errors for name/type mismatches. To do this, I need to match names in the parameter schema with my "natural" names. And this is where having information on what kinds of placeholder patterns are supported by the back-end becomes relevant to me.

Cc @krlmlr @paleolimbot

paleolimbot commented 9 months ago

Rather than try to specify some regex or "bind pattern", I wonder if we just want to push the "bind by name" operation into the driver. For example:

adbc_statement_bind_by_name <- function(statement, stream) {
  UseMethod("adbc_statement_bind_by_name")
}

adbc_statement_bind_by_name.default <- function(statement, stream) {
  stop("bind by name not supported by this driver")  # maybe throw a classed error that you can catch
}

The implementation for SQLite, for example, could use the parameter schema method (probably stealing the implementation from RSQLite).

krlmlr commented 9 months ago

RPostgres tests only positional:

https://github.com/r-dbi/RPostgres/blob/c17e5bf728ccb4bb07084d0f882972429e8b0ccc/tests/testthat/helper-DBItest.R#L23

The challenge for DBItest is that it must generate the queries to test against the driver. How would the new generic help?

paleolimbot commented 9 months ago

If it's DBI test that needs this, maybe what we need is:

adbc_driver_dbitest_tweaks <- function(driver) {
  UseMethod("adbc_driver_dbitest_tweaks")
}

adbc_driver_dbitest_tweaks.default <- function(driver) {
  stop("dbitest tweaks not supported by this driver")
}

...since there are likely other driver-specific options that would be better set by the driver package.

krlmlr commented 9 months ago

Do you really want tight coupling between ADBC drivers and DBItest?

paleolimbot commented 9 months ago

It could be specified somewhere else, too (e.g., an adbctest package that is GitHub only), but I don't mind putting DBItest in Suggests and adding the generic. I would eventually like to run DBItest in ADBC's CI (maybe not CRAN's, though)...it makes sense to me that the test configuration lives somewhere in the ADBC repo that is as close to the individual driver as possible.

nbenn commented 9 months ago

As things stand currently, DBItest does not need this. Setting up the DBItest context has to be done per back end anyways and it is there, where the placeholders can be defined explicitly.

It is some intermediary such as adbi, that needs this, as it has to be agnostic of the backend specifics yet might still want to add some smarts such as making sure parameters are passed in the right order (as long as this remains necessary) to isolate users from lower-level specifics.

paleolimbot commented 9 months ago

I agree that it would be very confusing if a query containing $bread, $cheese were passed data.frame(cheese = 1, bread = 2) and the parameters were bound by position. It sounds like the idiomatic way to handle this in DBI is to bind by name if possible.

Testing aside, a simpler method would also be to add a bind_by_position = NA argument to the relevant bind method and warn if NA and error if FALSE (i.e., make somebody write DBIStatmentBind(stmt, some_data_frame, bind_by_position = TRUE) if they don't want a warning.

nbenn commented 9 months ago

@paleolimbot from where I stand, what you proposed with adbc_statement_bind_by_name() would make many of my problems go away here.

Trying to solve my problem specifically, we might also re-consider what names the parameter schema should have exactly. Does it make sense that parameter names come with "placeholder indicators"?

If the schema were to change, the only extra information I'd need is whether named parameter are supported by the back-end or not.

lidavidm commented 9 months ago

For the format/driver side I filed #1398

paleolimbot commented 9 months ago

I think my preference for this specific issue would be:

Alternatively, you could continue to hard-code the string you need in whatever format you need it until the C driver API can provide that information. You would only need it for SQLite, Postgres, Snowflake, and DuckDB.

krlmlr commented 9 months ago

I like the idea of moving the DBItest tests to the respective drivers. @paleolimbot: what kind of support do you need?

paleolimbot commented 9 months ago

See https://github.com/apache/arrow-adbc/pull/1401 . I still get quite a few failures but I imagine I'm not setting it up quite right!