DyfanJones / noctua

Connect R to Athena using paws SDK (DBI Interface)
https://dyfanjones.github.io/noctua/
Other
45 stars 5 forks source link

dbplyr 2.0.0 breaks in_schema in Noctua #117

Closed OssiLehtinen closed 3 years ago

OssiLehtinen commented 3 years ago

Hi Dyfan,

a situation came up when upgrading to dbplyr 2.0.0:

In the new version, in_schema() returns an object of class "dbplyr_schema" (previously "ident"), and the database and table names are automatically quoted.

As an example:

dbplyr < 2.0.0:

> in_schema("database", "table")
<IDENT> database.table

> class(in_schema("database", "table"))
[1] "ident_q"   "ident"     "character"

dbplyr 2.0.0:

> in_schema("database", "table")
<SCHEMA> `database`.`table`

> class(in_schema("database", "table"))
[1] "dbplyr_schema"

Consequently, in_schema does not work with noctua + dbplyr 2.0.0.

One workaround would be to use tbl(con, sql("database.table"), but this is not too elegant and a big negative is that athena is used for getting the table structure instead of paws::glue(), which makes things much slower.

I'm not sure what would be the best fix for this. Perhaps an AthenaConnection specific in_schema-method? Or perhaps just check for "dbplyr_schema" in addition to "ident" when deciding whether to query glue or athena (https://github.com/DyfanJones/noctua/blob/1e34ca1aeb656485690d031e88b294cffaee2e92/R/dplyr_integration.R#L216), and removing the quote chars (some characters are removed already in any case at https://github.com/DyfanJones/noctua/blob/1e34ca1aeb656485690d031e88b294cffaee2e92/R/dplyr_integration.R#L221)

What are your thoughts on this?

DyfanJones commented 3 years ago

Thanks for highlighting this to me. I wasn't aware of the change to in_schema. To my knowledge in_schema is just a function and I won't be able to create a AthenaConnection method for it. However something like this should work:

#' @rdname backend_dbplyr
db_query_fields.AthenaConnection <- function(con, sql, ...) {

  # check if sql is dbplyr schema
  in_schema <- inherits(sql, c("dbplyr_schema", "ident"))

  if(in_schema) {

    if(inherits(sql, "ident")) {
      if (grepl("\\.", sql)) {
        dbms.name <- gsub("\\..*", "" , sql)
        Table <- gsub(".*\\.", "" , sql)
      } else {
        dbms.name <- con@info$dbms.name
        Table <- sql}
    }

    if(inherits(sql, "dbplyr_schema")){
      dbms.name <- sql$schema
      Table <- sql$table
    }

    # If ident, get the fields from Glue

    tryCatch(
      output <- con@ptr$glue$get_table(DatabaseName = dbms.name,
                                       Name = Table)$Table)

    col_names = vapply(output$StorageDescriptor$Columns, function(y) y$Name, FUN.VALUE = character(1))
    partitions = vapply(output$PartitionKeys,function(y) y$Name, FUN.VALUE = character(1))

    c(col_names, partitions)

  } else { # If a subquery, query Athena for the fields
    # return dplyr methods
    sql_select <- pkg_method("sql_select", "dplyr")
    sql_subquery <- pkg_method("sql_subquery", "dplyr")
    dplyr_sql <- pkg_method("sql", "dplyr")

    sql <- sql_select(con, dplyr_sql("*"), sql_subquery(con, sql), where = dplyr_sql("0 = 1"))
    qry <- dbSendQuery(con, sql)
    on.exit(dbClearResult(qry))

    res <- dbFetch(qry, 0)
    names(res)
  }
}
OssiLehtinen commented 3 years ago

Ah, that's right, I didn't realize the dbplyr_schema object offers those parts separately.

Great and fast work as always!

DyfanJones commented 3 years ago

It looks like it is more complicated than I initially thought. It looks like dbplyr has renamed alot of backend functions from db... to dbplyr... .I will have to spend sometime investigating what has changed.

OssiLehtinen commented 3 years ago

What do you think, will these renamings make it difficult to make things simultaneously compatible with the old and new dbplyr?

DyfanJones commented 3 years ago

@OssiLehtinen from looking into the dbplyr version 2.0.0 package it actually looks like it calls the same db_query_fields function. This means noctua won't need to worry about changing back end functions from 1 version to the other. I have also managed to identify what is causing the issue.

When the parameter sql comes into db_query_fields.AthenaConnection it has already be parsed. Meaning it already has quotes around it for example: '"default"."iris"'

This is causing the issue with the current implementation of db_query_fields.AthenaConnection. To resolve we can just remove the quotation and it should work :)

#' @rdname backend_dbplyr
db_query_fields.AthenaConnection <- function(con, sql, ...) {

  # check if sql is dbplyr schema
  in_schema <- inherits(sql, "ident")

  if(in_schema) {
    if (grepl("\\.", sql)) {
      schema_parts <- gsub('"', "", strsplit(sql, "\\.")[[1]])
    } else {
      schema_parts <- c(con@info$dbms.name, gsub('"', "", sql))}

    # If dbplyr schema, get the fields from Glue
    tryCatch(
      output <- con@ptr$glue$get_table(DatabaseName = schema_parts[1],
                                       Name = schema_parts[2])$Table)

    col_names = vapply(output$StorageDescriptor$Columns, function(y) y$Name, FUN.VALUE = character(1))
    partitions = vapply(output$PartitionKeys,function(y) y$Name, FUN.VALUE = character(1))

    c(col_names, partitions)

  } else { 
    # If a subquery, query Athena for the fields
    # return dplyr methods
    sql_select <- pkg_method("sql_select", "dplyr")
    sql_subquery <- pkg_method("sql_subquery", "dplyr")
    dplyr_sql <- pkg_method("sql", "dplyr")

    sql <- sql_select(con, dplyr_sql("*"), sql_subquery(con, sql), where = dplyr_sql("0 = 1"))
    qry <- dbSendQuery(con, sql)
    on.exit(dbClearResult(qry))

    res <- dbFetch(qry, 0)
    names(res)
  }
}
DyfanJones commented 3 years ago

@OssiLehtinen merged PR #118. Please update your noctua using:

remotes::install_github("dyfanjones/noctua")

I am planning to release these updates to the cran soon. However I would like include some improvements when uploading data to aws athena: https://github.com/DyfanJones/noctua/issues/112 (if you could provide any benchmarking that would be really appreciated).

OssiLehtinen commented 3 years ago

Hi Dyfan, I'm sorry for leaving you hanging. It's just been impossible to find the time to help with the benchmarking.