OHDSI / DatabaseConnector

An R package for connecting to databases using JDBC.
http://ohdsi.github.io/DatabaseConnector/

DBI refactoring #185

Closed — ablack3 closed this issue 1 year ago

ablack3 commented 2 years ago

DatabaseConnector DBI refactoring RFC

This Request for Comments (RFC) document is heavily inspired by Jake Gillberg's Race & Ethnicity Vocabulary RFC and is an attempt to follow the process outlined here for making "substantial" changes.

Summary

A proposal to refactor DatabaseConnector into two orthogonal, independent components: 1) a DBI backend whose only scope is to implement database driver functionality for all supported database management systems (dbms), and 2) a user-facing front end that provides the functionality currently in DatabaseConnector that is not covered by the DBI interface (e.g. storing connection details, executing multiple SQL statements, and rendering, translating, and then executing SQL).

Motivation

DatabaseConnector is one of the oldest OHDSI R packages, and when it was developed the ecosystem for connecting to databases from R was not as mature as it is today. Currently, nearly all database interactions from R pass through an interface called DBI. DBI is a set of generic methods that must be implemented by various "backend driver packages". What the DBI interface provides is the ability to use any backend driver implementation through a standard interface. There are many database backend implementations available; some provide implementations for a single dbms while others are more general.

"I see that DatabaseConnector is already on the list of DBI supported backends so isn't it already using DBI?"

It is important to understand that DBI is a set of generic methods. DatabaseConnector implements those methods, but its user-facing functions do not actually use them. This means that OHDSI queries are not actually passing through the DBI interface, so we cannot reap the benefit of having a standard interface and swap out different implementations of the backend. Many OHDSI queries are executed by executeSql, which calls an internal function lowLevelExecuteSql, which then hands the query off directly to the JDBC driver, skipping the DBI interface.
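To make the contrast concrete, under this proposal the lowest execution level would bottom out in a DBI generic rather than a direct JDBC call. Here is a minimal sketch of the intended routing (not the current implementation):

lowLevelExecuteSql <- function(connection, sql) {
  # Sketch only: every statement ultimately goes through a DBI generic,
  # so any DBI backend can be plugged in.
  rowsAffected <- DBI::dbExecute(connection, sql)
  invisible(rowsAffected)
}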

There is some flexibility gained by bundling the front-end and back-end logic together, but aligning with the separation of responsibilities used in the rest of the R ecosystem (i.e. passing all database queries through DBI) would open up several significant advantages.

Driver/backend flexibility

While JDBC is a great interface for abstracting away different database driver implementations in Java, passing our queries through DBI would allow any driver implementation to be used with OHDSI tools. For example, RPostgres is a driver for Postgres and Redshift written in C++ that could be worth trying in the OHDSI tool stack. R and C/C++ work well together, since much of R itself is written in C.

Developer capacity

There has been some discussion this year about increasing developer resources. Maintaining the DatabaseConnector backend requires knowledge of R, Java, JDBC, and the particularities of ~10 different database systems. Of course, the maintainer also needs access to all supported database systems. Most other backend packages support a single database and are sometimes maintained by several people or by a company like RStudio. One way to increase our developer capacity would be to use the DBI backend implementations provided by others in the R community instead of maintaining all of the driver functionality on our own.

dplyr support

I spent some time earlier this year attempting to work on "Integration of DatabaseConnector with dplyr" and found it to be a difficult issue to solve completely. Many of the database driver backends already support dplyr, and technically dplyr should work with any backend that supports the DBI interface.

Addition of new database platforms

The current approach to adding database platforms is to either

or

If we eliminate the DatabaseConnectorDbiConnection class and build the front-end functionality only on the assumption that the connection object is a DBIConnection, then adding new database drivers that already have a DBI implementation essentially requires no changes in DatabaseConnector. For example, we could simply use the DBIConnection created by duckdb instead of needing to wrap it in a new class (#153). Of course, SqlRender would still need to support each new database.
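For instance, here is a hypothetical sketch of what using duckdb could look like under this proposal, with the duckdb DBIConnection used directly rather than wrapped (the claim that the front-end helpers would accept this connection is the assumption being proposed, not existing behavior):

library(DBI)

# Hypothetical usage under this proposal: no DatabaseConnector wrapper class needed.
con <- dbConnect(duckdb::duckdb(), dbdir = ":memory:")
dbWriteTable(con, "person", data.frame(person_id = 1:3))
# Front-end helpers like executeSql()/querySql() would accept this connection as-is.
dbGetQuery(con, "SELECT COUNT(*) AS n FROM person")
dbDisconnect(con, shutdown = TRUE)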

Code review

Making use of standardized interfaces seems like a good design principle that helps us think about how our software is coupled together. The fact that the database driver cannot be swapped out illuminates dependencies between the driver and all of the OHDSI R tools (in Hades and beyond). At the very least I think we should review the DatabaseConnector codebase and identify opportunities for improvement.

Design Detail

Performing this refactoring in a sequence of steps that does not break reverse dependencies is one of the more difficult aspects of this proposal for me, so any advice is much appreciated.

The main goal is to make sure all OMOP queries pass through the DBI generic methods so that any DBI backend can be plugged in.

Redefine the createConnectionDetails interface

DBI leaves it up to the backend author to define the signature for making the connection (dbConnect).

The signature for the method that creates the connection is DBI::dbConnect(drv, ...), where the driver object is passed in first and all other arguments are passed in through ....

DatabaseConnector::createConnectionDetails() accepts arguments that correspond with a particular driver backend implementation. This is a problem because it tightly couples connectionDetails to DatabaseConnector's backend implementation.

My suggestion is to introduce a new function:

dbConnectDetails(drv, ...)

This would take and hold connection details in the same way that createConnectionDetails does today, but would have the same signature as DBI::dbConnect and accept exactly the same arguments.

So if you use the following to connect to BigQuery with DBI and the bigrquery backend:

library(DBI)

con <- dbConnect(
  bigrquery::bigquery(),
  project = "publicdata",
  dataset = "samples",
  billing = billing
)

You could use

library(DatabaseConnector)

connectionDetails <- dbConnectDetails(
  bigrquery::bigquery(),
  project = "publicdata",
  dataset = "samples",
  billing = billing
)

con <- connect(connectionDetails)

to store the information needed to connect and then make a connection at a later time using any DBI backend.

DatabaseConnector::connect would simply call DBI::dbConnect.

The current implementation of DatabaseConnector::connect would become DatabaseConnectorDriver::dbConnect()
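Conceptually, connect() would become a thin pass-through. A minimal sketch, assuming for simplicity that the connection details are stored as a plain list of dbConnect() arguments (a later comment shows a more careful, delayed-evaluation implementation):

# Sketch only: connect() just forwards the stored arguments to DBI::dbConnect().
connect <- function(connectionDetails) {
  do.call(DBI::dbConnect, connectionDetails)
}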

Separation of front end and back end functions

Some functions in DatabaseConnector are essentially implementations of DBI generic methods. These would be explicitly mapped to DBI generic functions.

Here is the separation:

---- DatabaseConnector front end (works with any DBI driver) ----

connect(), createConnectionDetails(), createZipFile()

dropEmulatedTempTables(), executeSql(), querySql()

querySqlToAndromeda(), lowLevelQuerySqlToAndromeda() - ?

renderTranslateQuerySql(), renderTranslateExecuteSql(), renderTranslateQueryApplyBatched(), renderTranslateQuerySqlToAndromeda()

lowLevelExecuteSql() - ?, lowLevelQuerySql() - ?

---- Included for backwards compatibility, but essentially thin wrappers around DBI methods (see the sketch after this list) ----

disconnect() <- dbDisconnect()
existsTable() <- dbExistsTable()
getTableNames() <- dbListTables()
insertTable() <- dbWriteTable()

---- DatabaseConnector DBI driver back end ----

DatabaseConnectorDriver()

dbAppendTable(), dbClearResult(), dbColumnInfo(), dbConnect(), dbCreateTable(), dbDisconnect(), dbExecute(), dbExistsTable(), dbFetch(), dbGetQuery(), dbGetRowCount(), dbGetRowsAffected(), dbGetStatement(), dbHasCompleted(), dbIsValid(), dbListFields(), dbListTables(), dbQuoteIdentifier(), dbQuoteString(), dbReadTable(), dbRemoveTable(), dbSendQuery(), dbSendStatement(), dbUnloadDriver(), dbWriteTable()

show()

jdbcDrivers - How to download and use JDBC drivers for the various data platforms.

getAvailableJavaHeapSpace() - make internal? isSqlReservedWord() - make internal?
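As referenced in the list above, here is a sketch of one such backwards-compatibility wrapper. The schema handling shown is an assumption for illustration, not the package implementation:

# Sketch: existsTable() as a thin wrapper around the corresponding DBI generic.
existsTable <- function(connection, databaseSchema, tableName) {
  DBI::dbExistsTable(
    connection,
    DBI::Id(schema = databaseSchema, table = tableName)
  )
}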

I think it probably makes sense to separate the front end and back end into two separate packages (DatabaseConnector & DatabaseConnectorDriver), but that could be done later in development. As long as they are clearly separated inside a single package we can test the ideas.

Replace dependency on DatabaseConnectorConnection objects with dependency on DBIConnection objects in OHDSI tools

I'm not sure to what extent OHDSI tools are using the special features of DatabaseConnectorConnection objects. One common use is the dbms slot that DatabaseConnectorConnection objects have. Instead of directly accessing this slot, I think tools could instead use a function defined in the front end that takes any DBIConnection and returns the dbms. We can generally figure out the database from the class of the connection.
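For example, a hypothetical sketch of such a helper (the class-to-dbms mapping below is an assumption and is not exhaustive):

# Hypothetical helper: infer the dbms from the class of any DBIConnection.
dbms <- function(connection) {
  # Prefer an explicit attribute when the connection carries one
  if (!is.null(attr(connection, "dbms"))) {
    return(attr(connection, "dbms"))
  }
  switch(class(connection)[1],
    "PqConnection"       = "postgresql", # RPostgres
    "RedshiftConnection" = "redshift",   # RPostgres
    "BigQueryConnection" = "bigquery",   # bigrquery
    "SQLiteConnection"   = "sqlite",     # RSQLite
    "duckdb_connection"  = "duckdb",     # duckdb
    stop("Unable to determine dbms from connection class: ", class(connection)[1])
  )
}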

This is a start and outlines the high level implementation of the refactoring I'm proposing. I'm sure many other details will be uncovered.

Drawbacks

"It works now so why change something that isn't broken?"

Well, broken isn't an all-or-nothing thing. I've had problems with the backend driver implementations in the past and would have benefited from being able to try a different DBI backend. Often the issues are so technical, obscure, or environment-specific that it is impossible to describe them in a way that lets the developers create a fix. I'm probably not the only one who has run into this, and we have to remember that users might not know how or where to create an issue, or might assume they are doing something wrong and move on.

"This change will require changes across many OHDSI R packages"

This is true, but I think we can continue to support most of the current DatabaseConnector interface. I will have to understand how OHDSI tools are making use of DatabaseConnectorConnections and how to swap in any DBI connection to make this work. I think it will be similar to the switch from ff to Andromeda. It will be easier with coordination from all package maintainers that make use of DatabaseConnector.

"Some functionality might be lost"

DatabaseConnector has the ability to batch SQL statements together and send a whole batch at once to database systems that support it. This change would require the batching functionality to be exposed by the back end to the front end through the DBI interface. Basically, we are passing everything through DBI, so any backend driver functionality that is not supported through that interface would need to be dropped. SQL batching might be tricky, as would handling of 64-bit integers. DatabaseConnector would offload these decisions to the DBI backend, which might be DatabaseConnectorDriver or some other DBI backend.
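For reference, the closest standard DBI analogue to JDBC statement batching is a single prepared statement bound to many parameter rows via dbBind(). The sketch below is illustrative only (the helper name and table are assumptions, the placeholder syntax varies by backend, and this is not a drop-in replacement for DatabaseConnector's current batching):

# Sketch: batch-style execution through the DBI interface using dbBind().
# The "?" placeholder syntax is backend-specific (e.g. RPostgres uses $1, $2).
insertCohortRowsBatched <- function(connection, rows) {
  res <- DBI::dbSendStatement(
    connection,
    "INSERT INTO cohort (cohort_definition_id, subject_id) VALUES (?, ?)"
  )
  on.exit(DBI::dbClearResult(res))
  # rows is a data frame whose columns match the placeholders, in order
  DBI::dbBind(res, unname(as.list(rows)))
  DBI::dbGetRowsAffected(res)
}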

"This increases the degrees of freedom in our codebase"

There are downsides to taking on dependencies. In this case we are not so much taking on dependencies as using an interface that would allow OHDSI users to swap in whatever driver they choose. As long as we only rely on DBI this should work fine. However this means less overall control over the driver component of the software stack.

Rationale and Alternatives

The main alternative is to keep the current DatabaseConnector implementation. For the duckdb issue, we would wrap the duckdb connection in a DatabaseConnectorDbiConnection object, similar to how SQLite was handled. For the dplyr integration issue, we would need to discover the specific issues with using dplyr on DatabaseConnectorJdbcConnection objects, and for DatabaseConnectorDbiConnection objects we are adding an [(unnecessary?) layer on top of DBI](https://github.com/OHDSI/DatabaseConnector/blob/4832d2080dfd57648c680035a58152fd6e6a9954/R/DBI.R#L238).

Why is this design the best in the space of possible designs?

It aligns OHDSI's use of database connections from R with the standard database interface in R. Utilizing a standard interface provides many benefits such as the ability to change components (like the driver) without breaking the whole system.

If other database backends work well then I think it will greatly reduce the maintenance required for DatabaseConnector especially if we can deprecate support for drivers and simply use drivers supported by other organizations.

Unresolved questions

I'll post questions here as they come up. This is an open discussion so please post your questions/comments here as well.

Tagging some of the Hades developers who might be interested or have comments on this: @chrisknoll, @anthonysena, @azimov, @schuemie

schuemie commented 2 years ago

Thanks! I agree with most of what you propose. Some further items for discussion:

ablack3 commented 2 years ago

Thanks Martijn!

  • The name 'dbConnectDetails' breaks our style guide. What is wrong with 'createConnectionDetails'?

We can use createConnectionDetails but I'm not sure how to change the signature to createConnectionDetails(drv, ...) without breaking existing code. I guess I'm trying to make sure existing code continues to work with the new version of DatabaseConnector. Should the refactoring make sure existing code continues to work?

  • The createConnectionDetails() function has a feature where it does not store user credentials in memory, but rather holds references to the code by which those credentials can be obtained. We should be sure to keep that feature in the new version.

100% agree

  • Why would we still need the JDBC drivers if all DBMSs now have their own R packages?

The idea is to stop supporting the JDBC drivers to make this package easier to maintain. Again I'm thinking about making a smooth transition without breaking things so keeping the JDBC drivers would allow users to essentially opt in to the new behavior. However it would be easier and faster to drop the backend support altogether.

Here are the backend driver packages that I guess we would use.

Microsoft SQL Server -> RSQLServer
Oracle -> ROracle
PostgreSQL -> RPostgres
Microsoft Parallel Data Warehouse (a.k.a. Analytics Platform System) -> RSQLServer??
Amazon Redshift -> RPostgres
Apache Impala -> implyr
Google BigQuery -> bigrquery
IBM Netezza -> ???
SQLite -> RSQLite
Spark -> sparklyr

There is still a lot of uncertainty about what issues might crop up when we make this switch so I want to be careful. I'm thinking we could continue to support all dbms backends using jdbc in DatabaseConnector for a while and slowly deprecate them as we all gain confidence that an existing backend implementation will work as a replacement.

  • We still want HADES to be thoroughly tested, which becomes harder if people are allowed to bring their own DBI driver. I recommend we have a fixed set of 'supported' DBI drivers, that we also use for testing all the HADES packages.

We can allow users to plug in any driver and thoroughly test a set of driver backends. I think the plug-ability aspect is important and is an example of the open-closed principle. If a new driver backend comes along, users can immediately try it out without waiting on developers. We do this now with JDBC: we point users to the JDBC driver we test on but don't bundle a specific JDBC driver with DatabaseConnector. We make it easy to use the tested JDBC driver but don't prohibit the use of a different one.

  • Thinking further ahead, any renv lock file for a study we wish to execute should contain this set of drivers, so the study can run at each site without requiring installation of additional packages.

I think I understand what you mean. Different sites would use different drivers but the whole set of tested driver backends would be installed along with all other dependencies when running renv::restore().

ablack3 commented 2 years ago

I pushed my work so far (which is quite a mess still) and created a draft PR so that the diff could be easily viewed. https://github.com/OHDSI/DatabaseConnector/pull/186/files

I was having a hard time keeping my head on straight with all the changes so I also explored an alternative route where I just start from scratch and only implement the front end.

https://github.com/ablack3/DatabaseConnector2

ablack3 commented 2 years ago

Here is a new implementation of createConnectionDetails and connect that should work with any DBI driver backend and delay evaluation of arguments until a connection is made.

# Capture the driver and all connection arguments as quosures so that
# evaluation is delayed until connect() is called
createConnectionDetails <- function(drv, ...) {
  result <- rlang::enquos(drv, ...)
  class(result) <- c("connectionDetails")
  result
}

# Evaluate the stored quosures and splice them into DBI::dbConnect()
connect <- function(connectionDetails) {
  l <- purrr::map(connectionDetails, rlang::eval_tidy)
  rlang::inject(DBI::dbConnect(!!!l))
}

s <- ":memory:"
cd <- createConnectionDetails(RSQLite::SQLite(), s)
print(cd)
#> [[1]]
#> <quosure>
#> expr: ^RSQLite::SQLite()
#> env:  global
#> 
#> [[2]]
#> <quosure>
#> expr: ^s
#> env:  global
#> 
#> attr(,"class")
#> [1] "connectionDetails"

con <- connect(cd)
DBI::dbWriteTable(con, "iris", iris)
DBI::dbGetQuery(con, "select * from iris limit 5")
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1          5.1         3.5          1.4         0.2  setosa
#> 2          4.9         3.0          1.4         0.2  setosa
#> 3          4.7         3.2          1.3         0.2  setosa
#> 4          4.6         3.1          1.5         0.2  setosa
#> 5          5.0         3.6          1.4         0.2  setosa
DBI::dbDisconnect(con)

Created on 2022-07-01 by the reprex package (v2.0.1)

The signature of createConnectionDetails when using the current DatabaseConnector drivers would change to something like

 details <- createConnectionDetails(
    DatabaseConnectorDriver(),
    dbms = "postgresql",
    user = "postgres",
    password = Sys.getenv("PASSWORD"),
    server = "localhost"
  )
  connection <- connect(details)

ablack3 commented 2 years ago

executeSql and querySql have been modified to use any DBI connection. https://github.com/ablack3/DatabaseConnector/blob/issue185/R/frontend-Sql.R

Testing on duckdb here

I'm thinking this version of DatabaseConnector will depend on SqlRender but won't strictly enforce a list of supported dbms unless it has to. executeSql should work with any dbms since it only requires a DBI connection. renderTranslateExecuteSql on the other hand will only work with dbms that are supported by SqlRender.
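For illustration, here is a minimal sketch (not the branch's actual code) of how such a front-end helper could be built purely on SqlRender plus the DBI generics; the dbms() helper is the assumed function discussed earlier in this proposal:

# Sketch only: render, translate, split, and execute using DBI generics.
renderTranslateExecuteSql <- function(connection, sql, ...) {
  sql <- SqlRender::render(sql, ...)
  sql <- SqlRender::translate(sql, targetDialect = dbms(connection))
  statements <- SqlRender::splitSql(sql)
  for (statement in statements) {
    DBI::dbExecute(connection, statement)
  }
  invisible(NULL)
}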

schuemie commented 2 years ago

I like the direction this is going, but it is all based on an assumption that I'm not yet comfortable with: that the DBI interface is enough of a standard across all platforms. Currently the main role of DatabaseConnector is to make sure no other HADES package has code that looks like if (dbms == "oracle") do this, else do that. All platform-specific logic should be contained in DatabaseConnector. I want to be 100% sure it will stay that way.

Now that you've progressed this far, would it be possible to set up an experiment where we test this assumption? I'm thinking of two components we could test: 1. SQL execution, and 2. dbplyr execution. For (1), we should now be able to run, for example, FeatureExtraction using DBI connections. For (2), we'd need to create a reasonably complex dbplyr query (maybe including compute() for added complexity). We could then test these at least across the platforms we have testing environments for, which include SQL Server, PostgreSQL, Oracle, Redshift, Spark, and BigQuery.

ablack3 commented 2 years ago

Yep. I'll implement those tests on the environments you listed. I'd also like to use duckdb to showcase how we can easily add new backends with this approach. But duckdb would need to be added to SqlRender for FeatureExtraction to work. I think the SQL syntax should be the same as postgres.

There will probably be some functionality that is going to stay driver specific - like bulk loading.

ablack3 commented 2 years ago

Well I got a mostly working simple test with RPostgres! It took quite a while but here is the reprex. (It requires the test environment variables but it should work with only small modifications on any CDM with drug era populated). I had to make a change to FeatureExtraction to get it to work. I swapped out dbms(connection) for attr(connection, "dbms"). Lots more to do to get all the tests working but it's a start.

devtools::install_github("ablack3/DatabaseConnector", ref = "issue185", upgrade = "never", force = T)
#> Using github PAT from envvar GITHUB_PAT
#> Downloading GitHub repo ablack3/DatabaseConnector@issue185
#> * checking for file ‘/private/var/folders/xx/01v98b6546ldnm1rg1_bvk000000gn/T/RtmpGekit3/remotesd30f70dae5b9/ablack3-DatabaseConnector-496df2c/DESCRIPTION’ ... OK
#> * preparing ‘DatabaseConnector’:
#> * checking DESCRIPTION meta-information ... OK
#> * checking for LF line-endings in source and make files and shell scripts
#> * checking for empty or unneeded directories
#> * building ‘DatabaseConnector_5.0.4.tar.gz’
devtools::install_github("ablack3/FeatureExtraction", ref = "dbms_function", upgrade = "never", force = T)
#> Using github PAT from envvar GITHUB_PAT
#> Downloading GitHub repo ablack3/FeatureExtraction@dbms_function
#> * checking for file ‘/private/var/folders/xx/01v98b6546ldnm1rg1_bvk000000gn/T/RtmpGekit3/remotesd30f4c8bf5f2/ablack3-FeatureExtraction-ff98a71/DESCRIPTION’ ... OK
#> * preparing ‘FeatureExtraction’:
#> * checking DESCRIPTION meta-information ... OK
#> * checking for LF line-endings in source and make files and shell scripts
#> * checking for empty or unneeded directories
#> * building ‘FeatureExtraction_3.2.0.tar.gz’
library(DatabaseConnector)

library(DBI)
details <- createConnectionDetails(
  RPostgres::Postgres(),
  host     = strsplit(Sys.getenv("CDM5_POSTGRESQL_SERVER"), "/")[[1]][[1]],
  dbname   = strsplit(Sys.getenv("CDM5_POSTGRESQL_SERVER"), "/")[[1]][[2]],
  user     = Sys.getenv("CDM5_POSTGRESQL_USER"),
  password = Sys.getenv("CDM5_POSTGRESQL_PASSWORD"),
  port     = 5432
)
con <- connect(details)
schema <- Sys.getenv("CDM5_POSTGRESQL_CDM_SCHEMA")

DBI::dbIsValid(con)
#> [1] TRUE
dbms(con)
#> [1] "postgresql"

# dbGetQuery(con, "select count(*) from cdmv5.drug_era where drug_concept_id = 903963")

dbExecute(con, 
          "create temporary table tmp_cohort as 
           select 
           1 as cohort_definition_id, 
           person_id as subject_id, 
           drug_era_start_date as cohort_start_date,
           drug_era_end_date as cohort_end_date
           from cdmv5.drug_era where drug_concept_id = 903963")
#> [1] 246

ch <- dbGetQuery(con, "select * from tmp_cohort")

library(FeatureExtraction)
#> Loading required package: Andromeda
#> Loading required package: dplyr
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

# settings <- createDefaultCovariateSettings()

settings <- createCovariateSettings(useDemographicsAgeGroup = T)

covData <- getDbCovariateData(connection = con, 
                              cdmDatabaseSchema = Sys.getenv("CDM5_POSTGRESQL_CDM_SCHEMA"),
                              covariateSettings = settings, 
                              cohortTableIsTemp = TRUE,
                              cohortTable = "tmp_cohort",
                              aggregated = TRUE)
#> Currently in a tryCatch or withCallingHandlers block, so unable to add global calling handlers. ParallelLogger will not capture R messages, errors, and warnings, only explicit calls to ParallelLogger.
#> Constructing features on server
#> NOTICE:  table "cov_ref" does not exist, skipping
#> NOTICE:  table "analysis_ref" does not exist, skipping
#> Currently in a tryCatch or withCallingHandlers block, so unable to add global calling handlers. ParallelLogger will not capture R messages, errors, and warnings, only explicit calls to ParallelLogger.
#> Fetching data from server
#> Warning in result_create(conn@ptr, statement, immediate): Closing open result
#> set, cancelling previous query

#> Warning in result_create(conn@ptr, statement, immediate): Closing open result
#> set, cancelling previous query
#> Currently in a tryCatch or withCallingHandlers block, so unable to add global calling handlers. ParallelLogger will not capture R messages, errors, and warnings, only explicit calls to ParallelLogger.
#> Fetching data took 0.676 secs
#> Warning in result_create(conn@ptr, statement, immediate): Closing open result
#> set, cancelling previous query

createTable1(covData)
#>   Characteristic % (n = 246) Characteristic % (n = 246)
#> 1      Age group                   65 -  69        14.6
#> 2       30 -  34         0.4       70 -  74        17.1
#> 3       35 -  39         2.8       75 -  79        22.0
#> 4       40 -  44         1.2       80 -  84        14.2
#> 5       45 -  49         2.0       85 -  89        12.6
#> 6       50 -  54         1.2       90 -  94         6.1
#> 7       55 -  59         0.8      100 - 104         0.8
#> 8       60 -  64         4.1
disconnect(con)

Created on 2022-07-08 by the reprex package (v2.0.1)

ablack3 commented 2 years ago

I had some confusion around creating temp tables in SQL server with odbc drivers. Adding the following links for reference.

https://colorado.rstudio.com/rsc/content/7389/sqlserver_temp_tables.nb.html
https://dbi.r-dbi.org/reference/dbexecute#specification-for-the-immediate-argument
https://github.com/r-dbi/odbc/issues/127

ablack3 commented 2 years ago

Here is a test using odbc on SQL Server. I'm adding these tests to the extras folder since they are reverse dependency tests (I don't think I can have unit tests in DatabaseConnector that depend on FeatureExtraction).

devtools::install_github("ablack3/DatabaseConnector", ref = "issue185", upgrade = "never", force = T)
#> Using github PAT from envvar GITHUB_PAT
#> Downloading GitHub repo ablack3/DatabaseConnector@issue185
#> * checking for file ‘/private/var/folders/xx/01v98b6546ldnm1rg1_bvk000000gn/T/RtmpNoCIfH/remotes9b166b341b64/ablack3-DatabaseConnector-c209058/DESCRIPTION’ ... OK
#> * preparing ‘DatabaseConnector’:
#> * checking DESCRIPTION meta-information ... OK
#> * checking for LF line-endings in source and make files and shell scripts
#> * checking for empty or unneeded directories
#> * building ‘DatabaseConnector_5.0.4.tar.gz’
devtools::install_github("ablack3/FeatureExtraction", ref = "dbms_function", upgrade = "never", force = T)
#> Using github PAT from envvar GITHUB_PAT
#> Downloading GitHub repo ablack3/FeatureExtraction@dbms_function
#> * checking for file ‘/private/var/folders/xx/01v98b6546ldnm1rg1_bvk000000gn/T/RtmpNoCIfH/remotes9b163ee8829e/ablack3-FeatureExtraction-ff98a71/DESCRIPTION’ ... OK
#> * preparing ‘FeatureExtraction’:
#> * checking DESCRIPTION meta-information ... OK
#> * checking for LF line-endings in source and make files and shell scripts
#> * checking for empty or unneeded directories
#> * building ‘FeatureExtraction_3.2.0.tar.gz’

library(DatabaseConnector)
library(DBI)
details <- createConnectionDetails(
  odbc::odbc(),
  Driver   = "ODBC Driver 18 for SQL Server",
  Server   = Sys.getenv("CDM5_SQL_SERVER_SERVER"),
  Database = Sys.getenv("CDM5_SQL_SERVER_CDM_DATABASE"),
  UID      = Sys.getenv("CDM5_SQL_SERVER_USER"),
  PWD      = Sys.getenv("CDM5_SQL_SERVER_PASSWORD"),
  TrustServerCertificate="yes",
  Port     = 1433
)

options(warn=2)
con <- connect(details)
schema <- Sys.getenv("CDM5_SQL_SERVER_CDM_SCHEMA")

dbExecute(con, glue::glue(
          "select 
           1 as cohort_definition_id, 
           person_id as subject_id, 
           drug_era_start_date as cohort_start_date,
           drug_era_end_date as cohort_end_date
           into #tmp_cohort
           from {schema}.drug_era where drug_concept_id = 903963"), immediate = T)
#> [1] 246

ch <- dbGetQuery(con, "select * from #tmp_cohort;")
nrow(ch)
#> [1] 246

library(FeatureExtraction)
#> Loading required package: Andromeda
#> Loading required package: dplyr
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
settings <- createDefaultCovariateSettings()
# settings <- createCovariateSettings(useDemographicsAgeGroup = T)

covData <- getDbCovariateData(connection = con, 
                              cdmDatabaseSchema = schema,
                              covariateSettings = settings, 
                              cohortTableIsTemp = TRUE,
                              cohortTable = "#tmp_cohort",
                              aggregated = TRUE)
#> Currently in a tryCatch or withCallingHandlers block, so unable to add global calling handlers. ParallelLogger will not capture R messages, errors, and warnings, only explicit calls to ParallelLogger.
#> Constructing features on server
#> Currently in a tryCatch or withCallingHandlers block, so unable to add global calling handlers. ParallelLogger will not capture R messages, errors, and warnings, only explicit calls to ParallelLogger.
#> Fetching data from server
#> Currently in a tryCatch or withCallingHandlers block, so unable to add global calling handlers. ParallelLogger will not capture R messages, errors, and warnings, only explicit calls to ParallelLogger.
#> Fetching data took 1.42 secs

createTable1(covData)
#>                                Characteristic % (n = 246)
#> 1                                   Age group            
#> 2                                    30 -  34         0.4
#> 3                                    35 -  39         2.8
#> 4                                    40 -  44         1.2
#> 5                                    45 -  49         2.0
#> 6                                    50 -  54         1.2
#> 7                                    55 -  59         0.8
#> 8                                    60 -  64         4.1
#> 9                                    65 -  69        14.6
#> 10                                   70 -  74        17.1
#> 11                                   75 -  79        22.0
#> 12                                   80 -  84        14.2
#> 13                                   85 -  89        12.6
#> 14                                   90 -  94         6.1
#> 15                                  100 - 104         0.8
#> 16                             Gender: female        55.7
#> 17                                       Race            
#> 18           race = Black or African American         9.8
#> 19                               race = White        82.9
#> 20                                  Ethnicity            
#> 21             ethnicity = Hispanic or Latino         1.6
#> 22         ethnicity = Not Hispanic or Latino        92.7
#> 23                   Medical history: General            
#> 24                  Acute respiratory disease        32.1
#> 25   Attention deficit hyperactivity disorder         0.4
#> 26                      Chronic liver disease         4.9
#> 27           Chronic obstructive lung disease        36.6
#> 28                            Crohn's disease         2.8
#> 29                                   Dementia        18.7
#> 30                        Depressive disorder        25.6
#> 31                          Diabetes mellitus        63.4
#> 32            Gastroesophageal reflux disease        24.0
#> 33                Gastrointestinal hemorrhage        15.0
#> 34     Human immunodeficiency virus infection         2.4
#> 35                             Hyperlipidemia        67.1
#> 36                      Hypertensive disorder        76.8
#> 37                            Lesion of liver         3.3
#> 38                                    Obesity         9.3
#> 39                             Osteoarthritis        54.1
#> 40                                  Pneumonia        23.6
#> 41                                  Psoriasis         0.4
#> 42                           Renal impairment        35.0
#> 43                       Rheumatoid arthritis        11.8
#> 44                              Schizophrenia         8.9
#> 45           Urinary tract infectious disease        34.6
#> 46                          Viral hepatitis C         3.3
#> 47                     Visual system disorder        57.7
#> 48    Medical history: Cardiovascular disease            
#> 49                        Atrial fibrillation        36.6
#> 50                    Cerebrovascular disease        19.9
#> 51                  Coronary arteriosclerosis        44.3
#> 52                              Heart disease        69.9
#> 53                              Heart failure        36.2
#> 54                     Ischemic heart disease        32.1
#> 55                Peripheral vascular disease        52.0
#> 56                         Pulmonary embolism         5.7
#> 57                          Venous thrombosis        17.1
#>                                             Characteristic % (n = 246)
#> 1                               Medical history: Neoplasms            
#> 2                                     Hematologic neoplasm        17.9
#> 3                                       Malignant lymphoma         7.3
#> 4                          Malignant neoplasm of anorectum         2.4
#> 5                             Malignant neoplastic disease        51.2
#> 6                                Malignant tumor of breast        13.4
#> 7                                 Malignant tumor of colon         8.1
#> 8                                  Malignant tumor of lung         9.3
#> 9                       Malignant tumor of urinary bladder         6.5
#> 10                  Primary malignant neoplasm of prostate        16.3
#> 11                                          Medication use            
#> 12           Agents acting on the renin-angiotensin system        56.9
#> 13                         Antibacterials for systemic use        59.8
#> 14                                         Antidepressants        53.7
#> 15                                          Antiepileptics        37.0
#> 16             Antiinflammatory and antirheumatic products        38.2
#> 17                                   Antineoplastic agents        22.4
#> 18                                          Antipsoriatics         0.4
#> 19                                   Antithrombotic agents        45.9
#> 20                                    Beta blocking agents        54.1
#> 21                                Calcium channel blockers        48.4
#> 22                                               Diuretics        60.6
#> 23                        Drugs for acid related disorders        41.9
#> 24                   Drugs for obstructive airway diseases       100.0
#> 25                                  Drugs used in diabetes        45.1
#> 26                                      Immunosuppressants         7.3
#> 27                                  Lipid modifying agents        61.4
#> 28                                                 Opioids        38.6
#> 29                                           Psycholeptics        48.0
#> 30   Psychostimulants, agents used for adhd and nootropics        16.7
#> 31                                                                    
#> 32                                          Characteristic       Value
#> 33                              Charlson comorbidity index            
#> 34                                                    Mean         7.6
#> 35                                          Std. deviation         5.8
#> 36                                                 Minimum         0.0
#> 37                                         25th percentile         2.0
#> 38                                                  Median         7.0
#> 39                                         75th percentile        12.0
#> 40                                                 Maximum        24.0
#> 41                                              CHADS2Vasc            
#> 42                                                    Mean         4.7
#> 43                                          Std. deviation         2.2
#> 44                                                 Minimum         0.0
#> 45                                         25th percentile         3.0
#> 46                                                  Median         5.0
#> 47                                         75th percentile         6.0
#> 48                                                 Maximum         9.0
#> 49                                                    DCSI            
#> 50                                                    Mean         6.5
#> 51                                          Std. deviation         3.0
#> 52                                                 Minimum         0.0
#> 53                                         25th percentile         2.0
#> 54                                                  Median         7.0
#> 55                                         75th percentile        10.0
#> 56                                                 Maximum        13.0
#> 57

disconnect(con)

Created on 2022-07-15 by the reprex package (v2.0.1)

ablack3 commented 2 years ago

What should I do about the message "Currently in a tryCatch or withCallingHandlers block, so unable to add global calling handlers. ParallelLogger will not capture R messages, errors, and warnings, only explicit calls to ParallelLogger." if anything?

schuemie commented 2 years ago

Those messages should originate from calls to ParallelLogger within a try...catch block. Is that correct?

In general this is a bit of an annoyance caused by switching to the new and fancy messaging hooks in R >= 4. I've created this and this issue to attempt to soften the pain.

ablack3 commented 2 years ago

@edward-burn @schuemie, I think we could start using this branch in darwin-eu to get some experience with it. What I've already implemented should be enough to support all of the Hades packages on the Darwin roadmap. This will give us some valuable feedback about how well the DBI drivers work with Hades.

schuemie commented 2 years ago

Just for my understanding: the goal of this change would be to phase out the JDBC drivers in favor of DBI drivers, while keeping the top-level interface the same, right? So this would not require any changes in any of the HADES packages, except using the new dbms() function.

ablack3 commented 2 years ago

@schuemie, The goal of this change is to allow Hades users to choose their driver. For example if a user wants to use odbc or RPostgres they would be able to do that because DatabaseConnector would work with DBIConnections instead of wrapping them. The goal is to separate the functionality that rests on top of DBI from the functionality that belongs behind the DBI interface.

The problem is that I don't think all of DatabaseConnector's interface can be supported by any DBI driver so this would mean that some functions would only work with a DatabaseConnectorDriver and not other drivers. As an example consider insertTable. Even though there is currently an implementation for a generic DBIConnection some of the arguments are not really supported.

The change is that Hades users would be working with DBIConnections, which might be DatabaseConnectorConnections but could also be other implementations of DBI. For many (maybe most) functions I think this would work just fine if we change from using the dbms slot to using the dbms() function. However, insertTable's interface would not be supported with RPostgres or odbc drivers, for example.

If this isn't the direction you want to take DatabaseConnector, that's fine with me. It just means that users can only use Hades with JDBC drivers, and adding new databases requires a new connection class that wraps the one provided by DBI. There are pros and cons. In this branch I've been trying to implement a version of DatabaseConnector that depends on the DBI interface instead of on specific drivers (basically applying the dependency inversion principle).

Alternative directions we could go in are

The purpose of DatabaseConnector is to allow users to run the same code on all supported database systems with predictable results and for this it makes sense to use jdbc drivers. However it would be nice if the analytic tools did not force users into a particular driver and allowed for new backend driver implementations to be used without any changes to the codebase.

schuemie commented 2 years ago

I thought the idea was to keep DatabaseConnector as the high-level interface, providing the 'code once, run everywhere' functionality we need in OHDSI, while allowing DBI drivers to be used instead of JDBC drivers at a lower level. So OHDSI users would, for example, call insertTable(), since that would work the same across all platforms, and although by default that would use DBI's dbWriteTable(), we would add extra code for special cases, like using the CTAS hack for better performance on Redshift, emulating temp tables for Oracle, and supporting bulk loading for Postgres.
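A sketch of the dispatch pattern described here (the branch structure and the helper name insertTableUsingCtas are assumptions for illustration, not existing functions; dbms() is the assumed helper discussed above):

# Sketch: default to DBI's dbWriteTable(), with platform-specific special cases.
insertTable <- function(connection, tableName, data, ...) {
  if (dbms(connection) == "redshift") {
    # Hypothetical helper: a CTAS-based path for better performance on Redshift.
    insertTableUsingCtas(connection, tableName, data, ...)
  } else {
    DBI::dbWriteTable(connection, tableName, data, ...)
  }
}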

One thing I'm unsure about is if we could just support any DBI driver, or whether we must prespecify the set of DBI drivers that work with DatabaseConnector. A good example might be ODBC: you can talk to a SQL Server server with the DBI SQL Server driver, or with the DBI ODBC driver (once an ODBC connection has been defined), but I think I remember the supported SQL dialect is actually different? I wouldn't be surprised if things broke if you tried to swap one for the other.

ablack3 commented 2 years ago

I thought the idea was to keep DatabaseConnector as the high-level interface, providing the 'code once, run everywhere' functionality we need in OHDSI, while allowing DBI drivers to be used instead of JDBC drivers at a lower level.

Yes this is the core idea of DatabaseConnector.

One thing I'm unsure about is if we could just support any DBI driver, or whether we must prespecify the set of DBI drivers that work with DatabaseConnector

What I've been implementing is the use of the DBI interface inside the DatabaseConnector front end, and moving the DatabaseConnector back end behind the DBI driver interface. The branch allows the user of DatabaseConnector to choose which DBI driver they want to use by passing the driver into createConnectionDetails. By default the DatabaseConnector DBI driver would be chosen, so current code would continue to work rather than break. We would test with a specific set of drivers, but the interface would allow any DBI driver to be passed in. We would not wrap DBIConnections; we would completely remove the DatabaseConnectorDbiConnection wrapper class and get rid of passthrough methods like https://github.com/OHDSI/DatabaseConnector/blob/15dfa5bc33b3b66855ab34d4708b580b2d567737/R/DBI.R#L238

Adding a new database like duckdb or snowflake would (in principle) not require any changes because users would simply pass in the driver that is implemented and maintained by someone else. This is probably optimistic as I have noticed that the DBI interface implementation differs between drivers slightly.

A good example might be ODBC: You can talk to a SQL Server server with the DBI SQL Server driver, or with the DBI ODBC driver (once an ODBC connection has been defined), but I think I remember the supported SQL dialect is actually different?

I don't know why one driver would support a different sql dialect than another for the same database. The job of the driver is simply to deliver the SQL to the database and return the result but I could be wrong.

My main question is if Hades should only work with DatabaseConnectorConnections or if it should work with DBIConnections more generally. This seems like a decision for @schuemie. I think a lot of what is needed by analytic packages can be supported by the DBI interface. However there are slight differences between drivers and implementations of DBI. I think that the current jdbc drivers should probably be maintained long term as an implementation of DBI designed to work seamlessly across multiple dbms. This would mean that all of the current functionality would be supported but other drivers would also be allowed.

Hades currently doesn't really encourage/allow taking just the parts you need and leaving the rest. Consider the decision to adopt Eunomia into a codebase that is concerned about dependencies. Eunomia is just static data in a SQLite file but requires DatabaseConnector, SqlRender, and rJava as dependencies. None of those are really needed to use Eunomia. DatabaseConnector is wrapping RSQLite, but RSQLite is providing the driver, and it would be possible to create a version of Eunomia that would be just as useful without any of those dependencies.

schuemie commented 1 year ago

My main question is if Hades should only work with DatabaseConnectorConnections or if it should work with DBIConnections more generally.

A core requirement for HADES packages is that they work across all supported platforms. I may be slow, but I'm still not seeing how one could achieve that if people could provide a HADES package with just any DBI connection. Imagine I want to run PLP, and I use RMySql to connect to my MySQL database. PLP doesn't know how to generate SQL for MySQL, so couldn't do anything but throw an error. Even if I use a supported platform like Oracle, PLP would need DatabaseConnector's functionality to get rid of non-SQL discrepancies such as temp tables, upper or lower-case table and field names, etc. Do we really want all HADES packages to handle compatibility of the provided DBI connection with DatabaseConnector, or should we let DatabaseConnector handle that (by enforcing that all connections are DatabaseConnectorConnections)?

So my proposal would be that all HADES packages work with DatabaseConnectorConnections, and that under the hood we slowly replace the JDBC drivers with DBI ones.

schuemie commented 1 year ago

After discussing with @ablack3 for a while, I’d like to propose the following approach:

A truly platform-agnostic connection wrapper

The main function of the DatabaseConnectorConnection class will be to wrap DBI connections. DBI connections provide some standardization (in terms of R methods), but still display idiosyncratic behavior in other areas, such as the different SQL needed for different platforms, but also whether output column names are upper case or lower case, etc. The purpose of the DatabaseConnectorConnection wrapper is to hide all those differences, so we get a connection object that is truly abstract, with the exact same behavior independent of which database platform you're using. HADES developers can rely on the DatabaseConnectorConnection to write code once, and expect it to run everywhere (which is not possible by relying on DBI alone).

DBI compliance and dbplyr

The new DatabaseConnectorConnection should be 100% DBI compliant. Our current implementation already is, but DBI functions like dbExecute() currently do not use SQL translation. I propose the new DBI implementation does apply SQL translation for DBI functions, so that even at the DBI level the DatabaseConnectorConnection is platform-agnostic. A nice benefit of this approach is that we can use dbplyr with this new DBI implementation. The dbplyr backend will need to know how to translate the dplyr statements to OHDSI SQL, which should be easy because OHDSI SQL is basically SQL Server SQL. Under the hood, when dbplyr calls the DBI functions of DatabaseConnectorConnection, these SQL statements get translated to the required SQL dialect of the platform we're running on. Although it seems complex that dbplyr generates SQL in one dialect, which we then translate to another dialect, instead of directly generating the appropriate dialect, the advantage is that we only need one code base for translation (SqlRender).
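A sketch of what translation inside the DBI methods could look like. The class definition, slot names, and method body below are illustrative assumptions, not the actual implementation:

library(DBI)

# Illustrative stub standing in for the proposed wrapper class; the real
# DatabaseConnectorConnection class and its slots may differ.
setClass("DatabaseConnectorConnection",
         contains = "DBIConnection",
         slots = c(dbiConnection = "DBIConnection", dbms = "character"))

# dbExecute() on the wrapper translates OHDSI SQL to the target dialect before
# delegating to the wrapped driver connection.
setMethod("dbExecute", signature("DatabaseConnectorConnection", "character"),
  function(conn, statement, ...) {
    translatedSql <- SqlRender::translate(statement, targetDialect = conn@dbms)
    DBI::dbExecute(conn@dbiConnection, translatedSql, ...)
  })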

DBI under the hood

Currently, DatabaseConnector only uses a DBI driver to talk to SQLite. All other platforms are directly supported using JDBC. Another change we could make is to wrap every JDBC driver in its own DBI driver, which in turn can be wrapped by DatabaseConnectorConnection. DatabaseConnectorConnection will then become a clean DBI wrapper, and one of its constructors could simply take a DBI connection (any DBI connection) as input. However, I suspect we will find that we want to have a fixed set of DBI drivers that we support and have tested. One advantage of this approach is that we can start switching our JDBC drivers for existing DBI drivers whenever we find the DBI drivers are mature enough.

Helper functions

DatabaseConnector currently has several functions I'll refer to as 'helper functions'. These include executeSql() and querySql(). The main advantage they provide is that if SQL execution throws an error, an error report is written, which greatly facilitates remote debugging. executeSql() also nicely splits multi-statement SQL and provides a progress bar. We can choose to have these helper functions be part of the DatabaseConnectorConnection interface, or just have them be separate functions in the DatabaseConnector package that can be applied to any DBI connection. I would argue that, for convenience, we should add them to the DatabaseConnectorConnection interface (which inherits from DBI). I don't see any added value in having them completely separate, since then we cannot be guaranteed we're working on a DatabaseConnectorConnection.

Overview

I’ve tried to depict this approach in the picture below, showing our current SQL Server JDBC driver wrapped in a DBI interface. I also show the RPostgreSQL package, which could completely replace our Postgres JDBC driver. Both DBI interfaces are further wrapped in DatabaseConnectorConnection, providing a truly database-agnostic interface. Both dbplyr and HADES tools would talk to this layer, and not have to worry about the fact that at different sites there will be different database platforms they’re talking to.

[image: diagram of the proposed layering described above]

ablack3 commented 1 year ago

Thanks @schuemie. I think we have a similar idea of the direction here.

HADES developers can rely on the DatabaseConnectorConnection to write code once, and expect it to run everywhere (which is not possible by relying on DBI alone).

It might be informative to think about why this is not possible using DBI alone because DBI is pretty close to being database agnostic. It just seems like there are slight differences in the implementations of the drivers. I kind of see DatabaseConnector as an extension of DBI that provides the "write once run everywhere" guarantee when using the DatabaseConnector DBI driver implementation.

DatabaseConnectorConnection will then become a clean DBI wrapper, and one of its constructors could simply take a DBI connection (any DBI connection) as input. However, I suspect we will find that we want to have a fixed set of DBI drivers that we support and have tested. One advantage of this approach is that we can start switching our JDBC drivers for existing DBI drivers whenever we find the DBI drivers are mature enough.

The ability for users to choose their driver in the same way that DBI allows users to choose their driver seems quite important to me. I believe this is a design pattern called dependency injection.

For next steps I'd like to try making rather small non-breaking changes to DatabaseConnector rather than trying to do one big change and see how it goes.

schuemie commented 1 year ago

It might be informative to think about why this is not possible using DBI alone

DBI provides standards for how to send and receive data to the server; it does not standardize what that data is, including what commands the server accepts. dbplyr tries to provide some standardization of the content (via various backends for the various databases), but even there I don't see any effort to make them exactly the same.

I think the root cause is that any standard still leaves ambiguity. Only in the context of actual running software (like HADES) can you make guarantees of universal behavior across platforms for the functionality the software supports.

The ability for users to choose their driver in the same way that DBI allows users to choose their driver seems quite important to me.

I know you think that is important, so I want to make sure to support it. However, I predict this will fail (badly), because new drivers will always do things slightly differently, in ways we can't foresee. I suspect we'll have a fixed set of DBI drivers we support, and any other driver will break things.

For next steps I'd like to try making rather small non-breaking changes

I would like to prioritize making DatabaseConnector's DBI interface truly abstract, so adding translation to dbExecute(), etc., so we can hook up dbplyr (with the SQL Server backend).

ablack3 commented 1 year ago

Here's my approach applied to getTableNames which now works with several DBI backends including DatabaseConnector, RPostgres, odbc, and duckdb. If it looks ok I'll add tests for the various drivers this function now supports. https://github.com/OHDSI/DatabaseConnector/pull/209
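For context, here is a sketch of the kind of DBI-only approach this relies on (the exact logic in the PR may differ, and dbListObjects() support varies somewhat by backend):

# Sketch: list tables using only DBI generics so any backend can be used.
getTableNames <- function(connection, databaseSchema = NULL) {
  if (is.null(databaseSchema)) {
    tables <- DBI::dbListTables(connection)
  } else {
    objects <- DBI::dbListObjects(connection, prefix = DBI::Id(schema = databaseSchema))
    ids <- objects$table[!objects$is_prefix]
    tables <- vapply(ids, function(id) unname(utils::tail(id@name, 1)), character(1))
  }
  tolower(tables)
}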

ablack3 commented 1 year ago

universal behavior across platforms

A couple places I've found differences between platforms:

SqlRender::translate("SELECT * INTO UPPERCASE_TABLE_NAME FROM data;", "bigquery")
#> [1] "CREATE TABLE uppercase_table_name  AS\nSELECT\n* \nFROM\ndata;"

library(DatabaseConnector)
con <- connect(dbms = "sqlite", server = ":memory:")
#> Connecting using SQLite driver
DBI::dbWriteTable(con, data.frame(not_a_date = 1), name = "tmp")

querySql(con, "select * from tmp") # similar behavior for impala 
#>   NOT_A_DATE
#> 1 1970-01-01

# DatabaseConnector's dbGetQuery does not make the type conversion based on the column name
DatabaseConnector::dbGetQuery(con, "select * from tmp") 
#>   not_a_date
#> 1          1

disconnect(con)

Created on 2022-11-02 with reprex v2.0.2

schuemie commented 1 year ago

looks to me like DatabaseConnector::querySql coerces any column with the substring "DATE" in the column name to a date type on Impala and SQLite, but not on any other supported platform.

Yes, that is because those platforms don't support dates, so this code is making them behave more similar to the other platforms, not different. (It's a hack, I know, but the only alternatives are: (a) not allow dates on any platform, or (b) not allow these platforms)

ablack3 commented 1 year ago

At first glance it looks like Impala supports a date type (https://impala.apache.org/docs/build/html/topics/impala_date.html), but maybe I'm missing some technical detail. As far as I know we don't have an Impala database to test against. I know SQLite does not have a date type, but the RSQLite package must have a way of handling this without using the column name.

library(DBI)
con <- dbConnect(RSQLite::SQLite(), extended_types = TRUE)
dbWriteTable(con, "tmp", data.frame(a_date = Sys.Date(), not_a_date = 1))
tibble::tibble(dbGetQuery(con, "select * from tmp"))
#> # A tibble: 1 × 2
#>   a_date     not_a_date
#>   <date>          <dbl>
#> 1 2022-11-03          1
dbDisconnect(con)

Created on 2022-11-03 with reprex v2.0.2

ablack3 commented 1 year ago

This was a fruitful exploration for me and I appreciate all the support and dialogue. I’m continuing development of the “front end” that works with any DBI connection in the CDMConnector package. I might get to present that work at OHDSI Europe.