OHDSI / DatabaseConnector

An R package for connecting to databases using JDBC.
http://ohdsi.github.io/DatabaseConnector/
54 stars 79 forks source link

BigQuery connection string helper functions #171

Open azimov opened 2 years ago

azimov commented 2 years ago

Creating BigQuery connection strings appears to be an important issue for users, ideally this should be completed using createConnectionDetails where possible, utilizing the extraSettings parameter.

@jdposada has created this custom package that includes a connection string building utility function.

Perhaps this could be included in the createConnectionDetails function by adding an additional call along these lines:

createConnectionDetails <- function(function(dbms,
                                    user = NULL,
                                    password = NULL,
                                    server = NULL,
                                    port = NULL,
                                    extraSettings = NULL,
                                    oracleDriver = "thin",
                                    connectionString = NULL,
                                    pathToDriver = Sys.getenv("DATABASECONNECTOR_JAR_FOLDER"), 
                                    ...) {) {

    ...

    if (is.null(connectionString) & dbms == "bigquery") {
         result$connectionString <- createBigQueryConnectionString(...)
    }
}

createBigQueryConnectionString <- function (**requiredAndOptionalBqParams**) {
    #return connection string
}
schuemie commented 2 years ago

We may want to think about giving each DBMS its own createConnectionDetails function, with arguments specific to that DBMS? So createConnectionDetailsBigQuery(), createConnectionDetailsSqlServer(), etc.. Looping in @ablack3 , who is working on some separation of the code for the different DBMSs.

jdposada commented 2 years ago

Hi @ablack3 what do you think, should we create a separate one?

ablack3 commented 2 years ago

The direction I've been trying to go in is to use DBI as much as possible. DatabaseConnector::connect() is essentially the same function as DBI::dbConnect(). The only difference is how the required data is passed in. In most other DBI backend implementations each dbms has a driver requires a set of dbms specific parameters. DatabaseConnector is supporting multiple dbms so it seems similar to the RJDBC or odbc

## S4 method for signature 'OdbcDriver' from the odbc package
dbConnect(
  drv,
  dsn = NULL,
  ...,
  timezone = "UTC",
  timezone_out = "UTC",
  encoding = "",
  bigint = c("integer64", "integer", "numeric", "character"),
  timeout = 10,
  driver = NULL,
  server = NULL,
  database = NULL,
  uid = NULL,
  pwd = NULL,
  dbms.name = NULL,
  .connection_string = NULL
)

.connection_string | A complete connection string
... Additional ODBC keywords, these will be joined with the other arguments to form the final connection string.

The .connection_string argument and the ... arguments give plenty of flexibility in creating the connection through a single interface.

My feeling is that we should keep a single createConnectionDetails function but add Jose's createBqConnectionString function into DatabaseConnector. Connecting to big query is different than other dbms but the following seems like a pretty good interface.

jsonPath <- "/Users/jdposada/.config/gcloud/application_default_credentials.json"
bqDriverPath <- "/Users/jdposada/BqJdbcDrivers"

connectionString <-  createBQConnectionString(projectId = "som-rit-starr",
                                              defaultDataset = "jdposada_explore",
                                              authType = 2,
                                              jsonCredentialsPath = jsonPath)

connectionDetails <- DatabaseConnector::createConnectionDetails(dbms="bigquery",
                                                                connectionString=connectionString,
                                                                user="",
                                                                password='',
                                                                pathToDriver = bqDriverPath)

# Create a connection
connection <- connect(connectionDetails)

# Test with a sql query to the concept table
sql = "select * from starr-omop-deid.concept limit 10;"
concepts <- querySql(connection, sql)

How about simply adding createBQConnectionString() to DatabaseConnector?

jdposada commented 2 years ago

hi @ablack3 ,

I am all for it. I may need to ask you where to put it, so I do not break anything.

ablack3 commented 2 years ago

Great! I suggest adding it to the end of the Connect.R file here. It will need Roxygen comments and the NAMESPACE file will need to be updated by running devtools::document(). I'm glad to do it if you prefer but also happy to help you create the PR. Either way.

jdposada commented 2 years ago

hi @ablack3 I created the pull request. I only added the function and the Roxigen comments. Could you check if everything looks good there? I would like to do incrementally and being sure I am doing every step right. I will do the devtools::document() when I confirm this first part is ok. Thanks a lot.

https://github.com/OHDSI/DatabaseConnector/pull/181

ablack3 commented 2 years ago

Looks good to me! Next I'd say you should add unit test for the function. I think I'd add it to the bottom of the test-connection.R file (I'm thinking we should test the functions in connection.R in the test-connection.R file)

The test would be something like


testthat::test_that("createBQConnectionString works", {

  jsonCredentials <- '{
  "client_id": "123451234512345.apps.googleusercontent.com",
  "client_secret": "secret",
  "quota_project_id": "project",
  "refresh_token": "12345",
  "type": "authorized_user"
  }'
  path <- tempfile(fileext = ".json")
  readr::write_lines(jsonCredentials, path)

  connectionString <-  createBQConnectionString(projectId = "project",
                                                defaultDataset = "dataset",
                                                authType = 2,
                                                jsonCredentialsPath = path)

  expectedOutput <- "jdbc:bigquery://https://www.googleapis.com/bigquery/v2:443;ProjectId=project;DefaultDataset=dataset;OAuthType=2;OAuthClientId=123451234512345.apps.googleusercontent.com;OAuthRefreshToken=12345;OAuthClientSecret=secret;Timeout=1000;AllowLargeResults=0;EnableHighThroughputAPI=1;UseQueryCache=1;LogLevel=0;"

  testthat::expect_equal(connectionString, expectedOutput)

})

If we had a bigquery test database then we could test it on a live database. I remember talking with Greg about setting that up but I'm not sure what happened with that conversation. It would be nice to set up at some point. Until then we can just test against a known connection string.

ablack3 commented 2 years ago

We may want to think about giving each DBMS its own createConnectionDetails function, with arguments specific to that DBMS? So createConnectionDetailsBigQuery(), createConnectionDetailsSqlServer(), etc.

My proposal is to use DBI::dbConnect() to make connections and let each DBI backend provides its own dbConnect arguments. For BigQuery the 'project' is a relevant argument that doesn't make sense for other databases. Instead of createConnectionDetailsBigQuery I would suggest createConnectionDetails(drv, ...) that would accept the same arguments as dbConnect(). These arguments are then be passed into dbConnect when a connection is made. Each driver backend is responsible for communicating the information it needs to make a connection and users could connect to an OMOP CDM with any DBI backend not just the jdbc drivers supplied by DatabaseConnector.

ablack3 commented 1 year ago

Here is the script I source whenever I connect to bigquery using DatabaseConnector. Huge thanks to Jose for writing it. I think it'd be great to add to DatabaseConnector because it's not easy to come up with this connection string as a new user.

tagging @aandresalvarez who might be interested.

# Created by: jdposada
# Created on: 3/27/20
library(jsonlite)

createBQConnectionString <- function(projectId='',
                                     defaultDataset='',
                                     authType = 2,
                                     jsonCredentialsPath = '',
                                     accountEmail = '',
                                     timeOut = 1000,
                                     logLevel = 0,
                                     logPath = '',
                                     proxyHost = '',
                                     proxyPort = '',
                                     proxyPwd = '',
                                     proxyUser = '',
                                     hostPort="jdbc:bigquery://https://www.googleapis.com/bigquery/v2:443"){

  #' Build a connection string using the JDBC symba driver
  #' https://www.simba.com/products/BigQuery/doc/JDBC_InstallGuide/content/jdbc/config-intro-online.htm
  #' @details
  #' This is a helper function to ease the process of creating a connection string for the JDBC BigQuery Driver.
  #'
  #' @param projectId This is the project where you will connect and the submitted SQL queries jobs will be executed. This is not neccesarilly the project where the datasets you are interested lives, but rather the project you have permissions to execute jobs.
  #' @param defaultDataset This is the dataset where tables referenced on SQL code without a project or dataset id in the name will be created.
  #' @param authType 0 to use a service account and 2 (default) to use a pregenerated refresh token.
  #' @param jsonCredentialsPath This is the path with the json credentials. It can be from the pregenerated refresh token
  #' or the service account
  #' @param accountEmail This is the account email for the service account if authType = 0
  #' @param timeOut The length of time, in seconds, that the driver waits for a query to retrieve the results of an executed job
  #' @param logLevel 0 for desabling 6 to log everything. Details here https://www.simba
  #' .com/products/BigQuery/doc/JDBC_InstallGuide/content/jdbc/options/loglevel.htm
  #' @param logPath The full path to the folder where the driver saves log files when logging is enabled.
  #' @param proxyHost The IP address or host name of your proxy server.
  #' @param proxyPort The listening port of your proxy server.
  #' @param proxyPwd The password, if needed, for proxy server settings.
  #' @param proxyUser The user name, if needed, for proxy server settings.
  #' @param hostPort   Host and port the driver is connecting. This parameter usually does not change and its default value work
  #'
  #' @export

  ## TODO: throwing exceptions for neccessary paramters and create an R package

  projectId <- paste0("ProjectId=", projectId)
  defaultDataset <- paste0("DefaultDataset=", defaultDataset)

  if (authType == 2){
    credentials <- jsonlite::fromJSON(jsonCredentialsPath)
    clientId <- paste0("OAuthClientId=", credentials$client_id)
    refreshToken <- paste0("OAuthRefreshToken=", credentials$refresh_token)
    clientSecret <- paste0("OAuthClientSecret=", credentials$client_secret)
    keyPath <- ''
    accountEmail <- ''
  }
  else if (authType == 0){
    clientId <- ''
    refreshToken <- ''
    clientSecret <- ''
    keyPath <- paste0("OAuthPvtKeyPath=", jsonCredentialsPath)
    accountEmail <- paste0("OAuthServiceAcctEmail=", accountEmail)
  }
  else{
    print('Only service account or pregenerated tokens supported')
    return(FALSE)
  }

  authType <- paste0("OAuthType=", toString(authType))

  timeOut <- paste0("Timeout=", toString(timeOut))

  # default parameters to ensure we can retrieve large ammounts of data
  largeResults <- "AllowLargeResults=0" # this is enabled for a value of 1
  highThroughput <- "EnableHighThroughputAPI=1" # enabled with 1

  # Enabling query cache by default possibly reduces the cost
  queryCache <- "UseQueryCache=1" # 1 enables the use of cached queries

  # Logging
  if (logLevel > 0 & logPath != ''){
    logPath <- paste0("LogPath=", logPath)
  }

  logLevel <- paste0("LogLevel=", toString(logLevel))

  # Proxy configuration
  if (proxyHost != '' & proxyPort != '') {proxyHost <- paste0("ProxyHost=", proxyHost)}
  if (proxyPort != '' & proxyHost != '') {proxyPort <- paste0("ProxyPort=", proxyPort)}
  if (proxyUser != '' & proxyHost != '' & proxyPort != '') {proxyUser <- paste0("ProxyUser=", proxyUser)}
  if (proxyPwd != '' & proxyHost != '' & proxyPort != '') {proxyPwd <- paste0("ProxyPassword=", proxyPwd)}

  connectionString <- paste(c(hostPort,
                              projectId,
                              defaultDataset,
                              authType,
                              clientId,
                              refreshToken,
                              clientSecret,
                              accountEmail,
                              keyPath,
                              timeOut,
                              largeResults,
                              highThroughput,
                              queryCache,
                              logLevel,
                              logPath,
                              proxyHost,
                              proxyPort,
                              proxyUser,
                              proxyPwd
  ), collapse=";")

  connectionString <- gsub(pattern = ';{2,}', replacement = ';', connectionString)
  return(connectionString)

}