OHDSI / DatabaseConnector

An R package for connecting to databases using JDBC.
http://ohdsi.github.io/DatabaseConnector/
54 stars 79 forks source link

can I implement bulkLoad for BQ ? #288

Open javier-gracia-tabuenca-tuni opened 3 weeks ago

javier-gracia-tabuenca-tuni commented 3 weeks ago

Loading a table in BigQuery with DatabaseConnector::insertTable is quite slow

I usually have to use package bigrquery

quick dirty demo image

If you want I can make insertTable with bulkLoad=TRUE to work also BQ using bigrquery, or do you prefer other method such command line, may be more complex but nor need an extra package ??

ablack3 commented 3 weeks ago

We could have a DatabaseConnectorDbiConnection that wraps bigrquery connections similar to how DatabaseConnector wraps other DBI drivers. I think this would make sense.

javier-gracia-tabuenca-tuni commented 2 weeks ago

Thanks @ablack3, Now that you mention this, I remember than 3 years ago I made hack to use DatabaseConnector to work with bigquery using DBI, bcs JDBI was not allowed in our research enviroment.

https://github.com/FINNGEN/DatabaseConnector

It worked very well, but the code is very dirty,

I could work on making this the right way if you think that PR will be accepted ???

javier-gracia-tabuenca-tuni commented 2 weeks ago

that hack was very simple, just used bigrquery to make the DBI connection following how DatabaseConnector uses SQLite.

https://github.com/FINNGEN/DatabaseConnector/blob/a4125c6643ca3933ecb6ff8d5aa62e9b21fff1d4/R/Connect.R#L578

The dirty part comes, where I had to pass some extra parameters for bigrquery, I just added these paramenters to all the necessary functions, this should be different.

schuemie commented 2 weeks ago

@ablack3: If bigrquery has superior performance to the JDBC driver, maybe we should swap them out in DatabaseConnector? (after we've completed DatabaseConnector 7.0.0, when this should be simpler)

ablack3 commented 2 weeks ago

The dirty part comes, where I had to pass some extra parameters for bigrquery, I just added these paramenters to all the necessary functions, this should be different.

@javier-gracia-tabuenca-tuni I think we could follow the pattern used for odbc that uses createDbiConnectionDetails and connectUsingDbi

Otherwise your changes look good to me.

connectSparkUsingOdbc <- function(connectionDetails) {
  inform("Connecting using Spark ODBC driver")
  ensure_installed("odbc")
  dbiConnectionDetails <- createDbiConnectionDetails(
    dbms = connectionDetails$dbms,
    drv = odbc::odbc(),
    Driver = "Simba Spark ODBC Driver",
    Host = connectionDetails$server(),
    uid = connectionDetails$user(),
    pwd = connectionDetails$password(),
    Port = connectionDetails$port()
    # UseNativeQuery=1
  )
  if (!is.null(connectionDetails$extraSettings)) {
    dbiConnectionDetails <- append(
      dbiConnectionDetails,
      connectionDetails$extraSettings
    )
  }
  connection <- connectUsingDbi(dbiConnectionDetails)
  return(connection)
}
javier-gracia-tabuenca-tuni commented 2 weeks ago

thanks @ablack3, that is a good guidance, (that was not there when i made my hack)

I will make a fork, add the changes, work with these on my tools, and if don't find major issues, we can merge these changes after v7.0.0 as suggested by @schuemie

does it sounds like a plan ?

ablack3 commented 2 weeks ago

Hi @javier-gracia-tabuenca-tuni, after talking with @schuemie we were thinking that after v7 you should be able to use DatabaseConnector with bigrquery by using createDbiConnectionDetails.

Ideally DatabaseConnector only uses one driver for each database. Currently that is jdbc for BigQuery but it could be bigrquery in the future if all the tools work well with it. I would say we could give bigrquery a try and compare to the jdbc driver. If bigrquery seems to work better then we can switch to it in a future release. Feel free to make a branch with this change and we can test it out after v7.

javier-gracia-tabuenca-tuni commented 2 weeks ago

ok, sounds good

one more question related to BQ.

There is this forced delay for only bigquery https://github.com/OHDSI/DatabaseConnector/blob/main/R/Sql.R#L307

I coudnt find what is the reason for this. I'd like to know if this is necessary always for BQ, or ir has to do with the driver, and changing the driver will fix this also.

schuemie commented 1 week ago

BigQuery has a rate limit, especially for inserts. If you exceed it, you will get an error. To my knowledge this limit still applies.