OHDSI / DatabaseConnector

An R package for connecting to databases using JDBC.
http://ohdsi.github.io/DatabaseConnector/
54 stars 80 forks source link

Error in uploading vocabularies csv files in BigQuery #201

Closed Shalini-3 closed 1 year ago

Shalini-3 commented 2 years ago

Dear Team,

I am using the R script to upload ATHENA generated csv files into BigQuery. I have followed the discussion taken the R code from The R package to upload ATHENA-generated vocabulary csv files into DB and https://github.com/ABMI/OmopVocaManager.

I am not facing any error but the vocabularies are not loaded either. May I ask , if we can load via SQL using BQ commands and not do it using R?

konstjar commented 2 years ago

@Shalini-3 Unfortunately, it's not possible to import ATHENA CSV files directly in BQ with bq load --source_format=CSV command. The problem is in dates. BQ does not understand the date format in files and there is/was no way to specify date format.

Our solution was to implement this flow: CSV -> PostgreSQL -> CSV -> GCP Bucket -> BQ (with "bq load" command). Since we did it last year, maybe 'bq load' got some support for date parsing.

schuemie commented 2 years ago

@konstjar : I think the question is why DatabaseConnector::insertTable(), as used in OmopVocaManager, is not working for BQ. So this does not appear to be a question about bq load?

konstjar commented 2 years ago

@schuemie Yes, I answered the second question but the root cause should be investigated. I'm going to check in test environment and see where problem comes from.

@Shalini-3 Could you please share what version of BigQuery JDBC Driver you use?

Shalini-3 commented 2 years ago

@konstjar @schuemie I am using 1.2.23.1027 version of BigQuery JDBC Driver.

ablack3 commented 2 years ago

Here is a reprex showing that insertTable is not working for bigquery. I don't know the root cause of the error but it looks like a bug to me.

library(DatabaseConnector)
bqDriverPath <- "/Users/adamblack/jdbc_drivers/SimbaJDBCDriverforGoogleBigQuery42_1.2.22.1026"
connectionDetails <- createConnectionDetails(dbms="bigquery",
                                             connectionString=Sys.getenv("BIGQUERY_CONNECTION_STRING"),
                                             user="",
                                             password='',
                                             pathToDriver = bqDriverPath)

con <- connect(connectionDetails)
#> Connecting using BigQuery driver

insertTable(con, databaseSchema = "scratch", "cars", cars)
#> Error in `.createErrorReport()`:
#> ! Error executing SQL:
#> java.sql.SQLException: [Simba][BigQueryJDBCDriver](100032) Error executing query job. Message: Table "rgy2syz9kcrpjskstureqabhieui" must be qualified with a dataset (e.g. dataset.table).
#> An error report has been created at  /private/var/folders/xx/01v98b6546ldnm1rg1_bvk000000gn/T/Rtmps34LqB/reprex-fb6e2e26366c-pious-rhino/errorReportSql.txt

#> Backtrace:
#>      ▆
#>   1. ├─DatabaseConnector::insertTable(...)
#>   2. └─DatabaseConnector:::insertTable.default(...)
#>   3.   └─DatabaseConnector:::ctasHack(...)
#>   4.     └─DatabaseConnector::executeSql(...)
#>   5.       └─base::tryCatch(...)
#>   6.         └─base (local) tryCatchList(expr, classes, parentenv, handlers)
#>   7.           └─base (local) tryCatchOne(expr, names, parentenv, handlers[[1L]])
#>   8.             └─value[[3L]](cond)
#>   9.               └─DatabaseConnector:::.createErrorReport(connection@dbms, err$message, sqlStatement, errorReportFile)
#>  10.                 └─rlang::abort(...)

insertTable(con, databaseSchema = "scratch", "scratch.cars", cars)
#> Error in `.createErrorReport()`:
#> ! Error executing SQL:
#> java.sql.SQLException: [Simba][BigQueryJDBCDriver](100032) Error executing query job. Message: Not found: Project scratch
#> An error report has been created at  /private/var/folders/xx/01v98b6546ldnm1rg1_bvk000000gn/T/Rtmps34LqB/reprex-fb6e2e26366c-pious-rhino/errorReportSql.txt

#> Backtrace:
#>      ▆
#>   1. ├─DatabaseConnector::insertTable(...)
#>   2. └─DatabaseConnector:::insertTable.default(...)
#>   3.   └─DatabaseConnector::renderTranslateExecuteSql(...)
#>   4.     └─DatabaseConnector::executeSql(...)
#>   5.       └─base::tryCatch(...)
#>   6.         └─base (local) tryCatchList(expr, classes, parentenv, handlers)
#>   7.           └─base (local) tryCatchOne(expr, names, parentenv, handlers[[1L]])
#>   8.             └─value[[3L]](cond)
#>   9.               └─DatabaseConnector:::.createErrorReport(connection@dbms, err$message, sqlStatement, errorReportFile)
#>  10.                 └─rlang::abort(...)

insertTable(con, databaseSchema = "scratch", "cars", cars, bulkLoad = FALSE)
#> Error in `.createErrorReport()`:
#> ! Error executing SQL:
#> java.sql.SQLException: [Simba][BigQueryJDBCDriver](100032) Error executing query job. Message: Table "rgy2syz9lhkbpvnriyhocbuovmbd" must be qualified with a dataset (e.g. dataset.table).
#> An error report has been created at  /private/var/folders/xx/01v98b6546ldnm1rg1_bvk000000gn/T/Rtmps34LqB/reprex-fb6e2e26366c-pious-rhino/errorReportSql.txt

#> Backtrace:
#>      ▆
#>   1. ├─DatabaseConnector::insertTable(...)
#>   2. └─DatabaseConnector:::insertTable.default(...)
#>   3.   └─DatabaseConnector:::ctasHack(...)
#>   4.     └─DatabaseConnector::executeSql(...)
#>   5.       └─base::tryCatch(...)
#>   6.         └─base (local) tryCatchList(expr, classes, parentenv, handlers)
#>   7.           └─base (local) tryCatchOne(expr, names, parentenv, handlers[[1L]])
#>   8.             └─value[[3L]](cond)
#>   9.               └─DatabaseConnector:::.createErrorReport(connection@dbms, err$message, sqlStatement, errorReportFile)
#>  10.                 └─rlang::abort(...)

disconnect(con)

Created on 2022-09-07 with reprex v2.0.2

ablack3 commented 2 years ago

It looks to me like the error is occurring at these lines of ctasHack.R. I don't really understand what the "CTAS hack" (create table as select hack) is and it would be helpful to have some comments about it in the code. The problem is that we are creating a temp table (why is insertTable creating a temp table?) but temp tables are not supported on bigquery and tempEmulationSchema is null.

I was reading about temp tables on bigquery and apparently bigquery does support temp tables but they might be scoped more narrowly than other database systems. https://cloud.google.com/bigquery/docs/multi-statement-queries

image

update

If you pass tempEmulationSchema to insertTable it works!

library(DatabaseConnector)
bqDriverPath <- "/Users/adamblack/jdbc_drivers/SimbaJDBCDriverforGoogleBigQuery42_1.2.22.1026"
connectionDetails <- createConnectionDetails(dbms="bigquery",
                                             connectionString=Sys.getenv("BIGQUERY_CONNECTION_STRING"),
                                             user="",
                                             password='',
                                             pathToDriver = bqDriverPath)

con <- connect(connectionDetails)
#> Connecting using BigQuery driver

insertTable(con, databaseSchema = "scratch", "cars", cars, tempEmulationSchema = "scratch")

querySql(con, "select * from scratch.cars")
#>    SPEED DIST
#> 1      4    2
#> 2      4   10
#> 3      7   22
#> 4      7    4
#> 5      8   16
#> 6      9   10
#> 7     10   26
...
executeSql(con, "drop table scratch.cars")
#>   |                                                                              |                                                                      |   0%  |                                                                              |======================================================================| 100%
#> Executing SQL took 0.671 secs
disconnect(con)

Created on 2022-09-07 with reprex v2.0.2

ablack3 commented 2 years ago

Here is a reprex using DBI with the bigrquery driver. If you just need to upload vocab tables from R you might be able to use this.

library(bigrquery)
library(DBI)
bq_auth(path = Sys.getenv("BIGQUERY_SERVICE_ACCOUNT_JSON_PATH"))

con <- dbConnect(bigquery(),
                 project = Sys.getenv("BIGQUERY_PROJECT_ID"), 
                 dataset = "scratch")

dbWriteTable(con, "cars", cars)
dbGetQuery(con, "select * from scratch.cars")
#> # A tibble: 50 × 2
#>     dist speed
#>    <int> <int>
#>  1     2     4
#>  2    10     4
#>  3     4     7
#>  4    22     7
#>  5    16     8
#>  6    10     9
#>  7    18    10
#>  8    26    10
#>  9    34    10
#> 10    17    11
#> # … with 40 more rows
#> # ℹ Use `print(n = ...)` to see more rows
dbRemoveTable(con, "cars")
dbDisconnect(con)

Created on 2022-09-07 with reprex v2.0.2

ablack3 commented 2 years ago

@schuemie Is it fair to say that tempEmulationSchema is required for insertTable when the dbms is bigquery?

ablack3 commented 2 years ago

Created a PR that adds a simple check and error message if tempEmulationSchema is null: https://github.com/OHDSI/DatabaseConnector/pull/202

# devtools::install_github("ablack3/DatabaseConnector", "develop")
library(DatabaseConnector)
bqDriverPath <- "/Users/adamblack/jdbc_drivers/SimbaJDBCDriverforGoogleBigQuery42_1.2.22.1026"
connectionDetails <- createConnectionDetails(dbms="bigquery",
                                             connectionString=Sys.getenv("BIGQUERY_CONNECTION_STRING"),
                                             user="",
                                             password='',
                                             pathToDriver = bqDriverPath)

con <- connect(connectionDetails)
#> Connecting using BigQuery driver

insertTable(con, databaseSchema = "scratch", "cars", cars)
#> Error in `insertTable()`:
#> ! tempEmulationSchema is required to use insertTable with bigquery

#> Backtrace:
#>     ▆
#>  1. ├─DatabaseConnector::insertTable(...)
#>  2. └─DatabaseConnector:::insertTable.default(con, databaseSchema = "scratch", "cars", cars)
#>  3.   └─rlang::abort("tempEmulationSchema is required to use insertTable with bigquery")
disconnect(con)

Created on 2022-09-07 with reprex v2.0.2

schuemie commented 2 years ago

The CTAS hack was developed for PDW and RedShift, where INSERT statements are absurdly slow, but CTAS tables are fast.

I don't know if the CTAS hack makes any sense on BQ, but I agree, if it does, it means tempEmulationSchema cannot be NULL.

Could someone who added the original BQ support verify the CTAS hack was turned on intentionally for BQ?

schuemie commented 2 years ago

@ablack3 : I've modified your error message to only be thrown when creating a new table. The CTAS hack is not used when inserting into an existing table.

I also added a comment about the rationale and high-level approach of the CTAS hack.

See https://github.com/OHDSI/DatabaseConnector/commit/36c7316a6549bdaa5eb09425d53145a0259891ee

konstjar commented 2 years ago

@schuemie

Could someone who added the original BQ support verify the CTAS hack was turned on intentionally for BQ?

Yes, BQ has CTAS support without any limitations.

ablack3 commented 2 years ago

@konstjar, Are we using the ctas hack on bigquery because using insert into is absurdly slow?

schuemie commented 2 years ago

@konstjar : It would be good to get an answer to this question. Alternatively, could someone just try inserting with and without the CTAS hack on BQ?

konstjar commented 1 year ago

@schuemie Sorry for delay, it looks like I missed this question. Yes, CTAS is used for BQ because of slow insert into. The same like with Redshift.