Closed Shalini-3 closed 1 year ago
@Shalini-3 Unfortunately, it's not possible to import ATHENA CSV files directly in BQ with bq load --source_format=CSV
command. The problem is in dates. BQ does not understand the date format in files and there is/was no way to specify date format.
Our solution was to implement this flow: CSV -> PostgreSQL -> CSV -> GCP Bucket -> BQ (with "bq load" command). Since we did it last year, maybe 'bq load' got some support for date parsing.
@konstjar : I think the question is why DatabaseConnector::insertTable()
, as used in OmopVocaManager, is not working for BQ. So this does not appear to be a question about bq load
?
@schuemie Yes, I answered the second question but the root cause should be investigated. I'm going to check in test environment and see where problem comes from.
@Shalini-3 Could you please share what version of BigQuery JDBC Driver you use?
@konstjar @schuemie I am using 1.2.23.1027 version of BigQuery JDBC Driver.
Here is a reprex showing that insertTable is not working for bigquery. I don't know the root cause of the error but it looks like a bug to me.
library(DatabaseConnector)
bqDriverPath <- "/Users/adamblack/jdbc_drivers/SimbaJDBCDriverforGoogleBigQuery42_1.2.22.1026"
connectionDetails <- createConnectionDetails(dbms="bigquery",
connectionString=Sys.getenv("BIGQUERY_CONNECTION_STRING"),
user="",
password='',
pathToDriver = bqDriverPath)
con <- connect(connectionDetails)
#> Connecting using BigQuery driver
insertTable(con, databaseSchema = "scratch", "cars", cars)
#> Error in `.createErrorReport()`:
#> ! Error executing SQL:
#> java.sql.SQLException: [Simba][BigQueryJDBCDriver](100032) Error executing query job. Message: Table "rgy2syz9kcrpjskstureqabhieui" must be qualified with a dataset (e.g. dataset.table).
#> An error report has been created at /private/var/folders/xx/01v98b6546ldnm1rg1_bvk000000gn/T/Rtmps34LqB/reprex-fb6e2e26366c-pious-rhino/errorReportSql.txt
#> Backtrace:
#> ▆
#> 1. ├─DatabaseConnector::insertTable(...)
#> 2. └─DatabaseConnector:::insertTable.default(...)
#> 3. └─DatabaseConnector:::ctasHack(...)
#> 4. └─DatabaseConnector::executeSql(...)
#> 5. └─base::tryCatch(...)
#> 6. └─base (local) tryCatchList(expr, classes, parentenv, handlers)
#> 7. └─base (local) tryCatchOne(expr, names, parentenv, handlers[[1L]])
#> 8. └─value[[3L]](cond)
#> 9. └─DatabaseConnector:::.createErrorReport(connection@dbms, err$message, sqlStatement, errorReportFile)
#> 10. └─rlang::abort(...)
insertTable(con, databaseSchema = "scratch", "scratch.cars", cars)
#> Error in `.createErrorReport()`:
#> ! Error executing SQL:
#> java.sql.SQLException: [Simba][BigQueryJDBCDriver](100032) Error executing query job. Message: Not found: Project scratch
#> An error report has been created at /private/var/folders/xx/01v98b6546ldnm1rg1_bvk000000gn/T/Rtmps34LqB/reprex-fb6e2e26366c-pious-rhino/errorReportSql.txt
#> Backtrace:
#> ▆
#> 1. ├─DatabaseConnector::insertTable(...)
#> 2. └─DatabaseConnector:::insertTable.default(...)
#> 3. └─DatabaseConnector::renderTranslateExecuteSql(...)
#> 4. └─DatabaseConnector::executeSql(...)
#> 5. └─base::tryCatch(...)
#> 6. └─base (local) tryCatchList(expr, classes, parentenv, handlers)
#> 7. └─base (local) tryCatchOne(expr, names, parentenv, handlers[[1L]])
#> 8. └─value[[3L]](cond)
#> 9. └─DatabaseConnector:::.createErrorReport(connection@dbms, err$message, sqlStatement, errorReportFile)
#> 10. └─rlang::abort(...)
insertTable(con, databaseSchema = "scratch", "cars", cars, bulkLoad = FALSE)
#> Error in `.createErrorReport()`:
#> ! Error executing SQL:
#> java.sql.SQLException: [Simba][BigQueryJDBCDriver](100032) Error executing query job. Message: Table "rgy2syz9lhkbpvnriyhocbuovmbd" must be qualified with a dataset (e.g. dataset.table).
#> An error report has been created at /private/var/folders/xx/01v98b6546ldnm1rg1_bvk000000gn/T/Rtmps34LqB/reprex-fb6e2e26366c-pious-rhino/errorReportSql.txt
#> Backtrace:
#> ▆
#> 1. ├─DatabaseConnector::insertTable(...)
#> 2. └─DatabaseConnector:::insertTable.default(...)
#> 3. └─DatabaseConnector:::ctasHack(...)
#> 4. └─DatabaseConnector::executeSql(...)
#> 5. └─base::tryCatch(...)
#> 6. └─base (local) tryCatchList(expr, classes, parentenv, handlers)
#> 7. └─base (local) tryCatchOne(expr, names, parentenv, handlers[[1L]])
#> 8. └─value[[3L]](cond)
#> 9. └─DatabaseConnector:::.createErrorReport(connection@dbms, err$message, sqlStatement, errorReportFile)
#> 10. └─rlang::abort(...)
disconnect(con)
Created on 2022-09-07 with reprex v2.0.2
It looks to me like the error is occurring at these lines of ctasHack.R. I don't really understand what the "CTAS hack" (create table as select hack) is and it would be helpful to have some comments about it in the code. The problem is that we are creating a temp table (why is insertTable creating a temp table?) but temp tables are not supported on bigquery and tempEmulationSchema is null.
I was reading about temp tables on bigquery and apparently bigquery does support temp tables but they might be scoped more narrowly than other database systems. https://cloud.google.com/bigquery/docs/multi-statement-queries
update
If you pass tempEmulationSchema to insertTable it works!
library(DatabaseConnector)
bqDriverPath <- "/Users/adamblack/jdbc_drivers/SimbaJDBCDriverforGoogleBigQuery42_1.2.22.1026"
connectionDetails <- createConnectionDetails(dbms="bigquery",
connectionString=Sys.getenv("BIGQUERY_CONNECTION_STRING"),
user="",
password='',
pathToDriver = bqDriverPath)
con <- connect(connectionDetails)
#> Connecting using BigQuery driver
insertTable(con, databaseSchema = "scratch", "cars", cars, tempEmulationSchema = "scratch")
querySql(con, "select * from scratch.cars")
#> SPEED DIST
#> 1 4 2
#> 2 4 10
#> 3 7 22
#> 4 7 4
#> 5 8 16
#> 6 9 10
#> 7 10 26
...
executeSql(con, "drop table scratch.cars")
#> | | | 0% | |======================================================================| 100%
#> Executing SQL took 0.671 secs
disconnect(con)
Created on 2022-09-07 with reprex v2.0.2
Here is a reprex using DBI with the bigrquery driver. If you just need to upload vocab tables from R you might be able to use this.
library(bigrquery)
library(DBI)
bq_auth(path = Sys.getenv("BIGQUERY_SERVICE_ACCOUNT_JSON_PATH"))
con <- dbConnect(bigquery(),
project = Sys.getenv("BIGQUERY_PROJECT_ID"),
dataset = "scratch")
dbWriteTable(con, "cars", cars)
dbGetQuery(con, "select * from scratch.cars")
#> # A tibble: 50 × 2
#> dist speed
#> <int> <int>
#> 1 2 4
#> 2 10 4
#> 3 4 7
#> 4 22 7
#> 5 16 8
#> 6 10 9
#> 7 18 10
#> 8 26 10
#> 9 34 10
#> 10 17 11
#> # … with 40 more rows
#> # ℹ Use `print(n = ...)` to see more rows
dbRemoveTable(con, "cars")
dbDisconnect(con)
Created on 2022-09-07 with reprex v2.0.2
@schuemie Is it fair to say that tempEmulationSchema
is required for insertTable
when the dbms is bigquery?
Created a PR that adds a simple check and error message if tempEmulationSchema is null: https://github.com/OHDSI/DatabaseConnector/pull/202
# devtools::install_github("ablack3/DatabaseConnector", "develop")
library(DatabaseConnector)
bqDriverPath <- "/Users/adamblack/jdbc_drivers/SimbaJDBCDriverforGoogleBigQuery42_1.2.22.1026"
connectionDetails <- createConnectionDetails(dbms="bigquery",
connectionString=Sys.getenv("BIGQUERY_CONNECTION_STRING"),
user="",
password='',
pathToDriver = bqDriverPath)
con <- connect(connectionDetails)
#> Connecting using BigQuery driver
insertTable(con, databaseSchema = "scratch", "cars", cars)
#> Error in `insertTable()`:
#> ! tempEmulationSchema is required to use insertTable with bigquery
#> Backtrace:
#> ▆
#> 1. ├─DatabaseConnector::insertTable(...)
#> 2. └─DatabaseConnector:::insertTable.default(con, databaseSchema = "scratch", "cars", cars)
#> 3. └─rlang::abort("tempEmulationSchema is required to use insertTable with bigquery")
disconnect(con)
Created on 2022-09-07 with reprex v2.0.2
The CTAS hack was developed for PDW and RedShift, where INSERT statements are absurdly slow, but CTAS tables are fast.
I don't know if the CTAS hack makes any sense on BQ, but I agree, if it does, it means tempEmulationSchema
cannot be NULL.
Could someone who added the original BQ support verify the CTAS hack was turned on intentionally for BQ?
@ablack3 : I've modified your error message to only be thrown when creating a new table. The CTAS hack is not used when inserting into an existing table.
I also added a comment about the rationale and high-level approach of the CTAS hack.
See https://github.com/OHDSI/DatabaseConnector/commit/36c7316a6549bdaa5eb09425d53145a0259891ee
@schuemie
Could someone who added the original BQ support verify the CTAS hack was turned on intentionally for BQ?
Yes, BQ has CTAS support without any limitations.
@konstjar, Are we using the ctas hack on bigquery because using insert into
is absurdly slow?
@konstjar : It would be good to get an answer to this question. Alternatively, could someone just try inserting with and without the CTAS hack on BQ?
@schuemie Sorry for delay, it looks like I missed this question. Yes, CTAS is used for BQ because of slow insert into
. The same like with Redshift.
Dear Team,
I am using the R script to upload ATHENA generated csv files into BigQuery. I have followed the discussion taken the R code from The R package to upload ATHENA-generated vocabulary csv files into DB and https://github.com/ABMI/OmopVocaManager.
I am not facing any error but the vocabularies are not loaded either. May I ask , if we can load via SQL using BQ commands and not do it using R?