OHDSI / DatabaseConnector

An R package for connecting to databases using JDBC.
http://ohdsi.github.io/DatabaseConnector/
54 stars 81 forks source link

dbplyr adds schema to columns copared in joins, that breaks BigQuery #236

Open javier-gracia-tabuenca-tuni opened 1 year ago

javier-gracia-tabuenca-tuni commented 1 year ago

Sorry, this is hard to make a reproducible example bcs is in bigquery

person <- dplyr::tbl(connection, DatabaseConnector::inDatabaseSchema(database_settings$schemas$CDM, "person"))
observation_period <- dplyr::tbl(connection, DatabaseConnector::inDatabaseSchema(database_settings$schemas$CDM, "observation_period"))

person |> dplyr::left_join(
    observation_period |> dplyr::select(person_id, observation_period_start_date, observation_period_end_date),
    by = "person_id"
)

with options("DEBUG_DATABASECONNECTOR_DBPLYR" = TRUE)

SQL in: SELECT TOP 11 *
FROM (
  SELECT
    atlas-development-270609.finngen_omop_r11.person.*,
    observation_period_start_date,
    observation_period_end_date
  FROM atlas-development-270609.finngen_omop_r11.person
  LEFT JOIN atlas-development-270609.finngen_omop_r11.observation_period
    ON (atlas-development-270609.finngen_omop_r11.person.person_id = atlas-development-270609.finngen_omop_r11.observation_period.person_id)
) q01
SQL out: SELECT  *
from (
  select
    atlas-development-270609.finngen_omop_r11.person.*,
    observation_period_start_date,
    observation_period_end_date
  from atlas-development-270609.finngen_omop_r11.person
  left join atlas-development-270609.finngen_omop_r11.observation_period
    on (atlas-development-270609.finngen_omop_r11.person.person_id = atlas-development-270609.finngen_omop_r11.observation_period.person_id)
) q01 LIMIT 11;
Error in collect(x, n = n) : Failed to collect lazy table.
Caused by error in `value[[3L]]()`:
! java.sql.SQLException: [Simba][BigQueryJDBCDriver](100032) Error executing query job. Message: Syntax error: Expected ")" but got "." at [4:46]

image

image

javier-gracia-tabuenca-tuni commented 1 year ago

Problems seems to come i need to add the "project" to the schema

NOT WORKING person_table <- dplyr::tbl(connection, DatabaseConnector::inDatabaseSchema("atlas-development-270609.finngen_omop_r11","person"))

WORKING person_table <- dplyr::tbl(connection, DatabaseConnector::inDatabaseSchema("finngen_omop_r11","person"))

the first make a CATALOG the second a SCHEMA

image

javier-gracia-tabuenca-tuni commented 1 year ago

this should be easy to fix

https://github.com/OHDSI/DatabaseConnector/blob/5d2e0fbef8f419af1b3d9048379cb8581b87af3a/R/DBI.R#L744

but not sure what you think will be the elegant way

may be adding a parameter dbms to inDatabaseSchema , so that inDatabaseSchema behaves differently for bq and potentially others ??

javier-gracia-tabuenca-tuni commented 1 year ago

i solved it locally as this

tmp_inDatabaseSchema <- function(databaseSchema, table, dbms="") {
  if(dbms=="bigquery"){
    return(dbplyr::in_schema(databaseSchema, table))
  }
  databaseSchema <- strsplit(databaseSchema, "\\.")[[1]]
  if (length(databaseSchema) == 1) {
    return(dbplyr::in_schema(databaseSchema[1], table))
  } else {
    return(dbplyr::in_catalog(databaseSchema[1], databaseSchema[2], table))
  }
}

Im not making a PR bcs Im not sure this is the way you would like to solve it

javier-gracia-tabuenca-tuni commented 1 year ago

this is important bcs in bigquery the path to a table is in the form \<project>.\<dataset>.\<table>

Some times (data project != billing project ) the needs to be specified in order to work

schuemie commented 1 year ago

Yeah, I don't like adding the dbms parameter, because then any code written using in_database_schema() would need to provide that.

It seems to me the problem isn't so much that the project get's mistaken for a catalog, but rather that BigQuery doesn't allow the project name to appear in a join ON statement? So in your first example

SELECT  *
from (
  select
    atlas-development-270609.finngen_omop_r11.person.*,
    observation_period_start_date,
    observation_period_end_date
  from atlas-development-270609.finngen_omop_r11.person
  left join atlas-development-270609.finngen_omop_r11.observation_period
    on (atlas-development-270609.finngen_omop_r11.person.person_id = atlas-development-270609.finngen_omop_r11.observation_period.person_id)
) q01 LIMIT 11;

the SQL seems fine, but BigQuery doesn't approve of having the full reference to the tables in the ON statement. Maybe taht is what we should address?