darwin-eu / CDMConnector

A pipe friendly way to interact with an OMOP Common Data Model
https://darwin-eu.github.io/CDMConnector/
Apache License 2.0

write schema being in a different DB is not tolerated #12

Open DrTTrousers opened 8 months ago

DrTTrousers commented 8 months ago

https://github.com/darwin-eu/CDMConnector/blob/b7fda01d9d1c06d7d386bb0db6f50fe6941432db/R/cdm.R#L122C4-L122C38

Not everyone has write access to the same db that contains the CDM.

For these situations it is helpful to be able to specify database.schema for the write location.

Current behaviour: supplying database.schema to writeSchema results in cdm.database.schema being created, which is an invalid name.
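
A minimal sketch of the failing call (the connection and schema names here are placeholders, not from an actual setup):

library(CDMConnector)

# con is an existing DBI connection; the CDM lives in the connection's database
cdm <- cdm_from_con(
  con,
  cdm_schema = "cdm_schema",
  write_schema = "scratch_db.scratch_schema"  # write location in another database
)
# The dotted write location gets nested under the CDM database, producing
# an invalid name of the form cdm.scratch_db.scratch_schema.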

ablack3 commented 8 months ago

Yes, we assume that the write_schema is in the same database as the CDM data. I'm not sure if we consider this (a scratch schema in the same database as the CDM data) a requirement to run studies, but we might.

Tagging @catalamarti, @edward-burn

We are trying to be a bit strict about requirements rather than having a lot of workarounds.

DrTTrousers commented 8 months ago

Is there a reason you would require the scratch schema to be in the same database as the CDM? That seems like a meaningless restriction; the only thing that needs to change is the string for the schema location.

Requiring the scratch schema to be in the same database as the CDM means that, e.g., a site must keep read-only data and write data in the same db. That in turn means data must be duplicated rather than centrally shared, as we are now trying to do.

ablack3 commented 8 months ago

It's kind of a matter of how different database systems think about scoping. Our goal is to create an interface that works the same across multiple database platforms. I'm not sure we can copy data across databases on all supported database backends, though; I'll investigate. As long as that works on all supported platforms then it should be fine. But if some databases (e.g. Oracle) do not allow us to copy across databases, then we need logic like "if the database system is Snowflake then copying across databases is ok, but if it is Oracle then we need to give an informative error."
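
As a hypothetical sketch of what that logic might look like (check_cross_database_write is made up for illustration, and the list of backends allowing cross-database copying is assumed, not verified):

# Hypothetical guard: only allow a write_schema in a different database on
# backends assumed here to support cross-database references.
check_cross_database_write <- function(con, cdm_schema, write_schema) {
  backend <- CDMConnector::dbms(con)  # platform name, e.g. "snowflake"
  supports_cross_db <- backend %in% c("snowflake", "sql server")  # illustrative list
  if (!identical(cdm_schema[["catalog"]], write_schema[["catalog"]]) &&
      !supports_cross_db) {
    stop("Copying across databases is not supported on ", backend,
         "; write_schema must be in the same database as the CDM.")
  }
  invisible(TRUE)
}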

In any case I think it is helpful to hear your perspective and the need for cross-database copying. I'm sure we can make it work, but we are trying to be really clear about what we expect from a CDM. So far my understanding has been that we need a scratch schema in the same database, but I could revise that requirement to simply "a scratch schema that we can copy data to".

JTBrash commented 8 months ago

Just to add to this chain. @ablack3, we have some urgency with this because, upon completion of our server migration, we will not be able to run any DARWIN studies if CDMConnector cannot accommodate a write schema in a different database.

ablack3 commented 8 months ago

Ok, I'll fix this asap. Two questions: you're on Snowflake, right? And can you use lowercase names on your new database, or only uppercase?

DrTTrousers commented 8 months ago

Hi Adam, unquoted identifiers are automatically converted to the database's upper-case standard when written, and are subsequently case insensitive as long as they are not quoted.

When names are returned in a query they will be upper case, so any string matches will need to be case corrected; a tolower() is the easiest solution.

Any quoted identifiers will be written case sensitively and will only be returned case sensitively using quoted identifiers.

https://docs.snowflake.com/en/sql-reference/parameters#quoted-identifiers-ignore-case
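
As a concrete illustration of that behaviour via DBI (assuming con is a Snowflake connection; the table name is a placeholder):

# Unquoted identifiers fold to upper case on write...
DBI::dbExecute(con, "CREATE TABLE my_table (x INT)")
DBI::dbListTables(con)  # contains "MY_TABLE", not "my_table"

# ...so string matches against returned names need case correction
"my_table" %in% tolower(DBI::dbListTables(con))  # TRUE

# Quoted identifiers keep their case and must be quoted on every reference
DBI::dbExecute(con, 'CREATE TABLE "my_table" (x INT)')  # distinct from MY_TABLE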

ablack3 commented 8 months ago

Hi @JTBrash and @DrTTrousers,

Will you try running this code and see if it works for you? I'm using the current CRAN release and it seems to work. It also works with the current development version (the main branch on GitHub).

DBI has a concept of "catalog" that I think we can use to point to a write_schema in a different database.
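
For example, DBI::Id() lets a table be qualified with a catalog component when writing directly through DBI (the names below are placeholders, not from this thread):

# Write a table into a schema in a different database than the connection default
tbl_id <- DBI::Id(catalog = "ATLAS", schema = "RESULTS", table = "my_results")
DBI::dbWriteTable(con, tbl_id, data.frame(x = 1:3), overwrite = TRUE)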

install.packages("CDMConnector")
#> 
#> The downloaded binary packages are in
#>  /var/folders/xx/01v98b6546ldnm1rg1_bvk000000gn/T//RtmpOUcEZ6/downloaded_packages

library(CDMConnector)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

con <- DBI::dbConnect(odbc::odbc(),
                      SERVER = Sys.getenv("SNOWFLAKE_SERVER"),
                      UID = Sys.getenv("SNOWFLAKE_USER"),
                      PWD = Sys.getenv("SNOWFLAKE_PASSWORD"),
                      DATABASE = "OMOP_SYNTHETIC_DATASET",
                      WAREHOUSE = "COMPUTE_WH_XS",
                      DRIVER = Sys.getenv("SNOWFLAKE_DRIVER"))

cdm_schema <- c(catalog = "OMOP_SYNTHETIC_DATASET", schema = "CDM53")

# Named components are optional if there are just two; they are assumed to be database, schema.
write_schema <- c(catalog = "ATLAS", schema = "RESULTS")
write_schema <- c("ATLAS", "RESULTS")

# Optionally add a prefix. If a prefix is used then the schema components should be named.
write_schema <- c(catalog = "ATLAS", schema = "RESULTS", prefix = "temp_")

cdm <- cdm_from_con(con, cdm_schema, write_schema)

new_table <- cdm$person %>%
  head(5) %>%
  computeQuery(name = "person_subset",
               temporary = FALSE,
               schema = write_schema,
               overwrite = TRUE)

new_table
#> # Source:   table<temp_person_subset> [5 x 18]
#> # Database: Snowflake 8.1.0[ohdsi@Snowflake/ATLAS]
#>   person_id gender_concept_id year_of_birth month_of_birth day_of_birth
#>       <dbl>             <dbl>         <dbl>          <dbl>        <dbl>
#> 1         1              8507          1923              5            1
#> 2         2              8507          1943              1            1
#> 3         3              8532          1936              9            1
#> 4         4              8507          1941              6            1
#> 5         5              8507          1936              8            1
#> # ℹ 13 more variables: birth_datetime <dttm>, race_concept_id <dbl>,
#> #   ethnicity_concept_id <dbl>, location_id <dbl>, provider_id <dbl>,
#> #   care_site_id <dbl>, person_source_value <chr>, gender_source_value <chr>,
#> #   gender_source_concept_id <dbl>, race_source_value <chr>,
#> #   race_source_concept_id <dbl>, ethnicity_source_value <chr>,
#> #   ethnicity_source_concept_id <dbl>

new_table %>%
  collect() %>%
  nrow()
#> [1] 5

DBI::dbDisconnect(con)

Created on 2024-01-16 with reprex v2.0.2