OHDSI / SqlRender

This is an R package and Java library for rendering parameterized SQL, and translating it to different SQL dialects.
https://ohdsi.github.io/SqlRender

Correct translation of quotes (e.g. for `offset` field in `note_nlp`) #341

Open ablack3 opened 1 year ago

ablack3 commented 1 year ago

Is this query valid OHDSI-SQL?

select offset from cdmv5.note_nlp;

If it is not valid then how would a new user know that this is incorrect OHDSI-SQL and how to fix it?

If it is valid then I think there is a bug in translation.

library(DatabaseConnector)
connectionDetails <- createConnectionDetails(dbms = "postgresql",
                                             server = "server",
                                             user = "user",
                                             password = Sys.getenv("PASSWORD"),
                                             port = "5444")
con <- connect(connectionDetails)
#> Connecting using PostgreSQL driver

sql <- SqlRender::translate("select * from cdm_531.note_nlp;", DatabaseConnector::dbms(con))
querySql(con, sql)
#>  [1] NOTE_NLP_ID                NOTE_ID                   
#>  [3] SECTION_CONCEPT_ID         SNIPPET                   
#>  [5] OFFSET                     LEXICAL_VARIANT           
#>  [7] NOTE_NLP_CONCEPT_ID        NOTE_NLP_SOURCE_CONCEPT_ID
#>  [9] NLP_SYSTEM                 NLP_DATE                  
#> [11] NLP_DATETIME               TERM_EXISTS               
#> [13] TERM_TEMPORAL              TERM_MODIFIERS            
#> <0 rows> (or 0-length row.names)

sql <- SqlRender::translate("select offset from cdm_531.note_nlp;", DatabaseConnector::dbms(con))
querySql(con, sql)
#> Error in `.createErrorReport()`:
#> ! Error executing SQL:
#> org.postgresql.util.PSQLException: ERROR: syntax error at or near "from"
#>   Position: 15
#> An error report has been created at  /private/var/folders/xx/01v98b6546ldnm1rg1_bvk000000gn/T/RtmpRi3syZ/reprex-7d7bbb05e3a-vivid-eider/errorReportSql.txt
#> Backtrace:
#>     ▆
#>  1. └─DatabaseConnector::querySql(con, sql)
#>  2.   └─base::tryCatch(...)
#>  3.     └─base (local) tryCatchList(expr, classes, parentenv, handlers)
#>  4.       └─base (local) tryCatchOne(expr, names, parentenv, handlers[[1L]])
#>  5.         └─value[[3L]](cond)
#>  6.           └─DatabaseConnector:::.createErrorReport(dbms(connection), err$message, sql, errorReportFile)
#>  7.             └─rlang::abort(...)

disconnect(con)

Created on 2023-08-09 with reprex v2.0.2

contents of errorReportSql.txt

DBMS:
postgresql

Error:
org.postgresql.util.PSQLException: ERROR: syntax error at or near "from"
  Position: 15

SQL:
select offset from cdmv5.note_nlp limit 1;

R version:
R version 4.2.2 (2022-10-31)

Platform:
x86_64-apple-darwin17.0

Attached base packages:
- stats
- graphics
- grDevices
- utils
- datasets
- methods
- base

Other attached packages:
- DatabaseConnector (6.2.3)
- testthat (3.1.10)

schuemie commented 1 year ago

It appears 'offset' is a reserved word in PostgreSQL. It is unfortunate that it is used in the CDM specifications. It seems the CDM folks are aware of this, since the field name is quoted on the CDM website.

So it is not valid OhdsiSql, but to your point, there's no way a user would know this. The current workaround would be to quote the field name, but that would make the field name case sensitive as well, which might cause problems for some sites in OHDSI.
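
To make the workaround concrete, here is a minimal sketch (not from the thread) using SqlRender::translate alone, with a made-up schema name; exact output depends on the SqlRender version:

library(SqlRender)

# Quoting the reserved word avoids the PostgreSQL parser error, but a quoted
# identifier is case-sensitive on most platforms, so "offset" must match the
# case used when the column was created.
translate('SELECT "offset" FROM cdm.note_nlp;', targetDialect = "postgresql")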

ablack3 commented 1 year ago

> The current workaround would be to quote the field name

I tried testing `SELECT "offset" FROM @cdmDatabaseSchema.NOTE_NLP;` across different dbms with renderTranslateQuerySql, with some success. What makes it a bit time-consuming is that not all of the OHDSI test databases have the note_nlp table. The quoted column name has to be lowercase for it to work, and I'm still not sure it works on all OHDSI databases.

In general, how do I know whether my SQL is valid OHDSI-SQL?

ablack3 commented 1 year ago

The workaround seems to work across many of the OHDSI platforms, but I think there is an issue with BigQuery.

library(DatabaseConnector)

# cross platform sql
ohdsisql <- 'SELECT "offset" FROM @cdmDatabaseSchema.NOTE_NLP;'

# postgres ----
connection <- connect(dbms = "postgresql", 
                      user = "postgres", 
                      password = "", 
                      server = "localhost/covid")
#> Connecting using PostgreSQL driver

cdmDatabaseSchema <- "cdm5"

renderTranslateQuerySql(connection, ohdsisql, cdmDatabaseSchema = cdmDatabaseSchema)
#> [1] OFFSET
#> <0 rows> (or 0-length row.names)
disconnect(connection)

# oracle ----
connectionDetails <- createConnectionDetails(dbms = "oracle",
                                             user = Sys.getenv("CDM5_ORACLE_USER"),
                                             password = Sys.getenv("CDM5_ORACLE_PASSWORD"),
                                             server = Sys.getenv("CDM5_ORACLE_SERVER"))

connection <- connect(connectionDetails)
#> Connecting using Oracle driver
#> - using THIN to connect
cdmDatabaseSchema <- Sys.getenv("CDM5_ORACLE_CDM_SCHEMA")
renderTranslateQuerySql(connection, ohdsisql, cdmDatabaseSchema = "OHDSI") 
#> [1] OFFSET
#> <0 rows> (or 0-length row.names)
disconnect(connection)

# redshift ----
connection <- connect(dbms = "redshift",
                      server = Sys.getenv("CDM5_REDSHIFT_SERVER"),
                      user = Sys.getenv("CDM5_REDSHIFT_USER"),
                      password = Sys.getenv("CDM5_REDSHIFT_PASSWORD"),
                      port = Sys.getenv("CDM5_REDSHIFT_PORT"))
#> Connecting using Redshift driver

cdmDatabaseSchema <- Sys.getenv("CDM5_REDSHIFT_CDM_SCHEMA")
renderTranslateQuerySql(connection, ohdsisql, cdmDatabaseSchema = cdmDatabaseSchema)
#> [1] OFFSET
#> <0 rows> (or 0-length row.names)
disconnect(connection)

# snowflake ---- 
connection <- connect(
  dbms = "snowflake",
  connectionString = Sys.getenv("SNOWFLAKE_CONNECTION_STRING"),
  user = Sys.getenv("SNOWFLAKE_USER"),
  password = Sys.getenv("SNOWFLAKE_PASSWORD")
)
#> Connecting using Snowflake driver

cdmDatabaseSchema <- Sys.getenv("SNOWFLAKE_CDM_SCHEMA")
renderTranslateQuerySql(connection, ohdsisql, cdmDatabaseSchema = cdmDatabaseSchema)
#> [1] OFFSET
#> <0 rows> (or 0-length row.names)
disconnect(connection)

# bigquery ----
bqDriverPath <- "/Users/adamblack/jdbc_drivers/SimbaJDBCDriverforGoogleBigQuery42_1.2.22.1026"
connectionDetails <- createConnectionDetails(dbms="bigquery",
                                             connectionString=Sys.getenv("BIGQUERY_CONNECTION_STRING"),
                                             user="",
                                             password='',
                                             pathToDriver = bqDriverPath)

connection <- connect(connectionDetails)
#> Connecting using BigQuery driver
cdmDatabaseSchema <- "synpuf_110k"
renderTranslateQuerySql(connection, ohdsisql, cdmDatabaseSchema = cdmDatabaseSchema)
#> [1] F0_
#> <0 rows> (or 0-length row.names)
disconnect(connection)

# The returned column name is unexpected on bigquery. Maybe I'm doing something wrong here...

# sql server ----
connectionDetails <- createConnectionDetails(
  dbms = "sql server",
  server = Sys.getenv("CDM5_SQL_SERVER_SERVER"),
  user = Sys.getenv("CDM5_SQL_SERVER_USER"),
  password = Sys.getenv("CDM5_SQL_SERVER_PASSWORD"),
  port = Sys.getenv("CDM5_SQL_SERVER_PORT")
)

connection <- connect(connectionDetails)
#> Connecting using SQL Server driver
cdmDatabaseSchema <- "cdmv54.dbo"
renderTranslateQuerySql(connection, ohdsisql, cdmDatabaseSchema = cdmDatabaseSchema)
#> [1] OFFSET
#> <0 rows> (or 0-length row.names)
disconnect(connection)

# sqlite ----
connectionDetails <- Eunomia::getEunomiaConnectionDetails()
connection <- connect(connectionDetails)
#> Connecting using SQLite driver
renderTranslateQuerySql(connection, ohdsisql, cdmDatabaseSchema = "main")
#> [1] OFFSET
#> <0 rows> (or 0-length row.names)
disconnect(connection)

Created on 2023-08-11 with reprex v2.0.2

ablack3 commented 1 year ago

In general, quoting columns seems to work on some databases but not all.

library(DatabaseConnector)

testOhdsiSql <- function(ohdsisql) {

  renderTranslateQuerySql <- purrr::safely(renderTranslateQuerySql, quiet = F)

  cli::cat_rule("postgres ----")
  cli::cat_line()
  connection <- connect(dbms = "postgresql", 
                        user = "postgres", 
                        password = "", 
                        server = "localhost/covid")

  cdmDatabaseSchema <- "cdm5"

  print(dplyr::tibble(
    renderTranslateQuerySql(connection, ohdsisql, cdmDatabaseSchema = cdmDatabaseSchema)[[1]]
  ))

  disconnect(connection)

  cli::cat_rule("oracle ----")
  connectionDetails <- createConnectionDetails(dbms = "oracle",
                                               user = Sys.getenv("CDM5_ORACLE_USER"),
                                               password = Sys.getenv("CDM5_ORACLE_PASSWORD"),
                                               server = Sys.getenv("CDM5_ORACLE_SERVER"))

  connection <- connect(connectionDetails)
  cdmDatabaseSchema <- Sys.getenv("CDM5_ORACLE_CDM_SCHEMA")
  print(dplyr::tibble(
    renderTranslateQuerySql(connection, ohdsisql, cdmDatabaseSchema = cdmDatabaseSchema)[[1]]
  ))
  disconnect(connection)

  cli::cat_rule("redshift ----")
  cli::cat_line()
  connection <- connect(dbms = "redshift",
                        server = Sys.getenv("CDM5_REDSHIFT_SERVER"),
                        user = Sys.getenv("CDM5_REDSHIFT_USER"),
                        password = Sys.getenv("CDM5_REDSHIFT_PASSWORD"),
                        port = Sys.getenv("CDM5_REDSHIFT_PORT"))

  cdmDatabaseSchema <- Sys.getenv("CDM5_REDSHIFT_CDM_SCHEMA")
  print(dplyr::tibble(
    renderTranslateQuerySql(connection, ohdsisql, cdmDatabaseSchema = cdmDatabaseSchema)[[1]]
  ))
  disconnect(connection)

  cli::cat_rule("snowflake ----")
  cli::cat_line()
  connection <- connect(
    dbms = "snowflake",
    connectionString = Sys.getenv("SNOWFLAKE_CONNECTION_STRING"),
    user = Sys.getenv("SNOWFLAKE_USER"),
    password = Sys.getenv("SNOWFLAKE_PASSWORD")
  )

  cdmDatabaseSchema <- Sys.getenv("SNOWFLAKE_CDM_SCHEMA")
  print(dplyr::tibble(
    renderTranslateQuerySql(connection, ohdsisql, cdmDatabaseSchema = cdmDatabaseSchema)[[1]]
  ))
  disconnect(connection)

  cli::cat_rule("bigquery ----")
  cli::cat_line()
  bqDriverPath <- "/Users/adamblack/jdbc_drivers/SimbaJDBCDriverforGoogleBigQuery42_1.2.22.1026"
  connectionDetails <- createConnectionDetails(dbms="bigquery",
                                               connectionString=Sys.getenv("BIGQUERY_CONNECTION_STRING"),
                                               user="",
                                               password='',
                                               pathToDriver = bqDriverPath)

  connection <- connect(connectionDetails)
  cdmDatabaseSchema <- "synpuf_110k"
  print(dplyr::tibble(
    renderTranslateQuerySql(connection, ohdsisql, cdmDatabaseSchema = cdmDatabaseSchema)[[1]]
  ))

  disconnect(connection)

  cli::cat_rule("sql server ----")
  cli::cat_line()
  connectionDetails <- createConnectionDetails(
    dbms = "sql server",
    server = Sys.getenv("CDM5_SQL_SERVER_SERVER"),
    user = Sys.getenv("CDM5_SQL_SERVER_USER"),
    password = Sys.getenv("CDM5_SQL_SERVER_PASSWORD"),
    port = Sys.getenv("CDM5_SQL_SERVER_PORT")
  )

  connection <- connect(connectionDetails)
  cdmDatabaseSchema <- "cdmv54.dbo"
  print(dplyr::tibble(
    renderTranslateQuerySql(connection, ohdsisql, cdmDatabaseSchema = cdmDatabaseSchema)[[1]]
  ))
  disconnect(connection)

  cli::cat_rule("sqlite ----")
  cli::cat_line()
  connectionDetails <- Eunomia::getEunomiaConnectionDetails()
  connection <- connect(connectionDetails)
  print(dplyr::tibble(
    renderTranslateQuerySql(connection, ohdsisql, cdmDatabaseSchema = "main")[[1]]
  ))

  disconnect(connection)
}

# cross platform sql
testOhdsiSql('SELECT TOP 1 "person_id" FROM @cdmDatabaseSchema.person;')
#> ── postgres ---- ───────────────────────────────────────────────────────────────
#> Connecting using PostgreSQL driver
#> # A tibble: 1 × 1
#>   PERSON_ID
#>       <dbl>
#> 1         1
#> ── oracle ---- ─────────────────────────────────────────────────────────────────
#> Connecting using Oracle driver
#> - using THIN to connect
#> Error: Error executing SQL:
#> java.sql.SQLSyntaxErrorException: ORA-00904: "person_id": invalid identifier
#> 
#> An error report has been created at  /private/var/folders/xx/01v98b6546ldnm1rg1_bvk000000gn/T/Rtmp1SsB43/reprex-f4484725db51-cocky-doe/errorReportSql.txt
#> # A tibble: 0 × 0
#> ── redshift ---- ───────────────────────────────────────────────────────────────
#> Connecting using Redshift driver
#> # A tibble: 1 × 1
#>   PERSON_ID
#>       <dbl>
#> 1         2
#> ── snowflake ---- ──────────────────────────────────────────────────────────────
#> Connecting using Snowflake driver
#> Error: Error executing SQL:
#> net.snowflake.client.jdbc.SnowflakeSQLException: SQL compilation error: error line 1 at position 8
#> invalid identifier '"person_id"'
#> An error report has been created at  /private/var/folders/xx/01v98b6546ldnm1rg1_bvk000000gn/T/Rtmp1SsB43/reprex-f4484725db51-cocky-doe/errorReportSql.txt
#> # A tibble: 0 × 0
#> ── bigquery ---- ───────────────────────────────────────────────────────────────
#> Connecting using BigQuery driver
#> # A tibble: 1 × 1
#>   F0_      
#>   <chr>    
#> 1 person_id
#> ── sql server ---- ─────────────────────────────────────────────────────────────
#> Connecting using SQL Server driver
#> # A tibble: 1 × 1
#>   PERSON_ID
#>       <dbl>
#> 1         0
#> ── sqlite ---- ─────────────────────────────────────────────────────────────────
#> Connecting using SQLite driver
#> # A tibble: 1 × 1
#>   PERSON_ID
#>       <dbl>
#> 1         6
testOhdsiSql('SELECT TOP 1 person_id FROM @cdmDatabaseSchema.person;')
#> ── postgres ---- ───────────────────────────────────────────────────────────────
#> Connecting using PostgreSQL driver
#> # A tibble: 1 × 1
#>   PERSON_ID
#>       <dbl>
#> 1         1
#> ── oracle ---- ─────────────────────────────────────────────────────────────────
#> Connecting using Oracle driver
#> - using THIN to connect
#> # A tibble: 1 × 1
#>   PERSON_ID
#>       <dbl>
#> 1       540
#> ── redshift ---- ───────────────────────────────────────────────────────────────
#> Connecting using Redshift driver
#> # A tibble: 1 × 1
#>   PERSON_ID
#>       <dbl>
#> 1         2
#> ── snowflake ---- ──────────────────────────────────────────────────────────────
#> Connecting using Snowflake driver
#> # A tibble: 1 × 1
#>   PERSON_ID
#>       <dbl>
#> 1         1
#> ── bigquery ---- ───────────────────────────────────────────────────────────────
#> Connecting using BigQuery driver
#> # A tibble: 1 × 1
#>   PERSON_ID
#>       <dbl>
#> 1    112989
#> ── sql server ---- ─────────────────────────────────────────────────────────────
#> Connecting using SQL Server driver
#> # A tibble: 1 × 1
#>   PERSON_ID
#>       <dbl>
#> 1         0
#> ── sqlite ---- ─────────────────────────────────────────────────────────────────
#> Connecting using SQLite driver
#> # A tibble: 1 × 1
#>   PERSON_ID
#>       <dbl>
#> 1         6

Created on 2023-08-11 with reprex v2.0.2

schuemie commented 11 months ago

Thanks! This is really helpful.

The weird behavior on BigQuery is because it uses backticks instead of double quotes. I added a translation rule in the develop branch that I still need to test a bit more.
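
As an illustration of the intended behavior (a sketch, not taken from the develop-branch tests; the actual rule may differ), a double-quoted identifier should come out backtick-quoted when translating to BigQuery:

# Expected (roughly): SELECT `offset` FROM cdm.note_nlp;
SqlRender::translate('SELECT "offset" FROM cdm.note_nlp;', targetDialect = "bigquery")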

For Snowflake I'll need a testing server to figure out what is going on.

schuemie commented 11 months ago

Thanks to Odysseus I now have a testing server for Snowflake!

On Snowflake the issue here was that by default field names are created in uppercase, while the search for "person_id" was looking for lowercase. I implemented a simple solution in DatabaseConnector that sets Snowflake to be case-insensitive when using quoted identifiers (for the duration of the session).
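
For reference, a minimal sketch of that kind of session-level setting, assuming `connection` is an open Snowflake DatabaseConnector connection; the actual DatabaseConnector change may be implemented differently:

# QUOTED_IDENTIFIERS_IGNORE_CASE is a Snowflake session parameter that makes
# quoted identifiers case-insensitive for the remainder of the session.
DatabaseConnector::executeSql(
  connection,
  "ALTER SESSION SET QUOTED_IDENTIFIERS_IGNORE_CASE = TRUE;"
)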

So with the develop versions of SqlRender and DatabaseConnector quotes should work for BigQuery and Snowflake.

(Note that I probably will change the DatabaseConnector dbplyr implementation to also use quotes. I had not done so in the past, to avoid issues with case like in Snowflake, but now I think it is better to be more consistent with how dbplyr works for other platforms.)

ablack3 commented 11 months ago

Woo!

So I know we have discussed this a bit in the past and may have different opinions, which is totally fine, but I'm bringing it up again: I would like to use DatabaseConnector with dbplyr's SQL translations, so that DatabaseConnector is the DBI driver and dbplyr provides the SQL translation. One reason is that I don't think the SQL generated by dbplyr can be considered valid OHDSI-SQL, in part due to all the quoting of identifiers. Even though OHDSI-SQL is based on MSSQL, you cannot in general expect any valid MSSQL to be valid OHDSI-SQL. In Darwin we are relying on dbplyr quite heavily for SQL translation and it works well on almost all OHDSI databases (BigQuery is still a work in progress).

So, in summary, I want two separate/orthogonal "layers" in the architecture that cleanly divide responsibility: SQL translation and the DBI driver.

What do you think about this approach?
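
A minimal sketch of what that two-layer setup could look like (connection details and schema are hypothetical; this assumes DatabaseConnector's DBI interface works with dbplyr's tbl()):

library(DatabaseConnector)
library(dplyr)

# Layer 1: DatabaseConnector as the DBI driver (hypothetical connection details).
con <- connect(dbms = "postgresql",
               server = "localhost/covid",
               user = "postgres",
               password = "")

# Layer 2: dbplyr generates and translates the SQL for this backend.
person <- tbl(con, dbplyr::in_schema("cdm5", "person"))
person %>%
  select(person_id) %>%
  head(1) %>%
  collect()

disconnect(con)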

If quotes are valid in OHDSI-SQL then wouldn't SqlRender need to translate those quote symbols? Or maybe all dbms use the same quote symbols for identifiers and string literals?

schuemie commented 11 months ago

Now that we dynamically inherit the DatabaseConnector connection from the SQL Server class, we could also dynamically inherit from other classes, thus invoking other dbplyr backends and avoiding the need to translate the dbplyr-generated SQL. But we'd still be 'overriding someone else's class', so I think you still would not be happy with that?

ablack3 commented 11 months ago

Yes, you're right. I think we should add some code to dbplyr that will generate the correct SQL for a DatabaseConnector connection. I'm happy to take that on if you're OK with that. But yes, I do think DatabaseConnector connections should have their own class and not use the Microsoft SQL Server class. Even though you're right that dbplyr is generating SQL, in all normal use the SQL is platform-specific (already translated). Many platforms are supported, including all OHDSI platforms. But it's not perfect by any stretch; I aim to find the gaps and fix them. We'll see how that goes.

Part of my interest here is to see if we can rely on this SQL generation/translation for real studies, and from my limited experience it is easier to write cross-platform dplyr than cross-platform OHDSI-SQL (probably just my limited experience with OHDSI-SQL, though). Ideally someday we will have a function that will tell you whether your OHDSI-SQL is correct without needing test data in every supported dbms. Yesterday I wrote some OHDSI-SQL but had no example data with realistic measurements to test it on.

schuemie commented 11 months ago

It would be great if dbplyr could generate actual OhdsiSql. I suspect that will be a lot of work though.

I think you'll find you'll need to test your dbplyr code on all platforms too, so in that sense it is no different from OhdsiSql.

OhdsiSql is defined, just not very formally. It basically is SQL Server SQL, but restricted to the SQL functions and structures listed here. If we could find an open source SQL validator it should be easy to adapt it for OhdsiSql. But I can't seem to find any.
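
In the meantime, a crude smoke test is possible with SqlRender alone, using its listSupportedDialects(): render the parameterized SQL and translate it to every supported dialect, surfacing any translation warnings. This is only a sketch, not a validator, and it will not catch problems (like the reserved-word issue above) that only fail at execution time:

library(SqlRender)

sql <- 'SELECT "offset" FROM @cdmDatabaseSchema.note_nlp;'
rendered <- render(sql, cdmDatabaseSchema = "cdm")

# Translate to each supported dialect and report any translation warnings.
for (dialect in listSupportedDialects()$dialect) {
  withCallingHandlers(
    translate(rendered, targetDialect = dialect),
    warning = function(w) {
      message(dialect, ": ", conditionMessage(w))
      invokeRestart("muffleWarning")
    }
  )
}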