OHDSI / Eunomia

An R package that facilitates access to a variety of OMOP CDM sample data sets.
https://ohdsi.github.io/Eunomia/

Option to use duckdb or SQLite #28

Closed ablack3 closed 1 year ago

ablack3 commented 2 years ago

This is just an idea I'd like some feedback on, to see if it is worth implementing.

The OMOP CDM has a lot of date columns. SQLite does not have a date type and thus also does not support the date manipulation functions used in OHDSI-SQL. SqlRender manages to work around this limitation to a large extent, but there tend to be edge cases where the lack of a date type causes frustration for users.

It is not very difficult to find examples of valid OHDSI-SQL date manipulation that produces incorrect results on SQLite.

library(Eunomia)
#> Loading required package: DatabaseConnector

cd <- getEunomiaConnectionDetails()
con <- connect(cd)
#> Connecting using SQLite driver

# valid OHDSI-SQL that produces an incorrect result
renderTranslateQuerySql(con, "select eomonth(condition_start_date) as eomonth from main.condition_occurrence limit 10")
#>       EOMONTH
#> 1  1446249600
#> 2  1320019200
#> 3   446860800
#> 4  1133308800
#> 5   144460800
#> 6   675648000
#> 7   307497600
#> 8   933379200
#> 9   654652800
#> 10  509932800

# I'm assuming the expected result of eomonth is a date not an integer
# The return type of an OHDSI SQL function should be the same across database platforms right?
disconnect(con)

Created on 2022-08-18 with reprex v2.0.2

duckdb is a file-based database like SQLite, but it has a date type and supports date manipulation functions. Is there interest in using duckdb with Eunomia in place of SQLite?

Tagging @edward-burn since I think he has also experienced issues with dates in Eunomia.

duckdb was recently added as a supported SQL dialect in the development branch of SqlRender.
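To illustrate the contrast, here is a minimal sketch (added here, not part of the original reprex) assuming the DBI and duckdb packages are installed. duckdb's `last_day` function, its equivalent of `eomonth`, returns a proper DATE rather than an integer:

```r
library(DBI)

# Connect to an in-memory duckdb database
con <- dbConnect(duckdb::duckdb(), ":memory:")
dbExecute(con, "CREATE TABLE condition_occurrence (condition_start_date DATE)")
dbExecute(con, "INSERT INTO condition_occurrence VALUES (DATE '2008-02-10')")

# The result keeps the DATE type instead of collapsing to an integer
dbGetQuery(con, "SELECT last_day(condition_start_date) AS eomonth
                 FROM condition_occurrence")
#>      eomonth
#> 1 2008-02-29

dbDisconnect(con, shutdown = TRUE)
```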

edward-burn commented 2 years ago

+1 for using duckdb for demonstration and testing purposes as it seems to deal with dates much better than SQLite (with the performance drawback of duckdb not such a problem when working with a small dataset).

Although duckdb is very similar to SQLite to set up, for Eunomia this might mean some changes. For example, with duckdb I think https://github.com/OHDSI/Eunomia/blob/52d61768e31de9f1d62a7680962bce97a8a3fa71/tests/testthat/test-basic.R#L19 would change to "SELECT COUNT(*) FROM person;". This might then cause some headaches for users of Eunomia, since it could potentially break a test like this one: https://github.com/OHDSI/FeatureExtraction/blob/bddb9ca9ce946a540b04e7bfa0a2465344b7b249/tests/testthat/test-Table1.R#L24

edward-burn commented 2 years ago

If using duckdb, perhaps the change could be linked with work on this issue: https://github.com/OHDSI/Eunomia/issues/23. If a function to download the data was being used, perhaps this could give the user the option to download data as either SQLite or duckdb?
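Purely as a sketch of what that interface might look like (the function name and arguments below are hypothetical, not an existing Eunomia API):

```r
# Hypothetical sketch only: nothing here exists in Eunomia at this point.
downloadEunomiaData <- function(datasetName = "GiBleed",
                                dbms = c("sqlite", "duckdb"),
                                pathToData = tempdir()) {
  dbms <- match.arg(dbms)
  # ...download the requested dataset in the requested file format...
}
```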

fdefalco commented 2 years ago

I'm open to trying duckdb. I agree with @edward-burn that combining this with that issue might make sense. If we are going to provide multiple sample data sets, they could be in duckdb, or they could be gzipped CSV files which we then load into duckdb, since it seems to support that. However, I've had no luck importing the CSVs that Eunomia drops into duckdb: its auto CSV read isn't inferring the columns well enough to work. Possibly I'm doing it wrong, as I've never used duckdb before. Here is what I tried:

library("DBI")
connection <- DBI::dbConnect(duckdb::duckdb(), dbdir = ":memory:")

library("Eunomia")
Eunomia::exportToCsv("D:/OHDSI/EunomiaCsv")

DBI::dbExecute(connection, "CREATE TABLE PERSON AS SELECT * FROM read_csv_auto('D:/OHDSI/EunomiaCsv/PERSON.csv');")

Unfortunately, that didn't work. The other test we can do is to use ETL-Synthea to build directly with duckdb instead of my typical postgres setup. I won't have time to try that until next week at the earliest, but if someone else wants to give it a go feel free.
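One possible workaround worth noting (a hedged sketch I have not verified against the Eunomia CSVs): duckdb's `read_csv` accepts an explicit `columns` argument, so types can be forced rather than inferred. The columns listed here are illustrative, not the full PERSON schema:

```r
# Sketch: force column types instead of relying on read_csv_auto's inference.
# The column list must match the CSV exactly; this is not the full schema.
DBI::dbExecute(connection, "
  CREATE TABLE PERSON AS
  SELECT * FROM read_csv('D:/OHDSI/EunomiaCsv/PERSON.csv',
                         header = TRUE,
                         columns = {'person_id': 'INTEGER',
                                    'year_of_birth': 'INTEGER',
                                    'birth_datetime': 'TIMESTAMP'});
")
```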

ablack3 commented 2 years ago

I'll give it a try and create a draft PR for review.

edward-burn commented 2 years ago

@fdefalco I think going via readr does a better job of inferring columns:

DBI::dbWithTransaction(connection, {
  DBI::dbWriteTable(connection, 
                    "person",
                    readr::read_csv("D:/OHDSI/EunomiaCsv/PERSON.csv",
                                    col_types = cols()),
                    overwrite = TRUE
  )
})
dplyr::tbl(connection, "person")


fdefalco commented 2 years ago

That's great, how would we do the following?

DatabaseConnector::querySql(connection, "Select * from PERSON")

fdefalco commented 2 years ago

I created the following gist:

https://gist.github.com/fdefalco/288a7a722392dabdd94925a56239d2ca

ablack3 commented 1 year ago

Just updating that Ed and I have been using a duckdb version of Eunomia and the date type support is really nice. SqlRender now supports it; DatabaseConnector does not yet, though, so we are querying it using DBI. The one pain point has been that the duckdb file format changes with each release until they reach version 1.0, so the duckdb dataset I created will only work with a specific version of duckdb (the latest, v0.6). Once duckdb gets to v1.0 the format will be stable.

ablack3 commented 1 year ago

That's great, how would we do the following?

DatabaseConnector::querySql(connection,"Select * from PERSON")

Should be possible with this PR. https://github.com/OHDSI/DatabaseConnector/pull/213

library(DatabaseConnector)
conn <- DBI::dbConnect(duckdb::duckdb(), ":memory:")
DBI::dbWriteTable(conn, "cars", cars)
df <- querySql(conn, "select * from cars")
head(df)
#>   SPEED DIST
#> 1     4    2
#> 2     4   10
#> 3     7    4
#> 4     7   22
#> 5     8   16
#> 6     9   10

executeSql(conn, "
    CREATE TABLE test (column1 integer, column2 varchar);
    INSERT INTO test VALUES (1, 'a');
    INSERT INTO test VALUES (2, 'b');
  ")
#> Executing SQL took 0.00265 secs

df <- querySql(conn, "select * from test")
head(df)
#>   COLUMN1 COLUMN2
#> 1       1       a
#> 2       2       b

DBI::dbDisconnect(conn)

Created on 2022-11-29 with reprex v2.0.2