OHDSI / Andromeda

AsynchroNous Disk-based Representation of MassivE DAta: An R package aimed at replacing ff for storing large data objects.
https://ohdsi.github.io/Andromeda/
11 stars 9 forks source link

Extract contents of remote CDM into Andromeda and run achilles on the data in Andromeda #23

Closed ablack3 closed 2 years ago

ablack3 commented 3 years ago

This R script is a proof of concept for two pieces of functionality that Andromeda might support in the future.

  1. Extract a subset of the CDM or the full CDM into an Andromeda object
  2. Execute Achilles on the local CDM subset in the Andromeda object
library(Andromeda)
#> Loading required package: dplyr
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

# setup connection to a remote CDM
# For this example we will use Eunomia to simulate a remote CDM
remoteCdmConnectionDetails <- Eunomia::getEunomiaConnectionDetails()
remoteConnection <- DBI::dbConnect(RSQLite::SQLite(), remoteCdmConnectionDetails$server())

# create a 'dm' object that references the remote CDM
tablesToCopy <- DBI::dbListTables(remoteConnection)
remoteCdm <- dm::dm_from_src(remoteConnection, table_names = tablesToCopy)
#> Keys could not be queried, use `learn_keys = FALSE` to mute this message.

# create a new andromeda object that will recieve the remote data
localCdm <- andromeda()
localConnection <- DBI::dbConnect(RSQLite::SQLite(), localCdm@dbname)

# Copy all of tables referenced by the dm object into the andromeda object
# We should have the ability to subset to specific cohorts here
dm::copy_dm_to(localConnection, remoteCdm, temporary = FALSE)

# Save the local andromeda object but keep the connection open
saveAndromeda(localCdm, "localCdm.zip", maintainConnection = TRUE)

# Check to make sure all of the tables we want to be copied were copied
DBI::dbListTables(localConnection)
#>  [1] "CARE_SITE"             "CDM_SOURCE"            "COHORT"               
#>  [4] "COHORT_ATTRIBUTE"      "CONCEPT"               "CONCEPT_ANCESTOR"     
#>  [7] "CONCEPT_CLASS"         "CONCEPT_RELATIONSHIP"  "CONCEPT_SYNONYM"      
#> [10] "CONDITION_ERA"         "CONDITION_OCCURRENCE"  "COST"                 
#> [13] "DEATH"                 "DEVICE_EXPOSURE"       "DOMAIN"               
#> [16] "DOSE_ERA"              "DRUG_ERA"              "DRUG_EXPOSURE"        
#> [19] "DRUG_STRENGTH"         "FACT_RELATIONSHIP"     "LOCATION"             
#> [22] "MEASUREMENT"           "METADATA"              "NOTE"                 
#> [25] "NOTE_NLP"              "OBSERVATION"           "OBSERVATION_PERIOD"   
#> [28] "PAYER_PLAN_PERIOD"     "PERSON"                "PROCEDURE_OCCURRENCE" 
#> [31] "PROVIDER"              "RELATIONSHIP"          "SOURCE_TO_CONCEPT_MAP"
#> [34] "SPECIMEN"              "VISIT_DETAIL"          "VISIT_OCCURRENCE"     
#> [37] "VOCABULARY"            "sqlite_stat1"          "sqlite_stat4"

# Set up a connection details object that points to the sqlite database underlying Andromeda
# Perhaps Andromeda can help with this in the future
connectionDetails <- DatabaseConnector::createConnectionDetails("sqlite", server = localCdm@dbname)

# Run Achilles - use only a few analyses for this example
Achilles::achilles(connectionDetails = connectionDetails,
                   cdmDatabaseSchema = "main",
                   analysisIds = 0:3)
#> Connecting using SQLite driver
#> Connecting using SQLite driver
#> Connecting using SQLite driver
#> 2021-07-09 12:17:16  Beginning single-threaded execution
#> Connecting using SQLite driver
#>   |                                                                              |                                                                      |   0%  |                                                                              |===================================                                   |  50%  |                                                                              |======================================================================| 100%
#> Executing SQL took 0.00887 secs
#> Warning: The 'oracleTempSchema' argument is deprecated. Use 'tempEmulationSchema' instead.
#> This warning is displayed once every 8 hours.
#> 2021-07-09 12:17:16  Executing multiple queries. This could take a while
#> 2021-07-09 12:17:16  Analysis 0 (Source name) -- START
#>   |                                                                              |                                                                      |   0%  |                                                                              |==================                                                    |  25%  |                                                                              |===================================                                   |  50%  |                                                                              |====================================================                  |  75%  |                                                                              |======================================================================| 100%
#> Executing SQL took 0 secs
#> 2021-07-09 12:17:16  [Main Analysis] [COMPLETE] 0 (0.015632 secs)
#> 2021-07-09 12:17:16  Analysis 1 (Number of persons) -- START
#>   |                                                                              |                                                                      |   0%  |                                                                              |===================================                                   |  50%  |                                                                              |======================================================================| 100%
#> Executing SQL took 0.0125 secs
#> 2021-07-09 12:17:16  [Main Analysis] [COMPLETE] 1 (0.012519 secs)
#> 2021-07-09 12:17:16  Analysis 2 (Number of persons by gender) -- START
#>   |                                                                              |                                                                      |   0%  |                                                                              |===================================                                   |  50%  |                                                                              |======================================================================| 100%
#> Executing SQL took 0 secs
#> 2021-07-09 12:17:16  [Main Analysis] [COMPLETE] 2 (0.000000 secs)
#> 2021-07-09 12:17:16  Analysis 3 (Number of persons by year of birth) -- START
#>   |                                                                              |                                                                      |   0%  |                                                                              |===================================                                   |  50%  |                                                                              |======================================================================| 100%
#> Executing SQL took 0.0157 secs
#> 2021-07-09 12:17:16  [Main Analysis] [COMPLETE] 3 (0.015662 secs)
#> 2021-07-09 12:17:17  Merging scratch Achilles tables
#>   |                                                                              |                                                                      |   0%  |                                                                              |===================================                                   |  50%  |                                                                              |======================================================================| 100%
#> Executing SQL took 0.00792 secs
#>   |                                                                              |                                                                      |   0%  |                                                                              |===================================                                   |  50%  |                                                                              |======================================================================| 100%
#> Executing SQL took 0.00629 secs
#> 2021-07-09 12:17:17  Done. Achilles results can now be found in schema main
#> Connecting using SQLite driver
#>   |                                                                              |                                                                      |   0%  |                                                                              |                                                                      |   0%  |                                                                              |=========                                                             |  12%  |                                                                              |==================                                                    |  25%  |                                                                              |==========================                                            |  38%  |                                                                              |===================================                                   |  50%  |                                                                              |============================================                          |  62%  |                                                                              |====================================================                  |  75%  |                                                                              |=============================================================         |  88%  |                                                                              |======================================================================| 100%
#> Executing SQL took 0.028 secs

# View the Achilles results
DBI::dbGetQuery(localConnection, "select * from main.achilles_results limit 10")
#>    analysis_id stratum_1 stratum_2  stratum_3 stratum_4 stratum_5 count_value
#> 1            0               1.6.7 1625847436      <NA>      <NA>        2694
#> 2      2000000 0.02 secs      <NA>       <NA>      <NA>      <NA>           6
#> 3            1      <NA>      <NA>       <NA>      <NA>      <NA>        2694
#> 4      2000001 0.01 secs      <NA>       <NA>      <NA>      <NA>           6
#> 5            2    8507.0      <NA>       <NA>      <NA>      <NA>        1321
#> 6            2    8532.0      <NA>       <NA>      <NA>      <NA>        1373
#> 7      2000002    0 secs      <NA>       <NA>      <NA>      <NA>           6
#> 8            3    1909.0      <NA>       <NA>      <NA>      <NA>          19
#> 9            3    1910.0      <NA>       <NA>      <NA>      <NA>           6
#> 10           3    1911.0      <NA>       <NA>      <NA>      <NA>          14

# Save the local Andromeda object
saveAndromeda(localCdm, "localCdm.zip")
#> Disconnected Andromeda. This data object can no longer be used

DBI::dbDisconnect(remoteConnection)
DBI::dbDisconnect(localConnection)

Created on 2021-07-09 by the reprex package (v2.0.0)

Session info ``` r sessioninfo::session_info() #> - Session info --------------------------------------------------------------- #> setting value #> version R version 4.0.5 (2021-03-31) #> os Windows 10 x64 #> system x86_64, mingw32 #> ui RTerm #> language (EN) #> collate English_United States.1252 #> ctype English_United States.1252 #> tz America/New_York #> date 2021-07-09 #> #> - Packages ------------------------------------------------------------------- #> ! package * version date lib source #> Achilles 1.6.7 2021-07-09 [1] Github (OHDSI/Achilles@1db677b) #> Andromeda * 0.5.0 2021-07-02 [1] CRAN (R 4.0.5) #> assertthat 0.2.1 2019-03-21 [1] CRAN (R 4.0.5) #> backports 1.2.1 2020-12-09 [1] CRAN (R 4.0.3) #> bit 4.0.4 2020-08-04 [1] CRAN (R 4.0.5) #> bit64 4.0.5 2020-08-30 [1] CRAN (R 4.0.5) #> blob 1.2.1 2020-01-20 [1] CRAN (R 4.0.5) #> cachem 1.0.5 2021-05-15 [1] CRAN (R 4.0.5) #> cli 3.0.0 2021-06-30 [1] CRAN (R 4.0.5) #> crayon 1.4.1 2021-02-08 [1] CRAN (R 4.0.5) #> DatabaseConnector 4.0.2 2021-04-15 [1] CRAN (R 4.0.5) #> DBI 1.1.1 2021-01-15 [1] CRAN (R 4.0.5) #> dbplyr 2.1.1 2021-04-06 [1] CRAN (R 4.0.5) #> digest 0.6.27 2020-10-24 [1] CRAN (R 4.0.5) #> dm 0.2.3 2021-06-20 [1] CRAN (R 4.0.5) #> dplyr * 1.0.7 2021-06-18 [1] CRAN (R 4.0.5) #> ellipsis 0.3.2 2021-04-29 [1] CRAN (R 4.0.5) #> Eunomia 1.0.1 2020-11-04 [1] CRAN (R 4.0.5) #> evaluate 0.14 2019-05-28 [1] CRAN (R 4.0.5) #> fansi 0.5.0 2021-05-25 [1] CRAN (R 4.0.5) #> fastmap 1.1.0 2021-01-25 [1] CRAN (R 4.0.5) #> fs 1.5.0 2020-07-31 [1] CRAN (R 4.0.5) #> generics 0.1.0 2020-10-31 [1] CRAN (R 4.0.5) #> glue 1.4.2 2020-08-27 [1] CRAN (R 4.0.5) #> highr 0.9 2021-04-16 [1] CRAN (R 4.0.5) #> hms 1.1.0 2021-05-17 [1] CRAN (R 4.0.5) #> htmltools 0.5.1.1 2021-01-22 [1] CRAN (R 4.0.5) #> igraph 1.2.6 2020-10-06 [1] CRAN (R 4.0.5) #> knitr 1.33 2021-04-24 [1] CRAN (R 4.0.5) #> lifecycle 1.0.0 2021-02-15 [1] CRAN (R 4.0.5) #> magrittr 2.0.1 2020-11-17 [1] CRAN (R 4.0.5) #> memoise 2.0.0 2021-01-26 [1] CRAN (R 4.0.5) #> ParallelLogger 2.0.1 2020-10-26 [1] CRAN (R 4.0.5) #> pillar 1.6.1 2021-05-16 [1] CRAN (R 4.0.5) #> pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.0.5) #> purrr 0.3.4 2020-04-17 [1] CRAN (R 4.0.5) #> R6 2.5.0 2020-10-28 [1] CRAN (R 4.0.5) #> Rcpp 1.0.6 2021-01-15 [1] CRAN (R 4.0.5) #> reprex 2.0.0 2021-04-02 [1] CRAN (R 4.0.5) #> D rJava 1.0-4 2021-04-29 [1] CRAN (R 4.0.5) #> rjson 0.2.20 2018-06-08 [1] CRAN (R 4.0.3) #> rlang 0.4.11 2021-04-30 [1] CRAN (R 4.0.5) #> rmarkdown 2.9 2021-06-15 [1] CRAN (R 4.0.5) #> RSQLite 2.2.7 2021-04-22 [1] CRAN (R 4.0.5) #> rstudioapi 0.13 2020-11-12 [1] CRAN (R 4.0.5) #> sessioninfo 1.1.1 2018-11-05 [1] CRAN (R 4.0.5) #> SqlRender 1.7.0 2021-03-02 [1] CRAN (R 4.0.5) #> stringi 1.6.2 2021-05-17 [1] CRAN (R 4.0.5) #> stringr 1.4.0 2019-02-10 [1] CRAN (R 4.0.5) #> tibble 3.1.2 2021-05-16 [1] CRAN (R 4.0.5) #> tidyr 1.1.3 2021-03-03 [1] CRAN (R 4.0.5) #> tidyselect 1.1.1 2021-04-30 [1] CRAN (R 4.0.5) #> utf8 1.2.1 2021-03-12 [1] CRAN (R 4.0.5) #> vctrs 0.3.8 2021-04-29 [1] CRAN (R 4.0.5) #> withr 2.4.2 2021-04-18 [1] CRAN (R 4.0.5) #> xfun 0.24 2021-06-15 [1] CRAN (R 4.0.5) #> yaml 2.2.1 2020-02-01 [1] CRAN (R 4.0.5) #> zip 2.2.0 2021-05-31 [1] CRAN (R 4.0.5) #> #> [1] C:/Users/adam.DESKTOP-D3KQQA1/Documents/R/win-library/4.0 #> [2] C:/Program Files/R/R-4.0.5/library #> #> D -- DLL MD5 mismatch, broken installation. ```

This is quite clunky since it requires multiple connections and different types of objects to be created.

vojtechhuser commented 3 years ago

this is great. it gives connectionDetails for andromeda object. Very helpfull. Thank you. Maybe put into vignette as well. (not really an issue (problem)

schuemie commented 2 years ago

Why not use RSQLite for this? Andromeda is specifically aimed at storing large data objects as if they're in memory (e.g. changes are only persisted if you save the whole Andromeda object to file). RSQLite is a local database, already supported by DatabaseConnector and SqlRender.

ablack3 commented 2 years ago

Well yes it makes sense to use RSQLite or duckdb. I do like the dplyr based interface to a database provided by the dm package though and having an on disk, OMOP CDM aware database object that plays well with dplyr is what I was thinking Andromeda could become. By OMOP CDM aware I mean an object that has special methods that make querying a cdm very easy similar to the inspect_omop python package. However I get your point that the purpose of andromeda is more in line with ff and the use of a SQL database is an incidental implementation detail. RSQLite or duckdb combined with dm is probably a better fit for this idea. This idea is also getting at the 'cdm' level abstraction in R we've talked about a little bit (i.e. an object that represents a single CDM).

schuemie commented 2 years ago

I agree that the ability to use dplyr is nice. Integrating dplyr with DatabaseConnector, which would allow this for all database platforms, has been on my wishlist for years. I just never had the time to get that working.

ablack3 commented 2 years ago

I think this would be a good quest for me.