OHDSI / Andromeda

AsynchroNous Disk-based Representation of MassivE DAta: An R package aimed at replacing ff for storing large data objects.
https://ohdsi.github.io/Andromeda/
11 stars 10 forks source link

Re-export functions from the dm (data models) package #13

Closed ablack3 closed 2 years ago

ablack3 commented 3 years ago

The dm package provides a unified interface to relational data models in R (I think in an analogous way to DBI providing a unified interface to databases). Investigate the possibility integrating the dm package into Andromeda.

ablack3 commented 2 years ago

An example of a possible "10 minute rule" example with the dm package. (Data to something interesting in 10 minutes or less)

library(dm)
library(dplyr)
library(ggplot2)

# Using the RPostgres driver backend
con <- DBI::dbConnect(RPostgres::Postgres(),
                      dbname = "covid",
                      user = "postgres",
                      password = "",
                      port = 5432,
                      host = "localhost")

# create a cdm reference using the dm package
cdm <- dm_from_src(con, schema = "cdm5")
#> Keys queried successfully, use `learn_keys = TRUE` to mute this message.

# The object can be used like a list of table references
cdm$condition_occurrence
#> # Source:   table<"cdm5"."condition_occurrence"> [?? x 15]
#> # Database: postgres [postgres@localhost:5432/covid]
#>    condition_occurrence_id person_id condition_concept_id condition_start_date
#>                    <int64>   <int64>                <int> <date>              
#>  1                       1         1               437663 2020-03-06          
#>  2                       2         1               378253 2020-03-06          
#>  3                       3         1               254761 2020-03-06          
#>  4                       4         1              4223659 2020-03-06          
#>  5                       5         1             37311061 2020-03-06          
#>  6                       6         1              4289517 2020-03-06          
#>  7                       7         2             37311061 2020-03-08          
#>  8                       8         2               434490 2020-03-08          
#>  9                       9         2               437663 2020-03-08          
#> 10                      10         3               437390 2020-02-12          
#> # … with more rows, and 11 more variables: condition_start_datetime <dttm>,
#> #   condition_end_date <date>, condition_end_datetime <dttm>,
#> #   condition_type_concept_id <int>, stop_reason <chr>, provider_id <int64>,
#> #   visit_occurrence_id <int64>, condition_source_value <chr>,
#> #   condition_source_concept_id <int>, condition_status_source_value <chr>,
#> #   condition_status_concept_id <int>

# We can use this along with dplyr/tidyverse to go from data to result very quickly
cdm$concept %>% 
  filter(concept_id == 437663) %>% # fever
  inner_join(cdm$concept_ancestor, by = c("concept_id" = "ancestor_concept_id")) %>% # add descendants
  inner_join(cdm$condition_occurrence, by = c("concept_id" = "condition_concept_id")) %>% 
  inner_join(cdm$person, by = "person_id") %>% 
  distinct(person_id, year_of_birth) %>% 
  count(year_of_birth) %>% # aggregation occurs in the database
  collect() %>% # bring counts into R
  mutate_all(as.integer) %>% # convert int64 to integer
  ggplot(aes(year_of_birth, n)) +
  geom_bar(stat = "identity") +
  ggtitle("Age distribution of patients who have ever had a fever")

Created on 2021-12-21 by the reprex package (v2.0.1)

ablack3 commented 2 years ago

I explored implementing Andromeda as an extension of dm objects. There are a lot of overlaps between dm and Andromeda. Both are essentially a list of references to tables in a database. However there are some key differences.

con <- duckdb::dbConnect(duckdb::duckdb(), dbdir = tempfile(fileext = ".duckdb"))
a <- dm::dm_from_src(src = con, learn_keys = FALSE)
a <- dm::dm_add_tbl(a, cars = cars)
DBI::dbListTables(con)
#> character(0)
# cars is not copied to the database

# Explicity copy the data first
DBI::dbWriteTable(con, "cars_db", cars)
a <- dm::dm_from_src(src = con, learn_keys = FALSE)
a
#> ── Table source ────────────────────────────────────────────────────────────────
#> src:  duckdb_connection
#> ── Metadata ────────────────────────────────────────────────────────────────────
#> Tables: `cars_db`
#> Columns: 2
#> Primary keys: 0
#> Foreign keys: 0

DBI::dbWriteTable(con, "cars_db3", cars)
a <- dm::dm_add_tbl(a, cars3 = dplyr::tbl(con, "cars_db3"))
a 
#> ── Table source ────────────────────────────────────────────────────────────────
#> src:  duckdb_connection
#> ── Metadata ────────────────────────────────────────────────────────────────────
#> Tables: `cars_db`, `cars3`
#> Columns: 4
#> Primary keys: 0
#> Foreign keys: 0

Created on 2021-12-29 by the reprex package (v2.0.1)

ablack3 commented 2 years ago

I committed my work on this to the dm branch. It seems like Andromeda could be based on dm. After talking with Martijn though I think Andromeda's main goal is to replace ff for larger than memory dataframes and not really to be treated as a relational database even though it is implemented using a database.