Closed ablack3 closed 2 years ago
An example of a possible "10 minute rule" example with the dm package. (Data to something interesting in 10 minutes or less)
library(dm)
library(dplyr)
library(ggplot2)
# Using the RPostgres driver backend
con <- DBI::dbConnect(RPostgres::Postgres(),
dbname = "covid",
user = "postgres",
password = "",
port = 5432,
host = "localhost")
# create a cdm reference using the dm package
cdm <- dm_from_src(con, schema = "cdm5")
#> Keys queried successfully, use `learn_keys = TRUE` to mute this message.
# The object can be used like a list of table references
cdm$condition_occurrence
#> # Source: table<"cdm5"."condition_occurrence"> [?? x 15]
#> # Database: postgres [postgres@localhost:5432/covid]
#> condition_occurrence_id person_id condition_concept_id condition_start_date
#> <int64> <int64> <int> <date>
#> 1 1 1 437663 2020-03-06
#> 2 2 1 378253 2020-03-06
#> 3 3 1 254761 2020-03-06
#> 4 4 1 4223659 2020-03-06
#> 5 5 1 37311061 2020-03-06
#> 6 6 1 4289517 2020-03-06
#> 7 7 2 37311061 2020-03-08
#> 8 8 2 434490 2020-03-08
#> 9 9 2 437663 2020-03-08
#> 10 10 3 437390 2020-02-12
#> # … with more rows, and 11 more variables: condition_start_datetime <dttm>,
#> # condition_end_date <date>, condition_end_datetime <dttm>,
#> # condition_type_concept_id <int>, stop_reason <chr>, provider_id <int64>,
#> # visit_occurrence_id <int64>, condition_source_value <chr>,
#> # condition_source_concept_id <int>, condition_status_source_value <chr>,
#> # condition_status_concept_id <int>
# We can use this along with dplyr/tidyverse to go from data to result very quickly
cdm$concept %>%
filter(concept_id == 437663) %>% # fever
inner_join(cdm$concept_ancestor, by = c("concept_id" = "ancestor_concept_id")) %>% # add descendants
inner_join(cdm$condition_occurrence, by = c("concept_id" = "condition_concept_id")) %>%
inner_join(cdm$person, by = "person_id") %>%
distinct(person_id, year_of_birth) %>%
count(year_of_birth) %>% # aggregation occurs in the database
collect() %>% # bring counts into R
mutate_all(as.integer) %>% # convert int64 to integer
ggplot(aes(year_of_birth, n)) +
geom_bar(stat = "identity") +
ggtitle("Age distribution of patients who have ever had a fever")
Created on 2021-12-21 by the reprex package (v2.0.1)
I explored implementing Andromeda as an extension of dm objects. There are a lot of overlaps between dm and Andromeda. Both are essentially a list of references to tables in a database. However there are some key differences.
con <- duckdb::dbConnect(duckdb::duckdb(), dbdir = tempfile(fileext = ".duckdb"))
a <- dm::dm_from_src(src = con, learn_keys = FALSE)
a <- dm::dm_add_tbl(a, cars = cars)
DBI::dbListTables(con)
#> character(0)
# cars is not copied to the database
# Explicity copy the data first
DBI::dbWriteTable(con, "cars_db", cars)
a <- dm::dm_from_src(src = con, learn_keys = FALSE)
a
#> ── Table source ────────────────────────────────────────────────────────────────
#> src: duckdb_connection
#> ── Metadata ────────────────────────────────────────────────────────────────────
#> Tables: `cars_db`
#> Columns: 2
#> Primary keys: 0
#> Foreign keys: 0
DBI::dbWriteTable(con, "cars_db3", cars)
a <- dm::dm_add_tbl(a, cars3 = dplyr::tbl(con, "cars_db3"))
a
#> ── Table source ────────────────────────────────────────────────────────────────
#> src: duckdb_connection
#> ── Metadata ────────────────────────────────────────────────────────────────────
#> Tables: `cars_db`, `cars3`
#> Columns: 4
#> Primary keys: 0
#> Foreign keys: 0
Created on 2021-12-29 by the reprex package (v2.0.1)
I committed my work on this to the dm branch. It seems like Andromeda could be based on dm. After talking with Martijn though I think Andromeda's main goal is to replace ff for larger than memory dataframes and not really to be treated as a relational database even though it is implemented using a database.
The dm package provides a unified interface to relational data models in R (I think in an analogous way to DBI providing a unified interface to databases). Investigate the possibility integrating the dm package into Andromeda.