darwin-eu / CDMConnector

A pipe friendly way to interact with an OMOP Common Data Model
https://darwin-eu.github.io/CDMConnector/
Apache License 2.0
13 stars 9 forks source link

Modifying cohort table with a join changes the cohort table class #21

Open JTBrash opened 1 week ago

JTBrash commented 1 week ago

``Trying to filter a CDM cohort table by joining to another CDM cohort. The issue is that when I join one cohort definition to the second, the class of the cohort table switches to 'other tables' when I call cdm.

The below code reproduces the error (with a cdm object created using our Snowflake database).

# Create CDM object outside of this script
cohort_set <- read_cohort_set(system.file("cohorts2", package = "CDMConnector"))

cdm <- generate_cohort_set(cdm, cohort_set)
cdm$cohort1 <- cdm$cohort %>% filter(cohort_definition_id == 1) %>% compute()
cdm$cohorta <- cdm$cohort %>% filter(cohort_definition_id == 1) %>% compute()

cdm$test <- cdm$cohort1 %>% inner_join(., cdm$cohorta, join_by(subject_id))

cdm
class(cdm$test)
ablack3 commented 6 days ago

Hi @JTBrash,

Keep in mind that a cohort table must have four columns, cohort_definition_id, subject_id, cohort_start_date, cohort_end_date. It can also have additional columns like age_group, sex, or other variables.

The table called "test" will have a mixture of columns from two cohort tables. It will have one subject_id column but two cohort_definition_id column, two start dates, and two end dates. So it won't really be a valid cohort table.

Perhaps what you intend here is to filter one cohort to just the subjects that in another cohort. You can do this with a "semi join". Here is some example code.

# Create CDM object outside of this script
library(CDMConnector)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
con <- DBI::dbConnect(duckdb::duckdb(), eunomia_dir())
cdm <- cdm_from_con(con, "main", "main")

cohort_set <- read_cohort_set(system.file("cohorts3", package = "CDMConnector"))

cdm <- generate_cohort_set(cdm, cohort_set)
#> ℹ Generating 5 cohorts
#> ℹ Generating cohort (1/5) - gibleed_all_end_10
#> ✔ Generating cohort (1/5) - gibleed_all_end_10 [127ms]
#> 
#> ℹ Generating cohort (2/5) - gibleed_all
#> ✔ Generating cohort (2/5) - gibleed_all [89ms]
#> 
#> ℹ Generating cohort (3/5) - gibleed_default_with_descendants
#> ✔ Generating cohort (3/5) - gibleed_default_with_descendants [67ms]
#> 
#> ℹ Generating cohort (4/5) - gibleed_default
#> ✔ Generating cohort (4/5) - gibleed_default [63ms]
#> 
#> ℹ Generating cohort (5/5) - gibleed_end_10
#> ✔ Generating cohort (5/5) - gibleed_end_10 [69ms]
#> 

cohortCount(cdm$cohort)
#> # A tibble: 5 × 3
#>   cohort_definition_id number_records number_subjects
#>                  <int>          <int>           <int>
#> 1                    1            479             479
#> 2                    2            479             479
#> 3                    3            479             479
#> 4                    4            479             479
#> 5                    5            479             479

cdm$cohort1 <- cdm$cohort %>% 
  filter(cohort_definition_id == 1) %>% 
  compute(name = "cohort1")

cdm$cohort2 <- cdm$cohort %>% 
  filter(cohort_definition_id == 1) %>% 
  compute(name = "cohort2")

cdm
#> 
#> ── # OMOP CDM reference (duckdb) of Synthea synthetic health database ──────────
#> • omop tables: person, observation_period, visit_occurrence, visit_detail,
#> condition_occurrence, drug_exposure, procedure_occurrence, device_exposure,
#> measurement, observation, death, note, note_nlp, specimen, fact_relationship,
#> location, care_site, provider, payer_plan_period, cost, drug_era, dose_era,
#> condition_era, metadata, cdm_source, concept, vocabulary, domain,
#> concept_class, concept_relationship, relationship, concept_synonym,
#> concept_ancestor, source_to_concept_map, drug_strength
#> • cohort tables: cohort, cohort1, cohort2
#> • achilles tables: -
#> • other tables: -

Notice that the inner join here will create columns ending in .x and .y so the result does not conform to the cohort table specification.


cdm$cohort1 %>% 
  inner_join(., cdm$cohort2, join_by(subject_id))
#> # Source:   SQL [?? x 7]
#> # Database: DuckDB v1.0.0 [root@Darwin 23.0.0:R 4.3.1//private/var/folders/xx/01v98b6546ldnm1rg1_bvk000000gn/T/RtmpdvB5WQ/file49541602d007.duckdb]
#>    cohort_definition_id.x subject_id cohort_start_date.x cohort_end_date.x
#>                     <int>      <dbl> <date>              <date>           
#>  1                      1         32 1987-06-09          1987-06-19       
#>  2                      1        364 1983-08-31          1983-09-10       
#>  3                      1       4376 1961-02-23          1961-03-05       
#>  4                      1       4551 2015-12-30          2016-01-09       
#>  5                      1       4755 1963-11-29          1963-12-09       
#>  6                      1        579 1999-11-06          1999-11-16       
#>  7                      1       1352 1989-09-23          1989-10-03       
#>  8                      1       2088 1945-04-25          1945-05-05       
#>  9                      1       2160 1998-09-23          1998-10-03       
#> 10                      1       4051 2009-10-14          2009-10-24       
#> # ℹ more rows
#> # ℹ 3 more variables: cohort_definition_id.y <int>, cohort_start_date.y <date>,
#> #   cohort_end_date.y <date>

cdm$test <- cdm$cohort1 %>% 
  semi_join(., cdm$cohort2, join_by(subject_id)) %>% 
  compute(name = "test")

cdm$test
#> # Source:   table<test> [?? x 4]
#> # Database: DuckDB v1.0.0 [root@Darwin 23.0.0:R 4.3.1//private/var/folders/xx/01v98b6546ldnm1rg1_bvk000000gn/T/RtmpdvB5WQ/file49541602d007.duckdb]
#>    cohort_definition_id subject_id cohort_start_date cohort_end_date
#>                   <int>      <dbl> <date>            <date>         
#>  1                    1         32 1987-06-09        1987-06-19     
#>  2                    1        364 1983-08-31        1983-09-10     
#>  3                    1       4376 1961-02-23        1961-03-05     
#>  4                    1       4551 2015-12-30        2016-01-09     
#>  5                    1       4755 1963-11-29        1963-12-09     
#>  6                    1        579 1999-11-06        1999-11-16     
#>  7                    1       1352 1989-09-23        1989-10-03     
#>  8                    1       2088 1945-04-25        1945-05-05     
#>  9                    1       2160 1998-09-23        1998-10-03     
#> 10                    1       4051 2009-10-14        2009-10-24     
#> # ℹ more rows

cdm 
#> 
#> ── # OMOP CDM reference (duckdb) of Synthea synthetic health database ──────────
#> • omop tables: person, observation_period, visit_occurrence, visit_detail,
#> condition_occurrence, drug_exposure, procedure_occurrence, device_exposure,
#> measurement, observation, death, note, note_nlp, specimen, fact_relationship,
#> location, care_site, provider, payer_plan_period, cost, drug_era, dose_era,
#> condition_era, metadata, cdm_source, concept, vocabulary, domain,
#> concept_class, concept_relationship, relationship, concept_synonym,
#> concept_ancestor, source_to_concept_map, drug_strength
#> • cohort tables: cohort, cohort1, cohort2, test
#> • achilles tables: -
#> • other tables: -

cdmDisconnect(cdm)

Created on 2024-07-01 with reprex v2.1.0

Created on 2024-07-01 with reprex v2.1.0 If this isn't what you're intending will you post your expected cohort table and describe a bit more the motivation behind the join.