Open JTBrash opened 1 week ago
Hi @JTBrash,
Keep in mind that a cohort table must have four columns, cohort_definition_id, subject_id, cohort_start_date, cohort_end_date. It can also have additional columns like age_group, sex, or other variables.
The table called "test" will have a mixture of columns from two cohort tables. It will have one subject_id column but two cohort_definition_id column, two start dates, and two end dates. So it won't really be a valid cohort table.
Perhaps what you intend here is to filter one cohort to just the subjects that in another cohort. You can do this with a "semi join". Here is some example code.
# Create CDM object outside of this script
library(CDMConnector)
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
con <- DBI::dbConnect(duckdb::duckdb(), eunomia_dir())
cdm <- cdm_from_con(con, "main", "main")
cohort_set <- read_cohort_set(system.file("cohorts3", package = "CDMConnector"))
cdm <- generate_cohort_set(cdm, cohort_set)
#> ℹ Generating 5 cohorts
#> ℹ Generating cohort (1/5) - gibleed_all_end_10
#> ✔ Generating cohort (1/5) - gibleed_all_end_10 [127ms]
#>
#> ℹ Generating cohort (2/5) - gibleed_all
#> ✔ Generating cohort (2/5) - gibleed_all [89ms]
#>
#> ℹ Generating cohort (3/5) - gibleed_default_with_descendants
#> ✔ Generating cohort (3/5) - gibleed_default_with_descendants [67ms]
#>
#> ℹ Generating cohort (4/5) - gibleed_default
#> ✔ Generating cohort (4/5) - gibleed_default [63ms]
#>
#> ℹ Generating cohort (5/5) - gibleed_end_10
#> ✔ Generating cohort (5/5) - gibleed_end_10 [69ms]
#>
cohortCount(cdm$cohort)
#> # A tibble: 5 × 3
#> cohort_definition_id number_records number_subjects
#> <int> <int> <int>
#> 1 1 479 479
#> 2 2 479 479
#> 3 3 479 479
#> 4 4 479 479
#> 5 5 479 479
cdm$cohort1 <- cdm$cohort %>%
filter(cohort_definition_id == 1) %>%
compute(name = "cohort1")
cdm$cohort2 <- cdm$cohort %>%
filter(cohort_definition_id == 1) %>%
compute(name = "cohort2")
cdm
#>
#> ── # OMOP CDM reference (duckdb) of Synthea synthetic health database ──────────
#> • omop tables: person, observation_period, visit_occurrence, visit_detail,
#> condition_occurrence, drug_exposure, procedure_occurrence, device_exposure,
#> measurement, observation, death, note, note_nlp, specimen, fact_relationship,
#> location, care_site, provider, payer_plan_period, cost, drug_era, dose_era,
#> condition_era, metadata, cdm_source, concept, vocabulary, domain,
#> concept_class, concept_relationship, relationship, concept_synonym,
#> concept_ancestor, source_to_concept_map, drug_strength
#> • cohort tables: cohort, cohort1, cohort2
#> • achilles tables: -
#> • other tables: -
Notice that the inner join here will create columns ending in .x and .y so the result does not conform to the cohort table specification.
cdm$cohort1 %>%
inner_join(., cdm$cohort2, join_by(subject_id))
#> # Source: SQL [?? x 7]
#> # Database: DuckDB v1.0.0 [root@Darwin 23.0.0:R 4.3.1//private/var/folders/xx/01v98b6546ldnm1rg1_bvk000000gn/T/RtmpdvB5WQ/file49541602d007.duckdb]
#> cohort_definition_id.x subject_id cohort_start_date.x cohort_end_date.x
#> <int> <dbl> <date> <date>
#> 1 1 32 1987-06-09 1987-06-19
#> 2 1 364 1983-08-31 1983-09-10
#> 3 1 4376 1961-02-23 1961-03-05
#> 4 1 4551 2015-12-30 2016-01-09
#> 5 1 4755 1963-11-29 1963-12-09
#> 6 1 579 1999-11-06 1999-11-16
#> 7 1 1352 1989-09-23 1989-10-03
#> 8 1 2088 1945-04-25 1945-05-05
#> 9 1 2160 1998-09-23 1998-10-03
#> 10 1 4051 2009-10-14 2009-10-24
#> # ℹ more rows
#> # ℹ 3 more variables: cohort_definition_id.y <int>, cohort_start_date.y <date>,
#> # cohort_end_date.y <date>
cdm$test <- cdm$cohort1 %>%
semi_join(., cdm$cohort2, join_by(subject_id)) %>%
compute(name = "test")
cdm$test
#> # Source: table<test> [?? x 4]
#> # Database: DuckDB v1.0.0 [root@Darwin 23.0.0:R 4.3.1//private/var/folders/xx/01v98b6546ldnm1rg1_bvk000000gn/T/RtmpdvB5WQ/file49541602d007.duckdb]
#> cohort_definition_id subject_id cohort_start_date cohort_end_date
#> <int> <dbl> <date> <date>
#> 1 1 32 1987-06-09 1987-06-19
#> 2 1 364 1983-08-31 1983-09-10
#> 3 1 4376 1961-02-23 1961-03-05
#> 4 1 4551 2015-12-30 2016-01-09
#> 5 1 4755 1963-11-29 1963-12-09
#> 6 1 579 1999-11-06 1999-11-16
#> 7 1 1352 1989-09-23 1989-10-03
#> 8 1 2088 1945-04-25 1945-05-05
#> 9 1 2160 1998-09-23 1998-10-03
#> 10 1 4051 2009-10-14 2009-10-24
#> # ℹ more rows
cdm
#>
#> ── # OMOP CDM reference (duckdb) of Synthea synthetic health database ──────────
#> • omop tables: person, observation_period, visit_occurrence, visit_detail,
#> condition_occurrence, drug_exposure, procedure_occurrence, device_exposure,
#> measurement, observation, death, note, note_nlp, specimen, fact_relationship,
#> location, care_site, provider, payer_plan_period, cost, drug_era, dose_era,
#> condition_era, metadata, cdm_source, concept, vocabulary, domain,
#> concept_class, concept_relationship, relationship, concept_synonym,
#> concept_ancestor, source_to_concept_map, drug_strength
#> • cohort tables: cohort, cohort1, cohort2, test
#> • achilles tables: -
#> • other tables: -
cdmDisconnect(cdm)
Created on 2024-07-01 with reprex v2.1.0
Created on 2024-07-01 with reprex v2.1.0 If this isn't what you're intending will you post your expected cohort table and describe a bit more the motivation behind the join.
``Trying to filter a CDM cohort table by joining to another CDM cohort. The issue is that when I join one cohort definition to the second, the class of the cohort table switches to 'other tables' when I call cdm.
The below code reproduces the error (with a cdm object created using our Snowflake database).