darwin-eu-dev / PatientProfiles

https://darwin-eu-dev.github.io/PatientProfiles/
Apache License 2.0

Should addConceptIntersect generate a cohort or subset the cdm tables with concepts of interest #551

Closed edward-burn closed 3 months ago

edward-burn commented 3 months ago

@catalamarti At the moment, as I understand it, addConceptIntersect creates a concept-based cohort and then uses addCohortIntersect to add the variables. However, this means that some of the results might not be what I would expect if I didn't know this is what is going on behind the scenes.

Although I think this works for addConceptIntersectFlag, I'm not sure about the others, for example addConceptIntersectCount. I would have expected it to give a count of the records that contain my concepts of interest. Instead it gives a count of the cohort entries created using my concepts, which won't necessarily be the same. In the case below, the person has five drug exposures, but two of them start on the same day, so when the cohort is made they only have four cohort records.

library(CDMConnector)
#> Warning: package 'CDMConnector' was built under R version 4.2.3
library(dplyr)
#> Warning: package 'dplyr' was built under R version 4.2.3
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(CodelistGenerator)
library(IncidencePrevalence)
library(PatientProfiles)

con <- DBI::dbConnect(duckdb::duckdb(),
                      dbdir = CDMConnector::eunomia_dir())
cdm <- CDMConnector::cdm_from_con(con,
                                  cdm_schema = "main",
                                  write_schema = "main")

cdm <- generateDenominatorCohortSet(cdm, name = "denominator")
#> Loading required namespace: testthat
#> ℹ Creating denominator cohorts
#> ✔ Cohorts created in 0 min and 2 sec

acetaminophen_cs <- getDrugIngredientCodes(cdm = cdm, 
                                           name = c("acetaminophen"))

cdm$denominator <- cdm$denominator %>% 
  left_join(cdm$drug_exposure %>% 
              inner_join(cdm$denominator %>% 
                           select("person_id"="subject_id") %>% 
                           distinct()) %>% 
              filter(drug_concept_id %in% !!acetaminophen_cs[[1]]) %>% 
              group_by(person_id) %>% 
              tally(name = "acetaminophen_any_time") %>% 
              rename("subject_id"="person_id"))
#> Joining with `by = join_by(person_id)`
#> Joining with `by = join_by(subject_id)`

cdm$denominator <- cdm$denominator %>% 
  addConceptIntersectCount(acetaminophen_cs, 
                           window = c(-Inf, Inf))

cdm$denominator %>% 
  filter(subject_id == 3019) %>% 
  dplyr::glimpse()
#> Rows: ??
#> Columns: 6
#> Database: DuckDB v0.9.2 [eburn@Windows 10 x64:R 4.2.1/C:\Users\eburn\AppData\Local\Temp\RtmpwZOw0B\file9407af271ee.duckdb]
#> $ cohort_definition_id      <int> 1
#> $ subject_id                <int> 3019
#> $ cohort_start_date         <date> 1977-05-04
#> $ cohort_end_date           <date> 2019-05-15
#> $ acetaminophen_any_time    <dbl> 5
#> $ acetaminophen_minf_to_inf <dbl> 4

cdm$drug_exposure %>% 
  filter(person_id == 3019,
         drug_concept_id %in% !!acetaminophen_cs[[1]]) 
#> # Source:   SQL [5 x 23]
#> # Database: DuckDB v0.9.2 [eburn@Windows 10 x64:R 4.2.1/C:\Users\eburn\AppData\Local\Temp\RtmpwZOw0B\file9407af271ee.duckdb]
#>   drug_exposure_id person_id drug_concept_id drug_exposure_start_date
#>              <int>     <int>           <int> <date>                  
#> 1            36585      3019         1127433 1991-08-19              
#> 2            36588      3019         1127433 2001-03-01              
#> 3            36589      3019         1127433 1992-04-15              
#> 4            36590      3019        40162522 1992-04-15              
#> 5            36591      3019         1127433 1995-03-01              
#> # ℹ 19 more variables: drug_exposure_start_datetime <dttm>,
#> #   drug_exposure_end_date <date>, drug_exposure_end_datetime <dttm>,
#> #   verbatim_end_date <date>, drug_type_concept_id <int>, stop_reason <chr>,
#> #   refills <int>, quantity <dbl>, days_supply <int>, sig <chr>,
#> #   route_concept_id <int>, lot_number <chr>, provider_id <int>,
#> #   visit_occurrence_id <int>, visit_detail_id <int>, drug_source_value <chr>,
#> #   drug_source_concept_id <int>, route_source_value <chr>, …

Created on 2024-03-19 by the reprex package (v2.0.1)
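
The 5 versus 4 discrepancy can be illustrated without a database. This is only a sketch in base R: it copies the five start dates from the output above and assumes that the concept-based cohort merges records starting on the same day into one entry. The real cohort construction may also depend on end dates, which are not visible in the truncated output.

```r
# The five acetaminophen records for person 3019, copied from the output above
exposures <- data.frame(
  drug_exposure_id = c(36585, 36588, 36589, 36590, 36591),
  drug_concept_id  = c(1127433, 1127433, 1127433, 40162522, 1127433),
  start_date       = as.Date(c("1991-08-19", "2001-03-01", "1992-04-15",
                               "1992-04-15", "1995-03-01"))
)

# Record-level count: one per row in drug_exposure
record_count <- nrow(exposures)

# Assumption: records starting on the same day collapse into one cohort entry,
# so the cohort-level count is the number of distinct start dates
cohort_entry_count <- length(unique(exposures$start_date))

print(c(records = record_count, cohort_entries = cohort_entry_count))
```

Records 36589 and 36590 share the start date 1992-04-15, which is why the two counts differ by one.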

edward-burn commented 3 months ago

I'm wondering why we use a cohort rather than a table intersect, that is, subsetting the tables to the concepts of interest. I think the behaviour of the latter would be more in line with what I, at least, would expect (giving a count of 5 rather than 4 in the case above).
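
A rough sketch of what I mean by subsetting the table, in base R on local data frames. `countConceptRecords` is a hypothetical helper, not a PatientProfiles function, and the extra row with concept 999 is made up purely to show the subsetting step. The window is assumed to be in days relative to cohort_start_date.

```r
# Hypothetical helper: count records of the concepts of interest per cohort row
countConceptRecords <- function(cohort, records, conceptIds,
                                window = c(-Inf, Inf)) {
  # 1. Subset the table to the concepts of interest
  records <- records[records$drug_concept_id %in% conceptIds, , drop = FALSE]
  # 2. For each cohort row, count that person's records inside the window
  cohort$record_count <- vapply(seq_len(nrow(cohort)), function(i) {
    dates <- records$drug_exposure_start_date[
      records$person_id == cohort$subject_id[i]
    ]
    # Days between each record and the cohort start date
    offset <- as.numeric(dates) - as.numeric(cohort$cohort_start_date[i])
    sum(offset >= window[1] & offset <= window[2])
  }, integer(1))
  cohort
}

# Cohort row and records for person 3019, copied from the reprex output
cohort <- data.frame(
  subject_id        = 3019,
  cohort_start_date = as.Date("1977-05-04")
)
records <- data.frame(
  person_id       = 3019,
  drug_concept_id = c(1127433, 1127433, 1127433, 40162522, 1127433, 999),
  drug_exposure_start_date = as.Date(c("1991-08-19", "2001-03-01",
                                       "1992-04-15", "1992-04-15",
                                       "1995-03-01", "1995-03-01"))
)

result <- countConceptRecords(cohort, records,
                              conceptIds = c(1127433, 40162522))
print(result$record_count)
```

Because every matching record is counted on its own, the two exposures on 1992-04-15 both contribute to the total.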

catalamarti commented 3 months ago

happy with that :)

catalamarti commented 3 months ago

#575