OHDSI / CohortGenerator

An R package for instantiating cohorts using data in the CDM.
https://ohdsi.github.io/CohortGenerator/
11 stars 10 forks source link

Cohort templates #134

Open azimov opened 3 months ago

azimov commented 3 months ago

See issue #133

TODO: *Vignette on implementation (will wait for agreement on approach before writing)

azimov commented 3 months ago

@mdlavallee92 I would appreciate your input here, from the perspective of Capr which could be used to generate templates for broad, highly sensitive phenotypes.

mdlavallee92 commented 2 months ago

Thanks for tagging me to the review @azimov. In terms of a code review, I will admit there is not much I can contribute since this falls outside of the Capr realm and my grasp of R6 is limited. Most of my comments are philosophical wrt the purpose of the tool.

Conceptual Understanding If I understand this correctly, essentially this would be like helpers to instantiate a set of cohorts based on a vocabulary (i.e. all rxNorm Ingredients or all ATC 4th class). So, the workflow would be like build Circe study pop plus all drug ingredient cohorts in the database based on rxNorm (for example). What is the motivation for templating out ingredient cohorts? Are you using them for characterization or modelling features or both? Also how does this differentiate from using either of the eras tables in the cdm? If characterizing, an approach would be to use concept sets instead of cohorts.

Efficiency Using the vocabularies to build simple cohorts up front is a far more efficient than using Capr (since I am not building all these vocab based cohorts on 'n' CIRCE sql). I think it would be something I would use.

Other templates My other comment is towards intake of other cohort templates. Right now, you have three examples on clinical domains. A common one I encounter at my company are observation period cohorts for denominators in a basic prevalence calculation (say all males on year 2010 of database). Using non-circe variations of these cohorts would be very handy and also de-clutter an ATLAS instance 😏. Could you provide a road map for how to incorporate other cohort templates?

Non-CIRCE workflow Final comment is more general. I am softening my stance towards using non-Circe cohorts, particularly in internal studies. While my preferred approach is to always be in CIRCE for its reproducibility attributes, there are many times I have to build non-trivial subsets on say measurements or a cdm domain not covered in circe (ie. episode or cost). Generally, my approach for an OHDSI study is that the study population (target) should be in Circe because it is something I can reproduce across different studies, iterate on changes with colleagues via ATLAS and I have an auditable file via json. While in an internal study in theory you can approach the CDM in any way you want, a more principled approach is to treat any study as if it were a network study. Having said this it becomes a real headache to keep track of all these non-CIRCE cohorts. The best I can come up with is do custom sql magic and then reroute this work into the nx4 cohort table structure to access this cohort (subset or ow) later on in my study. So, I appreciate ways to standardize some of these non-Circe routes of cohort building as you have done here.

azimov commented 2 months ago

My other comment is towards intake of other cohort templates. Right now, you have three examples on clinical domains. A common one I encounter at my company are observation period cohorts for denominators in a basic prevalence calculation (say all males on year 2010 of database). Using non-circe variations of these cohorts would be very handy and also de-clutter an ATLAS instance 😏. Could you provide a road map for how to incorporate other cohort templates?

I would say that this is already possible with subset operations included in this repository (saving, loading and using in strategus designs is already well supported too).

For your example question:

library(CohortGenerator)

cohortDefinitionSet <- ROhdsiWebApi::getMyCohorts(...) # could be the target cohorts in this PR

subsetDefinition <- createSubsetDefinition(
    name = "MALES, in 2010",
    definitionId = 1,
    subsetOperations = list(
       # Applied in order
       createLimitSubset(
           name = "Year 2010",
           calendarStartDate = "2010/1/1",
           calendarEndDate = "2010/12/31"
       ),
       createDemographicSubset(
          name = ">=18",
          ageMin = 18
        )
    )
)

cohortDefinitionSet <- cohortDefinitionSet |>
  addCohortSubsetDefinition(subsetDef, targetCohortIds = c(1,2,3,4)) # If not specified, it will apply to all

generateCohortDefinitionSet(...)

This will save you a lot of time and keep clutter out of atlas.

azimov commented 2 months ago

Conceptual Understanding If I understand this correctly, essentially this would be like helpers to instantiate a set of cohorts based on a vocabulary (i.e. all rxNorm Ingredients or all ATC 4th class). So, the workflow would be like build Circe study pop plus all drug ingredient cohorts in the database based on rxNorm (for example).

I definitely see the benefits of getting characteristics for concept sets, but the motivation here is really to have cohorts generated from pure SQL to be treated in the same manner as all other cohorts in packages. At the moment we are doing sql in a lot of places anyway. Personally I have used a lot of hacks to get this done and I want an approach that is cleaner and re-usable in any tools.

What is the motivation for templating out ingredient cohorts?

We are using these in our comparator selection tool and other places to do things at scale. The logic is a bit more than the drug eras table (for example, requiring prior observation).

Using the vocabularies to build simple cohorts up front is a far more efficient than using Capr (since I am not building all these vocab based cohorts on 'n' CIRCE sql). I think it would be something I would use.

This is good to know. In a dream world I would like to go from a templated cohort to a circe/capr compatable definition with a work flow being like:

1. Template a bunch of sensitive cohorts
2. Evaluate which is best, propose potential changes
3. Extract circe definition
4. Add extra features (e.g. relevant measurement, procedures or other codes unrelated to primary diagnosis)
5. Publish to atlas as phenotype
mdlavallee92 commented 2 months ago

My other comment is towards intake of other cohort templates. Right now, you have three examples on clinical domains. A common one I encounter at my company are observation period cohorts for denominators in a basic prevalence calculation (say all males on year 2010 of database). Using non-circe variations of these cohorts would be very handy and also de-clutter an ATLAS instance 😏. Could you provide a road map for how to incorporate other cohort templates?

I would say that this is already possible with subset operations included in this repository (saving, loading and using in strategus designs is already well supported too).

For your example question:

library(CohortGenerator)

cohortDefinitionSet <- ROhdsiWebApi::getMyCohorts(...) # could be the target cohorts in this PR

subsetDefinition <- createSubsetDefinition(
    name = "MALES, in 2010",
    definitionId = 1,
    subsetOperations = list(
       # Applied in order
       createLimitSubset(
           name = "Year 2010",
           calendarStartDate = "2010/1/1",
           calendarEndDate = "2010/12/31"
       ),
       createDemographicSubset(
          name = ">=18",
          ageMin = 18
        )
    )
)

cohortDefinitionSet <- cohortDefinitionSet |>
  addCohortSubsetDefinition(subsetDef, targetCohortIds = c(1,2,3,4)) # If not specified, it will apply to all

generateCohortDefinitionSet(...)

This will save you a lot of time and keep clutter out of atlas.

I meant the other way around. denominator is the Males in 2010 for the entire db and the numerator is the study population. Can CG build a set of year, gender, race cohorts with no index event as well? The term "cohort" is used liberally here, more like id gender race year counts...could be out of scope of CG.

But yes, the subsetting you outline here is highly useful and have done it within studies.

azimov commented 2 months ago

I meant the other way around. denominator is the Males in 2010 for the entire db and the numerator is the study population. Can CG build a set of year, gender, race cohorts with no index event as well? The term "cohort" is used liberally here, more like id gender race year counts...could be out of scope of CG.

This could be done with templates, but it might make your DBA sweat as storing all demographics for males in a given year could be huge percentage of peeople, so I would say that it would be much better to just generate counts in this context (Patrick will have a lot to say to you about computing incidence rates in this sort of population though).

Generally something like this should live in the CohortIncidence package but there we're trying to be quite specific here - for meaningful measures you want a target population and an outcome as well as a time at risk.