OHDSI / CohortGenerator

An R package for instantiating cohorts using data in the CDM.
https://ohdsi.github.io/CohortGenerator/
11 stars 10 forks source link

VisitContext subset operator #93

Open chrisknoll opened 1 year ago

chrisknoll commented 1 year ago

Currently, in order to subset a cohort based on visit type (ER, inpatient, etc) we need to create different 'visit' cohorts and use cohortSubset operators on it. The problem is that making visit cohorts effectively copies the VISIT_OCCURRENCE table, which is bad.

The proposal here is to define a new 'visitContextSubset' operator that will take certain parameters (TBD) and yield the cohort subset based on the visit parameters provided.

azimov commented 1 year ago

This should be relatively straightforward.

An outline of what's needed is follows:

Out of scope for this package but of note with this implementation (or any expansion of functionality) is that it will break any Strategus packages that will load and we don't have management for how that's handled right now.

Happy to do this, but it might be better to get some knowledge transfer around this.

chrisknoll commented 1 year ago

Agreed with all except for the breaking packages part. My thoughts on this is that if we have a Strategus design that includes these new subset operators, prior Strategist designs should continue to function (I consider something 'beaking' when something that used to work no longer works). I could be misunderstanding your point, apologies if I do.

We spoke about the case where a new design with new options would not work running on an older version (because the older version would not handle the new options) and that trying to execute a 'buildDesign' script that calls out to new functions (such as createVisitContextSubset) against an old version of the package would also not work, but this is more a case of 'forward compatability' vs. 'backwards compatability'. I think we described a mechanism where we can execute the build design scripts under the context of a renv session that loads the proper dependencies such that we won't run into this problem, but we don't have a solution to that yet.

Onto the other questions you raised (and we should get some feedback from @gowthamrao since I think he would have the most benefit from this):

gowthamrao commented 1 year ago

Thank you for looping me into this important topic. Yes, creating cohorts of visits to subset is essentially copying full visit table records - not efficient. This would mean using atleast the visit_occurrence and visit_detail tables and corresponding _concept_id and _source_concept_id (i can explain why _source_concept_id in a different thread) as part of the cohortSubset input specifications (i.e. as a subset operator)

Use case: persons with anaphylaxis, subset to visit (inpatient, emergency room, doctor's office visit). Discussed here https://forums.ohdsi.org/t/phenotype-phebrurary-2023-p2-anaphylaxis/18193 . In this case, if we had taken a cohort of person evens with condition anaphyalxis, and subset by visitContext - we probably would have picked up the issue using the tool. I am exploring if this value proposition is true, by using visit cohort for now (which as describe above entails creating a cohort of all records in visit_occurrence table).

gowthamrao commented 1 year ago

(I know this request basically amounts to asking for circe logic embedded into the subset operator (or capR like logic). Just to be clear, I am NOT proposing that at all. Just sharing the use case)

In this second post, i will explain my rationale for asking for support of _source_concept_id fields in visit_occurrence (and visit_detail)

I need to identify visits that are TRUE 'Doctors office visit'. What I am trying to say here is that 'TRUE Doctors office visits' does not equate to 'Outpatient Visit'. Visits such as urgent care, or outpatient surgery, chemotherapy infusion visits are 'Outpatient Visit' but not the typical 'Doctors office visit' where we are going to a clinic.

One logical way to do this is: 1) Enter by these visit domain are more likely to represent such visits image

2) Apply inclusion criteria like this image

3) And exist like this (not use of visit_end_date to exit) image

But the following issues make generalization difficult, mostly driven by ETL conventions 1) Evolving concept representation of visit with EnM codes changing domains from procedure to visit https://forums.ohdsi.org/t/cpt-hierarchy-errors-lost-children-in-2023-and-changed-domains/18383/2 2) 'legacy' convention of using top tier concepts in visit_concept_id (only using inpatient, emergency room, inpatient and emergency room, outpatient) vs now the available granular 100s of standard concepts in visit domain 3) some non standardized convention of using visit_detail to populate these details, while keep the visit_occurrence lossy 4) having these details represented in visit_source_concept_id

An approach like this appears to provide universal solution -- hence the ask to support both visit_occurrence and visit_details + _concept_id and _source_concept_id

image

azimov commented 1 year ago

@gowthamrao Supporting the visit_source_concept_id in the spec would be fine, from an implementation perspective, but the parameterisation in this package will require the user to know the source concept ids upfront. I can also imagine wanting to exclude patients via visit any types too, so negation should be included.

How this list is maintained and used (e.g. in the phenotype library) is a different topic but it would absolutely be possible to create subsetting definitions that use a specific source code.

Even supporting a list of 100s of standard codes in this package seems tricky. I think for a first pass we will just support basic top level names and then require users to enter specific concept ids or source ids for more complex definitions.

gowthamrao commented 1 year ago

in this package will require the user to know the source concept ids upfront.

I think you are saying the user has to provide an array of conceptIds and that the CohortGenerator package would not maintain such an array of conceptId's. I think that is acceptable

For clarity - i mean to support the two conceptId fields visit_concept_id and visit_source_concept_id (not visit_source_code).

visit_occurrence.visit_concept_id visit_occurrence.visit_source_concept_id

visit_detail.visit_detail_concept_id visit_detail.visit_detail_source_concept_id

i dont know if we should support visit_occurrence.visit_type_concept_id visit_detail.visit_detail_type_concept_id

gowthamrao commented 1 year ago

Regarding visit_type_concept_id, i would like to request that we support visit_type_concept_id if possible.

Reason:

See this cohort definition https://atlas.ohdsi.org/#/cohortdefinition/348/definition that limits to visits with the diagnosis in primary position vs this that does not https://github.com/OHDSI/PhenotypeLibrary/blob/main/inst/cohorts/235.json

All other aspects are the same. Should we building and instantiating cohorts for everytime we want to see if the use of primary position makes any difference? Can we ask CohortGenerator to help us?

gowthamrao commented 1 year ago

Some additional thoughts based on recent discussion between @chrisknoll , @azimov and @anthonysena

  1. We agreed that it maybe useful to consider supporting a set of circe operations here similar to how Characterization in Atlas/webapi supports. We marked this a road map.
  2. For initial implementation - we agreed to support visit_occurrence.visit_concept_id based subset only.
gowthamrao commented 1 year ago

I requested a reindex operator here https://github.com/OHDSI/CohortGenerator/issues/94 but was wondering how to solve these:

Use case: i have a target cohort (anaphylaxis). I want to subset to inpatient visit. The above discussion would support it.

But sometimes, the diagnosis of anaphylaxis maybe overlapping with inpatient visit such as that cohort_start_date of anaphylaxis cohort is > visit_start_date and < visit_end_date.

This brings the situation of the output of the visitContext subset operator being

  1. subject_id, anaphylaxis.cohort_start_date, anaphylaxis.cohort_end_date, or
  2. 'subject_id, visit_occurrence.visit_start_date AS cohort_start_date, visit_occurrence.visit_end_date AS cohort_end_date`

i.e. should we allow for reindexing in the visitContext subset operator?

In the absence of reindexing in visitContext, we have to do this - which we want to not do. @chrisknoll 1) use all visits that are inpatient as the target cohort 2) subset the all inpatient visits by the anaphylaxis cohort (subset)

chrisknoll commented 1 year ago

Subset operators aren't used to reassign cohort dates, but instead to subset the episodes found in a given cohort. You want to be able to go from a subset back to the parent target cohort by subject_id, start_date, end date (ie: subset is subsumed by T).

Re-indexing is an idea, but I think beyond the scope of this issue.

gowthamrao commented 1 year ago

I was thinking about the team conversation the other day on this topic. I like the idea of modeling visit subset operator based on cohortSubset operator.

Breaking down: we can think of two steps

1) Deriving a cohort like table first visit table(s). i.e., we support a set of operators that operate on the visit/visits table to yield something that is like a cohort table i.e. of the form id, start_date, end_date (i.e. person_id, visit_start_date, visit_end_date) 2) A subset operator that subsets the target cohorts with that of 1.

gowthamrao commented 1 year ago

Elaborating 1 Deriving a cohort like table first visit table(s): I am thinking of this as two steps

visitDefinition1 <-
  createVisitDefinition(
    visitTableName = "visit_occurrence",
    visitConceptIds = c(9201),
    visitSourceConceptIds = NULL,
    visitTypeConceptIds = NULL
  )

that (when we apply as a subsetOperator in a subsequent step) yields an SQL that is something like

SELECT person_id, visit_start_date, max(visit_end_date) visit_end_date
FROM visit_occurrence v
INNER JOIN
  (
      SELECT DISTINCT SUBJECT_ID FROM cohort WHERE cohort_definition_id = @target_cohort_id) c
ON v.person_id = c.subject_id
  WHERE visit_concept_id = 9201
              AND visit_end_date >= visit_start_date
GROUP BY person_id, visit_start_date;

We can make a list of these something like list(visitDefinition1, visitDefinition2, visitDefinition3) that yields an SQL something like


SELECT person_id, visit_start_date, max(visit_end_date) visit_end_date
FROM
(
# definition 1
  SELECT person_id, visit_start_date, max(visit_end_date) visit_end_date
  FROM visit_occurrence v
  INNER JOIN
    (
        SELECT DISTINCT SUBJECT_ID FROM cohort WHERE cohort_definition_id = @target_cohort_id) c
  ON v.person_id = c.subject_id
  WHERE visit_concept_id = 9201
              AND visit_end_date >= visit_start_date
  GROUP BY person_id, visit_start_date

  UNION

  # definition 2
  SELECT person_id, visit_start_date, max(visit_end_date) visit_end_date
  FROM visit_detail v
  INNER JOIN
  (
    SELECT DISTINCT SUBJECT_ID FROM cohort WHERE cohort_definition_id = @target_cohort_id) c
  ON v.person_id = c.subject_id
  WHERE visit_concept_id = 9201
              AND visit_end_date >= visit_start_date
  GROUP BY person_id, visit_start_date
) f
GROUP BY person_id, visit_start_date;
gowthamrao commented 1 year ago

Elaborating 2 - A subset operator that subsets the target cohorts with the output of 1. Now we can think of reusing the createSubsetCohortWindow and createCohortSubset like functionality with something like createSubsetVisitWindow and createVisitSubset

chrisknoll commented 1 year ago

Hi, @gowthamrao , I think these ideas are going beyond the intended purpose of the subset operator which is: identify the cohort episodes that have a period that is in proximity to a visit (based on the specified visit concept ids).

Your examples seem to be finding visit_occurrence records, but the purpose of a subset operator is to find the COHORT records based on a subset query. The other confusing part about your queries is that they seem to result in nothing when you have > 1 day visits:

  SELECT person_id, visit_start_date, max(visit_end_date) visit_end_date
  FROM visit_occurrence v
  INNER JOIN
    (
        SELECT DISTINCT SUBJECT_ID FROM cohort WHERE cohort_definition_id = @target_cohort_id) c
  ON v.person_id = c.subject_id
  WHERE visit_concept_id = 9201
              AND visit_end_date >= visit_start_date
  GROUP BY person_id, visit_start_date

When you say visit_end_date >= visit_start_date, this will only be true for single day visits.

Update: I misread that thinking it said start_date >= end_date, but the confusion remains: why even specify that WHERE clause when all visits will have end_dates => start dates? When reading your code, I'm trying to digest every thing you are specifying and find reason for it, but I can't find a reason for this one.

Let's stay on target as to what a 'visit subset operator' is intended to do (find cohort episodes relative to a visit) and if you want to do something different like go from a first visit to the end of the last visit (ie: person_id, visit_start_date, max(visit_end_date) visit_end_date), then you'd create a Visit cohort (all events) with a gap days of 99999 such that the cohort episode will go from the first visit to the last visit (and include all time between). With that type of cohort, you'd just use your normal CohortSubsetOperator to do the job.

gowthamrao commented 1 year ago

why even specify that WHERE clause when all visits will have end_dates => start dates?

Aah - this is because i am paranoid. Yes, by convention visit_end_date >= visit_start_date, but i have seen bad ETL. And this just handles a bad ETL. You can take it off if you dont like it.

gowthamrao commented 1 year ago

I think you are asking why the max(visit_end_date) ?

The answer is -- unlike records in the cohort table, records in the visit table may be overlapping. So a max would remove possible overlaps.

You can think of my proposed sql as follows if that helps

SELECT person_id, visit_start_date, visit_end_date
  FROM visit_occurrence v
  INNER JOIN
    (
        SELECT DISTINCT SUBJECT_ID FROM cohort WHERE cohort_definition_id = @target_cohort_id) c
  ON v.person_id = c.subject_id
  WHERE visit_concept_id = 9201
gowthamrao commented 1 year ago

They key idea i was proposing is that we can think of the visitSubsetOperator as something like this

Breaking down: we can think of two steps

Deriving a cohort like table first visit table(s). i.e., we support a set of operators that operate on the visit/visits table to yield something that is like a cohort table i.e. of the form id, start_date, end_date (i.e. person_id, visit_start_date, visit_end_date) A subset operator that subsets the target cohorts with that of 1.

chrisknoll commented 1 year ago

Ok, when you say 'delivering a cohort-like table', I'd say just resort to creating a cohort table with any rules you want to apply to visit occurrences to lump them together. We targeted this function to allow you to subset cohorts by visit context so all you really need is visit concept IDs and time windows. The method you describe above would have a person with a visit in 2015 and 2019 span a timewindow that starts in 2015 and ends in 2019...this is not exactly what visit context means, but rather you're defining a phenotype of 'experienced a visit` that combines different visits into a single one. When you do this, you loose track of intermediate visits, which may not matter for your specific usecase (which i say again, use the cohort subset functionality) but when thinking specifically about visit_occurrence we want to do more specific things like know their actual start/ends, maybe look at provider specialty, etc. When you merge visits together like that, you lose that context.

gowthamrao commented 1 year ago

I think you are saying is that after a certain level of complexity, we reach a point of diminishing return and could just use the general purpose CohortSubset operator where the subset cohort is another circe based cohort definition for visit. Yup - i agree with you there.

So lets agree that the solution should be something as simple (represented in pseudo code)

SELECT c.*
FROM cohort c
INNER JOIN visit_occurrence v
ON person + overlap time window rules  of v.start/v.end/c.start/c.end with offsets
WHERE visit_concept_id IN (an array of concept ids)
chrisknoll commented 1 year ago

That's fine, tho It might be interesting to incorporate provider into the subsetting (providers are associated by visits) so maybe that can be a phase 2 enhancement.