OHDSI / CohortGenerator

An R package for instantiating cohorts using data in the CDM.
https://ohdsi.github.io/CohortGenerator/
11 stars 10 forks source link

Negation of cohort subset operator must join on subjetct_id AND start_date #167

Closed chrisknoll closed 2 weeks ago

chrisknoll commented 2 weeks ago

When running a negation cohort subset operator, you may miss a person from the result because they have multiple events in their cohort that passes on one date and fails on another.

This is the cohort subset query:

SELECT B.subject_id,
       B.cohort_start_date,
       B.cohort_end_date --INTO #S_1
FROM (SELECT T.subject_id,
             T.cohort_start_date,
             T.cohort_end_date
      FROM #cohort_sub_base T
        JOIN @cohortTable S ON T.subject_id = S.subject_id
      WHERE S.cohort_definition_id IN (12467)
      AND   (S.cohort_start_date >= DATEADD(d,-365,T.cohort_start_date) AND S.cohort_start_date <= DATEADD(d,30,T.cohort_start_date))
      AND   (S.cohort_end_date >= DATEADD(d,-9999,T.cohort_start_date) AND S.cohort_end_date <= DATEADD(d,9999,T.cohort_start_date))
      GROUP BY T.subject_id,
               T.cohort_start_date,
               T.cohort_end_date
      HAVING COUNT(DISTINCT S.COHORT_DEFINITION_ID) >= 1) A
  ** RIGHT JOIN #cohort_sub_base B ON B.subject_id = A.subject_id ** 
WHERE A.subject_id IS NULL
AND B.SUBJECT_ID = 1254018103

Note the code in the RIGHT JOIN. The right join is what is going to identify those people who were in the base cohort but not in the resulting subset to NEGATE and return those people who didn't match on the cohort to subset with.

However, in a case where you have multiple events per person, you can have a case where a person matches on one date but not another, but the RIGHT JOIN is only joining on subject_id. The correct code should match on subject_id and cohort_start_date any time we are trying to identify the matching records:

SELECT B.subject_id,
       B.cohort_start_date,
       B.cohort_end_date --INTO #S_1
FROM (SELECT T.subject_id,
             T.cohort_start_date,
             T.cohort_end_date
      FROM #cohort_sub_base T
        JOIN @cohortTable S ON T.subject_id = S.subject_id
        AND s.SUBJECT_ID = 1254018103
      WHERE S.cohort_definition_id IN (12467)
      AND   (S.cohort_start_date >= DATEADD(d,-365,T.cohort_start_date) AND S.cohort_start_date <= DATEADD(d,30,T.cohort_start_date))
      AND   (S.cohort_end_date >= DATEADD(d,-9999,T.cohort_start_date) AND S.cohort_end_date <= DATEADD(d,9999,T.cohort_start_date))
      GROUP BY T.subject_id,
               T.cohort_start_date,
               T.cohort_end_date
      HAVING COUNT(DISTINCT S.COHORT_DEFINITION_ID) >= 1) A
  RIGHT JOIN #cohort_sub_base B ON B.subject_id = A.subject_id
   *** AND b.COHORT_START_DATE = a.COHORT_START_DATE ***
WHERE A.subject_id IS NULL
AND B.SUBJECT_ID = 1254018103
anthonysena commented 2 weeks ago

This is now fixed in develop and I've added additional unit tests to ensure the subset logic is working well when dealing with a cohort with multiple entries.