callahantiff / PheKnowVec

Translational Computational Phenotyping
2 stars 0 forks source link

Query Verification: Sickle Cell Disease Cohort #105

Closed callahantiff closed 5 years ago

callahantiff commented 5 years ago

@mgkahn - Can you please help me verify the query to select Sickle Cell Disease patients?


COHORT CRITERIA Case Criteria:


Control Criteria:



Cohort Logic Table

COHORT CHUNK LOGICAL OPERATOR
CASE dx_case_inclusion_criteria_1 AND
CASE visit_inclusion_criteria_1 AND
CASE dx_case_exclusion_criteria_1 ---

Covariates


SQL Query

WITH dx_case_inclusion_criteria_1 AS (
  SELECT co.person_id, cohort.standard_code_set AS code_set 
  FROM 
    CHCO_DeID_Oct2018.condition_occurrence co,
    CHCO_DeID_Oct2018.SICKLECELLDISEASE_COHORT_VARS cohort
  WHERE 
    co.condition_concept_id = CAST(cohort.standard_concept_id AS int64) 
    AND cohort.standard_code_set = {code_set_group}
    GROUP BY co.person_id, cohort.standard_code_set
  HAVING
    COUNT(DISTINCT co.condition_concept_id) >= 1
),

visit_inclusion_criteria_1 AS (
  SELECT person_id FROM 

    -- identify patients with at least 2 outpatient visits at least 30 days apart
    (SELECT v1.person_id
    FROM 
      CHCO_DeID_Oct2018.visit_occurrence v1,
      CHCO_DeID_Oct2018.visit_occurrence v2
    WHERE 
      v1.person_id = v2.person_id
      AND v1.visit_concept_id in (9202)
      AND v2.visit_concept_id in (9202)
      AND v1.visit_start_datetime < v2.visit_start_datetime
      AND DATETIME_DIFF(DATETIME(v2.visit_start_datetime), DATETIME(v1.visit_start_datetime), DAY) >= 30
      GROUP BY v1.person_id, v1.visit_start_datetime, v2.visit_start_datetime
    HAVING
      COUNT(DISTINCT v1.visit_occurrence_id) >= 2
    ORDER BY
      v1.visit_start_datetime, v2.visit_start_datetime)

  UNION ALL

    -- identify patients with at least 1 hospitalization
    (SELECT person_id FROM 
        CHCO_DeID_Oct2018.visit_occurrence
      WHERE 
        visit_concept_id in (9201, 9203)
      GROUP BY
        person_id, visit_occurrence_id
      HAVING
        COUNT(DISTINCT visit_occurrence_id) >= 1)
),

dx_case_exclusion_criteria_1 AS (
  SELECT v.person_id, cohort.standard_code_set AS code_set
  FROM 
    CHCO_DeID_Oct2018.visit_occurrence v,
    CHCO_DeID_Oct2018.care_site c, 
    CHCO_DeID_Oct2018.condition_occurrence co, 
    CHCO_DeID_Oct2018.SICKLECELLDISEASE_COHORT_VARS cohort 
  JOIN
    (SELECT DISTINCT v.person_id, COUNT(DISTINCT co.condition_concept_id) AS trait_count, cohort.standard_code_set AS code_set
     FROM 
      CHCO_DeID_Oct2018.visit_occurrence v,
      CHCO_DeID_Oct2018.care_site c, 
      CHCO_DeID_Oct2018.condition_occurrence co, 
      CHCO_DeID_Oct2018.SICKLECELLDISEASE_COHORT_VARS cohort 
     WHERE co.visit_occurrence_id = v.visit_occurrence_id 
      AND co.condition_concept_id = CAST(cohort.standard_concept_id AS int64) 
      AND cohort.phenotype_definition_number = 2 
      AND cohort.standard_code_set = {code_set_group}
     GROUP BY 
      v.person_id, code_set) b
    ON v.person_id = b.person_id
   WHERE co.visit_occurrence_id = v.visit_occurrence_id 
    AND co.condition_concept_id = CAST(cohort.standard_concept_id AS int64) 
    AND cohort.phenotype_definition_number = 1 
  GROUP BY 
    v.person_id, code_set, b.trait_count
  HAVING
    COUNT(DISTINCT co.condition_concept_id) > b.trait_count
)

SELECT person_id, code_set, 'SICKLECELLDISEASE_CASE_TYPE1' AS cohort_type FROM (
  SELECT person_id, code_set FROM dx_case_inclusion_criteria_1
   WHERE person_id IN (SELECT person_id FROM visit_inclusion_criteria_1)
  EXCEPT DISTINCT
  SELECT person_id, code_set FROM dx_case_exclusion_criteria_1)
;
mgkahn commented 5 years ago

dx_case_inclusion_criteria_1: Had to look up LAG(). How did you ever find that one. Must have been lots of Googling. Based on the examples in GBQ documentation, don't you need to PARTITION BY PATIENT_ID rather than just ORDER BY PATIENT_ID, VISIT_START_DATETIME. Else, won't the last row of the previous patient and the first row of the next patient be compared in your DATETIME_DIFF?

mgkahn commented 5 years ago

dx_case_inclusion_criteria_1: What if visit1-visit2 < 30 days and visit2-visit3<30 days but visit1-visit3 > 30 days. Won't your DATETIME_DIFF using LAG() miss that the patient meets criteria because of visit1-visit3? Seems you need all combinations of visit_start_dt so you can find any combo that is >30 days, not just adjacent ones. (I think your LAG() logic only look at adjacent days but I've never used it)

mgkahn commented 5 years ago

dx_case_inclusion_criteria_1: "History of" conditions are in observation per OHDSI conventions. Not sure if PEDSnet is following that convention. Your query doesn't include any relevant conditions found in the observation table. Since these events will not have a relevant time stamp, not sure how you are supposed to use them in the calculation of >30 days part anyway. So I don't think this is a big loss. But just wanted to note that you are not grabbing history of conditions.

mgkahn commented 5 years ago

dx_case_inclusion_criteria_1: Where is the "or one hospitalization" option? I see it in dx_case_inclusion_criteria_2 but not in criteria_1.

mgkahn commented 5 years ago

dx_case_inclusion_criteria_1 and 2: Why not use visit__concept_id = 8061688 for outpatient visits rather than using admitting_source_concept_id in criteria_1 and care site.place_of_service in criteria_2? This is a standard code. image

mgkahn commented 5 years ago

dx_case_exclusion_criteria_1: I am assuming: cohort.phenotype_definition_number = 2 are SS trait diagnoses cohort.phenotype_definition_number = 1 are SS disease diagnoses

mgkahn commented 5 years ago

Your final logic statement: Will the EXCEPT DISTINCT work as you want because of the code_set variable? Seems if somebody is included based on one code_set but excluded because of some other code_set, they won't be excluded because of the EXCEPT DISTINCT. Is this right? Should they get excluded no matter what code_set they are in? If so, you may need to rewrite to exclude based on a NOT IN clause using only PERSON_IDs in the dx_case_exclusion_criteria_1 CTE.

mgkahn commented 5 years ago

No more comments on SS.

callahantiff commented 5 years ago

dx_case_inclusion_criteria_1: Had to look up LAG(). How did you ever find that one. Must have been lots of Googling. Based on the examples in GBQ documentation, don't you need to PARTITION BY PATIENT_ID rather than just ORDER BY PATIENT_ID, VISIT_START_DATETIME. Else, won't the last row of the previous patient and the first row of the next patient be compared in your DATETIME_DIFF?

Definitely found it from an epic Googling session! šŸ† Good catch, thank you for pointing that out! I will make the change.

callahantiff commented 5 years ago

dx_case_inclusion_criteria_1 and 2: Why not use visit__concept_id = 8061688 for outpatient visits rather than using admitting_source_concept_id in criteria_1 and care site.place_of_service in criteria_2? This is a standard code. image

Great suggestion, that's much simpler!

callahantiff commented 5 years ago

dx_case_inclusion_criteria_1: Where is the "or one hospitalization" option? I see it in dx_case_inclusion_criteria_2 but not in criteria_1.

Fixed!

callahantiff commented 5 years ago

dx_case_inclusion_criteria_1: "History of" conditions are in observation per OHDSI conventions. Not sure if PEDSnet is following that convention. Your query doesn't include any relevant conditions found in the observation table. Since these events will not have a relevant time stamp, not sure how you are supposed to use them in the calculation of >30 days part anyway. So I don't think this is a big loss. But just wanted to note that you are not grabbing history of conditions.

Thank you for pointing this out. I'm really glad we can have a record of this!

callahantiff commented 5 years ago

Your final logic statement: Will the EXCEPT DISTINCT work as you want because of the code_set variable? Seems if somebody is included based on one code_set but excluded because of some other code_set, they won't be excluded because of the EXCEPT DISTINCT. Is this right? Should they get excluded no matter what code_set they are in? If so, you may need to rewrite to exclude based on a NOT IN clause using only PERSON_IDs in the dx_case_exclusion_criteria_1 CTE.

I think it's ok as written. We want to allow for different patients to be included and excluded depending on the code set.

callahantiff commented 5 years ago

dx_case_exclusion_criteria_1: I am assuming: cohort.phenotype_definition_number = 2 are SS trait diagnoses cohort.phenotype_definition_number = 1 are SS disease diagnoses

Yep, that's right!

callahantiff commented 5 years ago

dx_case_inclusion_criteria_1: What if visit1-visit2 < 30 days and visit2-visit3<30 days but visit1-visit3 > 30 days. Won't your DATETIME_DIFF using LAG() miss that the patient meets criteria because of visit1-visit3? Seems you need all combinations of visit_start_dt so you can find any combo that is >30 days, not just adjacent ones. (I think your LAG() logic only look at adjacent days but I've never used it)

Hm...That's an interesting point. How do we write a query which does that? I got pretty stuck on this part of the query originally, which is why I landed on LAG(). Are you suggesting messing with the ORDER BY argument and/or also including the LEAD() function?

mgkahn commented 5 years ago

In abstract you have two visit-occurrences, v1 and V2, joined by patid and in your where clause v1.starttime < v2 start time and datediff(v1,v2,day)>30 The JOIN and inequality gives you all combinations of visits where v1 occurs before v2

Michael G. Kahn MD, PhD Professor with Tenure, Section of Clinical Informatics, Department of Pediatrics Co-Director & Translational Informatics Core Director, Colorado Clinical and Translational Sciences Institute Director, Health Data Compass Associate Director, Colorado Center for Personalized Medicine University of Colorado Denver Anschutz Medical Campus Medical Director, Research Informatics Children's Hospital Colorado Research Institute Aurora, Colorado E: Michael.Kahn@ucdenver.edu P: 303-724-8334


From: Tiffany J. Callahan notifications@github.com Sent: Sunday, August 4, 2019 9:42:57 PM To: callahantiff/PheKnowVec PheKnowVec@noreply.github.com Cc: Kahn, Michael Michael.Kahn@ucdenver.edu; Mention mention@noreply.github.com Subject: Re: [callahantiff/PheKnowVec] Query Verification: Sickle Cell Disease Cohort (#105)

dx_case_inclusion_criteria_1: What if visit1-visit2 < 30 days and visit2-visit3<30 days but visit1-visit3 > 30 days. Won't your DATETIME_DIFF using LAG() miss that the patient meets criteria because of visit1-visit3? Seems you need all combinations of visit_start_dt so you can find any combo that is >30 days, not just adjacent ones. (I think your LAG() logic only look at adjacent days but I've never used it)

Hm...That's an interesting point. How do we write a query which does that? I got pretty stuck on this part of the query originally, which is why I landed on LAG(). Are you suggesting messing with the ORDER BY argument and/or also including the LEAD() function?

ā€” You are receiving this because you were mentioned. Reply to this email directly, view it on GitHubhttps://github.com/callahantiff/PheKnowVec/issues/105?email_source=notifications&email_token=AA557TXAN27VZ5E46GCNPDLQC6OUDA5CNFSM4IF3OQ3KYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD3QTDPI#issuecomment-518074813, or mute the threadhttps://github.com/notifications/unsubscribe-auth/AA557TQY5FTV3MD56W7T7VLQC6OUDANCNFSM4IF3OQ3A.

callahantiff commented 5 years ago

In abstract you have two visit-occurrences, v1 and V2, joined by patid and in your where clause v1.starttime < v2 start time and datediff(v1,v2,day)>30 The JOIN and inequality gives you all combinations of visits where v1 occurs before v2 Michael G. Kahn MD, PhD Professor with Tenure, Section of Clinical Informatics, Department of Pediatrics Co-Director & Translational Informatics Core Director, Colorado Clinical and Translational Sciences Institute Director, Health Data Compass Associate Director, Colorado Center for Personalized Medicine University of Colorado Denver Anschutz Medical Campus Medical Director, Research Informatics Children's Hospital Colorado Research Institute Aurora, Colorado E: Michael.Kahn@ucdenver.edu P: 303-724-8334 ā€¦ ____ From: Tiffany J. Callahan notifications@github.com Sent: Sunday, August 4, 2019 9:42:57 PM To: callahantiff/PheKnowVec PheKnowVec@noreply.github.com Cc: Kahn, Michael Michael.Kahn@ucdenver.edu; Mention mention@noreply.github.com Subject: Re: [callahantiff/PheKnowVec] Query Verification: Sickle Cell Disease Cohort (#105) dx_case_inclusion_criteria_1: What if visit1-visit2 < 30 days and visit2-visit3<30 days but visit1-visit3 > 30 days. Won't your DATETIME_DIFF using LAG() miss that the patient meets criteria because of visit1-visit3? Seems you need all combinations of visit_start_dt so you can find any combo that is >30 days, not just adjacent ones. (I think your LAG() logic only look at adjacent days but I've never used it) Hm...That's an interesting point. How do we write a query which does that? I got pretty stuck on this part of the query originally, which is why I landed on LAG(). Are you suggesting messing with the ORDER BY argument and/or also including the LEAD() function? ā€” You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub<#105?email_source=notifications&email_token=AA557TXAN27VZ5E46GCNPDLQC6OUDA5CNFSM4IF3OQ3KYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD3QTDPI#issuecomment-518074813>, or mute the threadhttps://github.com/notifications/unsubscribe-auth/AA557TQY5FTV3MD56W7T7VLQC6OUDANCNFSM4IF3OQ3A.

OK, I think I follow (and I feel dumb for not doing this in the first place). So, something like this? Note that this is not tailored to the solution yet, but a more general query as I just want to make sure I understand.

SELECT v1.person_id, v1.visit_start_datetime AS v1_date, v2.visit_start_datetime AS v2_date, DATETIME_DIFF(DATETIME(v2.visit_start_datetime), DATETIME(v1.visit_start_datetime), DAY) > 30

FROM CHCO_DeID_Oct2018.visit_occurrence v1
JOIN (SELECT person_id, visit_start_datetime
        FROM CHCO_DeID_Oct2018.visit_occurrence
       WHERE visit_concept_id IN (9202)) v2 
ON v1.person_id = v2.person_id

WHERE 
  v1.visit_concept_id IN (9202)
  AND v1.visit_start_datetime < v2.visit_start_datetime 
  AND DATETIME_DIFF(DATETIME(v2.visit_start_datetime), DATETIME(v1.visit_start_datetime), DAY) > 30

GROUP BY v1.person_id, v1.visit_start_datetime, v2.visit_start_datetime
ORDER BY v1.visit_start_datetime, v2.visit_start_datetime;
mgkahn commented 5 years ago

Much easier than what you wrote:

SELECT v1.person_id, v1.visit_start_datetime AS First_Visit, v2.visit_start_datetime AS Second_Visit, DATETIME_DIFF(DATETIME(v2.visit_start_datetime), DATETIME(v1.visit_start_datetime), DAY) AS Days_Between_Visits FROM CHCO_DeID_Oct2018.visit_occurrence v1 JOIN CHCO_DeID_Oct2018.visit_occurrence v2 ON (v1.person_id = v2.person_id) WHERE v1.visit_concept_id in (9202) AND v2.visit_concept_id in (9202) AND v1.visit_start_datetime < v2.visit_start_datetime / ensures v1 is always before v2 / AND DATETIME_DIFF(DATETIME(v2.visit_start_datetime), DATETIME(v1.visit_start_datetime), DAY) > 30 ORDER BY v1.visit_start_datetime, v2.visit_start_datetime;

image

callahantiff commented 5 years ago

Much easier than what you wrote:

SELECT v1.person_id, v1.visit_start_datetime AS First_Visit, v2.visit_start_datetime AS Second_Visit, DATETIME_DIFF(DATETIME(v2.visit_start_datetime), DATETIME(v1.visit_start_datetime), DAY) AS Days_Between_Visits FROM CHCO_DeID_Oct2018.visit_occurrence v1 JOIN CHCO_DeID_Oct2018.visit_occurrence v2 ON (v1.person_id = v2.person_id) WHERE v1.visit_concept_id in (9202) AND v2.visit_concept_id in (9202) AND v1.visit_start_datetime < v2.visit_start_datetime / ensures v1 is always before v2 / AND DATETIME_DIFF(DATETIME(v2.visit_start_datetime), DATETIME(v1.visit_start_datetime), DAY) > 30 ORDER BY v1.visit_start_datetime, v2.visit_start_datetime;

image

Looks great, thanks!

callahantiff commented 5 years ago

OK, here is the updated query:

WITH dx_case_inclusion_criteria_1 AS (
  SELECT co.person_id, cohort.standard_code_set AS code_set 
  FROM 
    CHCO_DeID_Oct2018.condition_occurrence co,
    CHCO_DeID_Oct2018.SICKLECELLDISEASE_COHORT_VARS cohort
  WHERE 
    co.condition_concept_id = CAST(cohort.standard_concept_id AS int64) 
    AND cohort.standard_code_set = {code_set_group}
    GROUP BY co.person_id, cohort.standard_code_set
  HAVING
    COUNT(DISTINCT co.condition_concept_id) >= 1
),

visit_inclusion_criteria_1 AS (
  SELECT person_id FROM 

    -- identify patients with at least 2 outpatient visits at least 30 days apart
    (SELECT v1.person_id
    FROM 
      CHCO_DeID_Oct2018.visit_occurrence v1,
      CHCO_DeID_Oct2018.visit_occurrence v2
    WHERE 
      v1.person_id = v2.person_id
      AND v1.visit_concept_id in (9202)
      AND v2.visit_concept_id in (9202)
      AND v1.visit_start_datetime < v2.visit_start_datetime -- this line forces 2 distinct visits
      AND DATETIME_DIFF(DATETIME(v2.visit_start_datetime), DATETIME(v1.visit_start_datetime), DAY) >= 30
      GROUP BY v1.person_id, v1.visit_start_datetime, v2.visit_start_datetime
    ORDER BY
      v1.visit_start_datetime, v2.visit_start_datetime)

  UNION ALL

    -- identify patients with at least 1 hospitalization
    (SELECT person_id FROM 
        CHCO_DeID_Oct2018.visit_occurrence
      WHERE 
        visit_concept_id in (9201, 9203)
      GROUP BY
        person_id, visit_occurrence_id
      HAVING
        COUNT(DISTINCT visit_occurrence_id) >= 1)
),

dx_case_exclusion_criteria_1 AS (
  SELECT v.person_id, cohort.standard_code_set AS code_set
  FROM 
    CHCO_DeID_Oct2018.visit_occurrence v,
    CHCO_DeID_Oct2018.care_site c, 
    CHCO_DeID_Oct2018.condition_occurrence co, 
    CHCO_DeID_Oct2018.SICKLECELLDISEASE_COHORT_VARS cohort 
  JOIN
    (SELECT DISTINCT v.person_id, COUNT(DISTINCT co.condition_concept_id) AS trait_count, cohort.standard_code_set AS code_set
     FROM 
      CHCO_DeID_Oct2018.visit_occurrence v,
      CHCO_DeID_Oct2018.care_site c, 
      CHCO_DeID_Oct2018.condition_occurrence co, 
      CHCO_DeID_Oct2018.SICKLECELLDISEASE_COHORT_VARS cohort 
     WHERE co.visit_occurrence_id = v.visit_occurrence_id 
      AND co.condition_concept_id = CAST(cohort.standard_concept_id AS int64) 
      AND cohort.phenotype_definition_number = 2 
      AND cohort.standard_code_set = {code_set_group}
     GROUP BY 
      v.person_id, code_set) b
    ON v.person_id = b.person_id
   WHERE co.visit_occurrence_id = v.visit_occurrence_id 
    AND co.condition_concept_id = CAST(cohort.standard_concept_id AS int64) 
    AND cohort.phenotype_definition_number = 1 
  GROUP BY 
    v.person_id, code_set, b.trait_count
  HAVING
    COUNT(DISTINCT co.condition_concept_id) > b.trait_count
)

SELECT person_id, code_set, 'SICKLECELLDISEASE_CASE_TYPE1' AS cohort_type FROM (
  SELECT person_id, code_set FROM dx_case_inclusion_criteria_1
   WHERE person_id IN (SELECT person_id FROM visit_inclusion_criteria_1)
  EXCEPT DISTINCT
  SELECT person_id, code_set FROM dx_case_exclusion_criteria_1)
;

Do you approve? You will notice it has changed slightly from before. I re-read the phenotype description from eMERGE (see below) and realized I had made a mistake in my first pass. With the changes shown above, I feel like it should now be good to go!

Screen Shot 2019-08-07 at 20 14 56
callahantiff commented 5 years ago

TO DO: