callahantiff / PheKnowVec

Translational Computational Phenotyping
2 stars 0 forks source link

Query Verification: ADHD Cohort #100

Closed callahantiff closed 5 years ago

callahantiff commented 5 years ago

@mgkahn - Can you please help me verify the query to select ADHD patients?


COHORT CRITERIA Case Criteria:
CASE TYPE 1

CASE TYPE 2


Control Criteria:


Cohort Logic Table

COHORT CHUNK LOGICAL OPERATOR
CASE TYPE 1 age_criteria_1 AND
CASE TYPE 1 dx_case_inclusion_criteria_1 AND
CASE TYPE 1 rx_case_inclusion_criteria_1 AND
CASE TYPE 1 dx_case_exclusion_criteria_1 ---
CASE TYPE 2 age_criteria_1 AND
CASE TYPE 2 dx_case_inclusion_criteria_2 AND
CASE TYPE 2 dx_case_exclusion_criteria_1 ---
CONTROL age_criteria_2 AND
CONTROL visit_criteria_1 AND
CONTROL rx_control_exclusion_criteria_1 OR
CONTROL dx_control_exclusion_criteria_1 OR
CONTROL dx_control_exclusion_criteria_2 ---

NOTE.

  1. Replace all instances of {database} with CHCO_DeID_Oct2018
  2. Comment out all instances of {code_set_group}

Query can be found here and is also included below:

WITH age_criteria_1 AS (
SELECT co.person_id, cohort.standard_code_set AS code_set 
FROM 
  {database}.person p, 
  {database}.condition_occurrence co, 
  {database}.ADHD_COHORT_VARS cohort 
WHERE p.person_id = co.person_id 
  AND co.condition_concept_id = CAST(cohort.standard_concept_id AS int64) 
  AND cohort.phenotype_definition_number = 1
  AND cohort.standard_code_set = {code_set_group}
GROUP BY
  co.person_id, p.birth_datetime, cohort.standard_code_set
HAVING
  DATETIME_DIFF(DATETIME(MIN(co.condition_start_datetime)), DATETIME(p.birth_datetime), DAY) >= 1460
),

dx_case_inclusion_criteria_1 AS (
SELECT co.person_id, cohort.standard_code_set AS code_set 
FROM 
  {database}.condition_occurrence co, 
  {database}.ADHD_COHORT_VARS cohort, 
  {database}.visit_occurrence v 
WHERE 
  co.visit_occurrence_id = v.visit_occurrence_id 
  AND co.condition_concept_id = CAST(cohort.standard_concept_id AS int64) 
  AND v.admitting_source_concept_id IN (44814672, 8870, 38004279, 8863, 8546) 
  AND cohort.phenotype_definition_number = 1 
  AND cohort.standard_code_set = {code_set_group}
GROUP BY 
  co.person_id, cohort.standard_code_set
HAVING 
  COUNT(DISTINCT co.condition_concept_id) >= 1 
  AND COUNT(DISTINCT v.visit_start_date) >= 1
),

rx_case_inclusion_criteria_1 AS (
SELECT de.person_id, cohort.standard_code_set AS code_set 
FROM 
  {database}.drug_exposure de, 
  {database}.ADHD_COHORT_VARS cohort 
WHERE 
  de.drug_concept_id = CAST(cohort.standard_concept_id AS int64) 
  AND cohort.phenotype_definition_number = 2 
  AND cohort.standard_code_set = {code_set_group}
GROUP BY 
  de.person_id, cohort.standard_code_set
HAVING 
  COUNT(DISTINCT de.drug_concept_id) >= 1
),

dx_case_exclusion_criteria_1 AS (
SELECT co.person_id, cohort.standard_code_set AS code_set
FROM 
  {database}.condition_occurrence co,
  {database}.ADHD_COHORT_VARS cohort
WHERE 
  co.condition_concept_id = CAST(cohort.standard_concept_id AS int64)
  AND cohort.phenotype_definition_number = 3
  AND cohort.standard_code_set = {code_set_group}
GROUP BY co.person_id, cohort.standard_code_set
HAVING 
  COUNT(DISTINCT co.condition_concept_id) >= 1
),

dx_case_inclusion_criteria_2 AS (
SELECT co.person_id, cohort.standard_code_set AS code_set 
FROM 
  {database}.condition_occurrence co, 
  {database}.ADHD_COHORT_VARS cohort, 
  {database}.visit_occurrence v 
WHERE 
  co.visit_occurrence_id = v.visit_occurrence_id 
  AND co.condition_concept_id = CAST(cohort.standard_concept_id AS int64) 
  AND v.admitting_source_concept_id IN (44814672, 8870, 38004279, 8863, 8546) 
  AND cohort.phenotype_definition_number = 1 
  AND cohort.standard_code_set = {code_set_group}
GROUP BY 
  co.person_id, cohort.standard_code_set
HAVING 
  COUNT(DISTINCT co.condition_concept_id) >= 1 
  AND COUNT(DISTINCT v.visit_start_date) >= 2
),

age_criteria_2 AS (
SELECT co.person_id, {code_set_group} AS code_set 
FROM 
  {database}.person p,
  {database}.condition_occurrence co
WHERE p.person_id = co.person_id 
GROUP BY
  co.person_id, p.birth_datetime
HAVING
  DATETIME_DIFF(DATETIME(MIN(co.condition_start_datetime)), DATETIME(p.birth_datetime), DAY) >= 1460
),

visit_criteria_1 AS (
SELECT v.person_id, {code_set_group} AS code_set 
FROM 
  {database}.visit_occurrence v
GROUP BY 
  v.person_id, v.visit_end_datetime 
HAVING
  DATETIME(visit_end_datetime) >= DATETIME_ADD((SELECT DATETIME(MAX(visit_end_datetime)) FROM {database}.visit_occurrence), INTERVAL -5 YEAR)
  AND COUNT(DISTINCT v.visit_start_date) >= 2
),

rx_control_exclusion_criteria_1 AS (
SELECT de.person_id, cohort.standard_code_set AS code_set
FROM 
  {database}.drug_exposure de,
  {database}.ADHD_COHORT_VARS cohort
WHERE 
  de.drug_concept_id = CAST(cohort.standard_concept_id AS int64)
  AND cohort.phenotype_definition_number = 4
  AND cohort.standard_code_set = {code_set_group}
GROUP BY de.person_id, cohort.standard_code_set
HAVING 
  COUNT(DISTINCT de.drug_concept_id) >= 1
),

dx_control_exclusion_criteria_1 AS (
SELECT co.person_id, cohort.standard_code_set AS code_set
FROM 
  {database}.condition_occurrence co,
  {database}.ADHD_COHORT_VARS cohort
WHERE 
  co.condition_concept_id = CAST(cohort.standard_concept_id AS int64)
  AND cohort.phenotype_definition_number = 5
  AND cohort.standard_code_set = {code_set_group}
GROUP BY co.person_id, cohort.standard_code_set
HAVING 
  COUNT(DISTINCT co.condition_concept_id) >= 1
),

dx_control_exclusion_criteria_2 AS (
SELECT co.person_id, cohort.standard_code_set AS code_set
FROM 
  {database}.condition_occurrence co,
  {database}.ADHD_COHORT_VARS cohort
WHERE 
  co.condition_concept_id = CAST(cohort.standard_concept_id AS int64)
  AND cohort.phenotype_definition_number = 6
  AND cohort.standard_code_set = {code_set_group}
GROUP BY co.person_id, cohort.standard_code_set
HAVING 
  COUNT(DISTINCT co.condition_concept_id) >= 1
)

SELECT * FROM
  (SELECT person_id, code_set, 'ADHD_CASE_TYPE1' AS cohort_type
    FROM (
    (SELECT person_id, code_set FROM age_criteria_1
      INTERSECT DISTINCT 
     SELECT person_id, code_set FROM dx_case_inclusion_criteria_1
      INTERSECT DISTINCT 
     SELECT person_id, code_set FROM rx_case_inclusion_criteria_1)
      EXCEPT DISTINCT
     SELECT person_id, code_set FROM dx_case_exclusion_criteria_1)
   GROUP BY person_id, code_set, cohort_type)

  UNION ALL

  (SELECT person_id, code_set, 'ADHD_CASE_TYPE2' AS cohort_type
    FROM (
    (SELECT person_id, code_set FROM age_criteria_1
      INTERSECT DISTINCT 
    SELECT person_id, code_set FROM dx_case_inclusion_criteria_2)
      EXCEPT DISTINCT
    SELECT person_id, code_set FROM dx_case_exclusion_criteria_1)
  GROUP BY person_id, code_set, cohort_type)

  UNION ALL 

  (SELECT p.person_id, {code_set_group} AS code_set, 'ADHD_CONTROL' AS cohort_type
    FROM {database}.person p
    WHERE p.person_id NOT IN (
      SELECT person_id FROM rx_control_exclusion_criteria_1
        UNION DISTINCT
      SELECT person_id FROM dx_control_exclusion_criteria_1
        UNION DISTINCT
      SELECT person_id FROM dx_control_exclusion_criteria_2)
    AND p.person_id IN (
      SELECT person_id FROM age_criteria_2
        INTERSECT DISTINCT
      SELECT person_id FROM visit_criteria_1)
  GROUP BY p.person_id, code_set, cohort_type);
mgkahn commented 5 years ago

One way of simplifying the Age query using DISTINCT in SELECT to do the same job as the GROUP BY (will be much more efficient):

WITH age_criteria_1 AS ( SELECT DISTINCT co.person_id, cohort.standard_code_set AS code_set FROM {database}.person p JOIN {database}.condition_occurrence co ON (p.person_id = co.person_id) JOIN {database}.ADHD_COHORT_VARS cohort ON (co.condition_concept_id = CAST(cohort.standard_concept_id as int64)) WHERE cohort.phenotype_definition_number = 1 AND cohort.standard_code_set = {code_set_group} AND DATETIME_DIFF(DATETIME(co.condition_start_datetime), DATETIME(p.birth_datetime), DAY) >= 1460 ),

mgkahn commented 5 years ago

In dx_case_inclusion_criteria1 (and 2), you use admittingSOURCE_concept_id for visit types (the SOURCE_concept_id). Is that because you are not expecting to find the non-standard type codes in the concept_id field? For PEDSnet (but not OHDSI), non-standard codes were allowed in concept_ids. But I understand if you want to stick to OHDSI conventions.

mgkahn commented 5 years ago

Same simplification as used in Age example above using DISTINCT. I won't show any more of these. GROUP BY is an expensive operation and it has some resource limitations on GBQ for large queries (it ends up having to be executed on a single node. So eliminating the HAVING where at least one instance is acceptable in the criteria is a good thing. You cannot use this simplification if you need to count two or more like you need in your DX visits criteria.

rx_case_inclusion_criteria_1 AS ( SELECT DISTINCT de.person_id, cohort.standard_code_set AS code_set FROM {database}.drug_exposure de JOIN {database}.ADHD_COHORT_VARS cohort ON (de.drug_concept_id = CAST(cohort.standard_concept_id AS int64)) WHERE cohort.phenotype_definition_number = 2 AND cohort.standard_code_set = {code_set_group} ),

mgkahn commented 5 years ago

I don't see what this clause is buying you: COUNT(DISTINCT co.condition_concept_id) >= 1 You will have at least one condition_concept_id else your JOIN will return NULL and your other COUNT clause will fail as you want.

dx_case_inclusion_criteria_2 AS ( SELECT co.person_id, cohort.standard_code_set AS code_set FROM {database}.condition_occurrence co JOIN {database}.visit_occurrence v ON (co.visit_occurrence_id = v.visit_occurrence_id) JOIN {database}.ADHD_COHORT_VARS cohort ON (co.condition_concept_id = CAST(cohort.standard_concept_id as int64)) WHERE v.admitting_source_concept_id IN (44814672, 8870, 38004279, 8863, 8546) AND cohort.phenotype_definition_number = 1 AND cohort.standard_code_set = {code_set_group} GROUP BY co.person_id, cohort.standard_code_set HAVING COUNT(DISTINCT co.condition_concept_id) >= 1 AND COUNT(DISTINCT v.visit_start_date) > 2 ),

callahantiff commented 5 years ago

One way of simplifying the Age query using DISTINCT in SELECT to do the same job as the GROUP BY (will be much more efficient):

WITH age_criteria_1 AS ( SELECT DISTINCT co.person_id, cohort.standard_code_set AS code_set FROM {database}.person p JOIN {database}.condition_occurrence co ON (p.person_id = co.person_id) JOIN {database}.ADHD_COHORT_VARS cohort ON (co.condition_concept_id = CAST(cohort.standard_concept_id as int64)) WHERE cohort.phenotype_definition_number = 1 AND cohort.standard_code_set = {code_set_group} AND DATETIME_DIFF(DATETIME(co.condition_start_datetime), DATETIME(p.birth_datetime), DAY) >= 1460 ),

For this comment and this one -- I had originally used DISTINCT instead of GROUP BY, but noticed that it was taking forever. In some stackoverflow post, I found someone had benchmarked a comparison to the same queries using each approach. When I switched to using GROUP BY instead, my queries all ran much faster. Are you suggesting I switch back?

callahantiff commented 5 years ago

I don't see what this clause is buying you: COUNT(DISTINCT co.condition_concept_id) >= 1 You will have at least one condition_concept_id else your JOIN will return NULL and your other COUNT clause will fail as you want.

dx_case_inclusion_criteria_2 AS ( SELECT co.person_id, cohort.standard_code_set AS code_set FROM {database}.condition_occurrence co JOIN {database}.visit_occurrence v ON (co.visit_occurrence_id = v.visit_occurrence_id) JOIN {database}.ADHD_COHORT_VARS cohort ON (co.condition_concept_id = CAST(cohort.standard_concept_id as int64)) WHERE v.admitting_source_concept_id IN (44814672, 8870, 38004279, 8863, 8546) AND cohort.phenotype_definition_number = 1 AND cohort.standard_code_set = {code_set_group} GROUP BY co.person_id, cohort.standard_code_set HAVING COUNT(DISTINCT co.condition_concept_id) >= 1 AND COUNT(DISTINCT v.visit_start_date) > 2 ),

Very true, I was just trying to be very explicit in the syntax. Do you suggest I remove it?

mgkahn commented 5 years ago

Simplify visit_criteria_1 as with the others. Also, rather than DATETIME_ADD can't you use DATEDIFF, which may be easier to understand (does DATEDIFF work with DATETIMEs?):

WHERE DATEDIFF(visit_end_datetime,DATE(SELECT DATETIME(MAX(visit_end_datetime)) from {database}.visit_occurrence),YEAR) >=5

visit_criteria_1 AS ( SELECT DISTINCT v.person_id, {code_set_group} AS code_set FROM {database}.visit_occurrence v WHERE DATETIME(visit_end_datetime) >= DATETIME_ADD((SELECT DATETIME(MAX(visit_end_datetime)) FROM {database}.visit_occurrence), INTERVAL -5 YEAR) ),

callahantiff commented 5 years ago

In dx_case_inclusion_criteria1 (and 2), you use admittingSOURCE_concept_id for visit types (the SOURCE_concept_id). Is that because you are not expecting to find the non-standard type codes in the concept_id field? For PEDSnet (but not OHDSI), non-standard codes were allowed in concept_ids. But I understand if you want to stick to OHDSI conventions.

I was trying as best as possible to stick with OHDSI conventions for these queries. I also thought it would be more consistent when applying the same queries to the OMOP-MIMIC data. Are you OK with me sticking with this approach?

mgkahn commented 5 years ago

Your final logic statement -- as before you can can the GROUP BY and replace the SELECT * with SELECT DISTINCT ......

mgkahn commented 5 years ago

My example for your CONTROLS allows you to keep the code_set

callahantiff commented 5 years ago

How do I add comments to your comments? Does replying to your emails add this to the comment? From: "Tiffany J. Callahan" notifications@github.com Reply-To: callahantiff/PheKnowVec reply@reply.github.com Date: Sunday, August 4, 2019 at 3:05 PM To: callahantiff/PheKnowVec PheKnowVec@noreply.github.com Cc: "Kahn, Michael" Michael.Kahn@ucdenver.edu, Mention mention@noreply.github.com Subject: Re: [callahantiff/PheKnowVec] Query Verification: ADHD Cohort (#100) In dx_case_inclusion_criteria_1 (and 2), you use admitting_SOURCE_concept_id for visit types (the SOURCE_concept_id). Is that because you are not expecting to find the non-standard type codes in the concept_id field? For PEDSnet (but not OHDSI), non-standard codes were allowed in concept_ids. But I understand if you want to stick to OHDSI conventions. I was trying as best as possible to stick with OHDSI conventions for these queries. I also thought it would be more consistent when applying the same queries to the OMOP-MIMIC data. Are you OK with me sticking with this approach? — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub<#100?email_source=notifications&email_token=AA557TS6SYVITPQIFQW6UBLQC477VA5CNFSM4IF3B2XKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD3QJVDI#issuecomment-518036109>, or mute the threadhttps://github.com/notifications/unsubscribe-auth/AA557TQWGTUCFPJAIRDDN4LQC477VANCNFSM4IF3B2XA.

Can either reply via an email or right click on the three dots in the upper-right corner of the comment you want to reply to (see screenshot).

Screen Shot 2019-08-04 at 15 16 55
callahantiff commented 5 years ago

My example for your CONTROLS allows you to keep the code_set

The example sent does not include the code_set_group variable and instead is returning the name of the logic chunk. I also was having a hard time getting it to run. Sorry if I am doing something wrong.

mgkahn commented 5 years ago

One way of simplifying the Age query using DISTINCT in SELECT to do the same job as the GROUP BY (will be much more efficient): WITH age_criteria_1 AS ( SELECT DISTINCT co.person_id, cohort.standard_code_set AS code_set FROM {database}.person p JOIN {database}.condition_occurrence co ON (p.person_id = co.person_id) JOIN {database}.ADHD_COHORT_VARS cohort ON (co.condition_concept_id = CAST(cohort.standard_concept_id as int64)) WHERE cohort.phenotype_definition_number = 1 AND cohort.standard_code_set = {code_set_group} AND DATETIME_DIFF(DATETIME(co.condition_start_datetime), DATETIME(p.birth_datetime), DAY) >= 1460 ),

For this commnt and this one -- I had originally used DISTINCT instead of GROUP BY, but noticed that it was taking forever. In some stackoverflow post, I found someone had benchmarked a comparison to the same queries using each approach. When I switched to using GROUP BY instead, my queries all ran much faster. Are you suggesting I switch back?

INTERESTING. COUNTER_INTUTIVE and I think HDC has lots of issues with running out of resources when using GROUP BY. Can you re-discover that post? I'd like to show the HDC folks. The two approaches are semantically the same so use whichever one runs best. I thought that the not GROUP BY version would be much faster because it could be parallelized more and I thought HDC had issues with GROUP BY. I KNOW we have issues with ORDER BY requiring the entire query to run on a single GBQ node.

mgkahn commented 5 years ago

I don't see what this clause is buying you: COUNT(DISTINCT co.condition_concept_id) >= 1 You will have at least one condition_concept_id else your JOIN will return NULL and your other COUNT clause will fail as you want. dx_case_inclusion_criteria_2 AS ( SELECT co.person_id, cohort.standard_code_set AS code_set FROM {database}.condition_occurrence co JOIN {database}.visit_occurrence v ON (co.visit_occurrence_id = v.visit_occurrence_id) JOIN {database}.ADHD_COHORT_VARS cohort ON (co.condition_concept_id = CAST(cohort.standard_concept_id as int64)) WHERE v.admitting_source_concept_id IN (44814672, 8870, 38004279, 8863, 8546) AND cohort.phenotype_definition_number = 1 AND cohort.standard_code_set = {code_set_group} GROUP BY co.person_id, cohort.standard_code_set HAVING COUNT(DISTINCT co.condition_concept_id) >= 1 AND COUNT(DISTINCT v.visit_start_date) > 2 ),

Very true, I was just trying to be very explicit in the syntax. Do you suggest I remove it?

I would have said "yes" but it seems GROUP BY are not as costly as I have believed so I'm agnostic.

mgkahn commented 5 years ago

In dx_case_inclusion_criteria1 (and 2), you use admittingSOURCE_concept_id for visit types (the SOURCE_concept_id). Is that because you are not expecting to find the non-standard type codes in the concept_id field? For PEDSnet (but not OHDSI), non-standard codes were allowed in concept_ids. But I understand if you want to stick to OHDSI conventions.

I was trying as best as possible to stick with OHDSI conventions for these queries. I also thought it would be more consistent when applying the same queries to the OMOP-MIMIC data. Are you OK with me sticking with this approach?

Sure. Just wanted to confirm reasoning.

callahantiff commented 5 years ago

One way of simplifying the Age query using DISTINCT in SELECT to do the same job as the GROUP BY (will be much more efficient): WITH age_criteria_1 AS ( SELECT DISTINCT co.person_id, cohort.standard_code_set AS code_set FROM {database}.person p JOIN {database}.condition_occurrence co ON (p.person_id = co.person_id) JOIN {database}.ADHD_COHORT_VARS cohort ON (co.condition_concept_id = CAST(cohort.standard_concept_id as int64)) WHERE cohort.phenotype_definition_number = 1 AND cohort.standard_code_set = {code_set_group} AND DATETIME_DIFF(DATETIME(co.condition_start_datetime), DATETIME(p.birth_datetime), DAY) >= 1460 ),

For this commnt and this one -- I had originally used DISTINCT instead of GROUP BY, but noticed that it was taking forever. In some stackoverflow post, I found someone had benchmarked a comparison to the same queries using each approach. When I switched to using GROUP BY instead, my queries all ran much faster. Are you suggesting I switch back?

INTERESTING. COUNTER_INTUTIVE and I think HDC has lots of issues with running out of resources when using GROUP BY. Can you re-discover that post? I'd like to show the HDC folks. The two approaches are semantically the same so use whichever one runs best. I thought that the not GROUP BY version would be much faster because it could be parallelized more and I thought HDC had issues with GROUP BY. I KNOW we have issues with ORDER BY requiring the entire query to run on a single GBQ node.

OK, I believe this is the link that I found. I acknowledge it's not GBQ-specific, but I do know that I was unable to get my query to complete using DISTINCT and when I switch to GROUP BY it went WAY faster.

callahantiff commented 5 years ago

Simplify visit_criteria_1 as with the others. Also, rather than DATETIME_ADD can't you use DATEDIFF, which may be easier to understand (does DATEDIFF work with DATETIMEs?):

WHERE DATEDIFF(visit_end_datetime,DATE(SELECT DATETIME(MAX(visit_end_datetime)) from {database}.visit_occurrence),YEAR) >=5

visit_criteria_1 AS ( SELECT DISTINCT v.person_id, {code_set_group} AS code_set FROM {database}.visit_occurrence v WHERE DATETIME(visit_end_datetime) >= DATETIME_ADD((SELECT DATETIME(MAX(visit_end_datetime)) FROM {database}.visit_occurrence), INTERVAL -5 YEAR) ),

I am not finding a better option using DATEDIFF. I wanted to share the results generated using DISTINCT (Query 1) versus GROUP BY (QUERY 2):

QUERY 1: 22.5 seconds returning 5406779 rows

SELECT DISTINCT v.person_id, 'fuzzy_syn_desc_stand_none_desc' AS code_set, visit_end_datetime
FROM
    CHCO_DeID_Oct2018.visit_occurrence v
WHERE
    DATETIME(visit_end_datetime) >= DATETIME_ADD((SELECT DATETIME(MAX(visit_end_datetime)) FROM CHCO_DeID_Oct2018.visit_occurrence), INTERVAL -5 YEAR)

QUERY 2: 21.3 seconds returning 5406779 rows

SELECT v.person_id, 'fuzzy_syn_desc_stand_none_desc' AS code_set
FROM 
  CHCO_DeID_Oct2018.visit_occurrence v
GROUP BY 
  v.person_id, v.visit_end_datetime 
HAVING
  DATETIME(visit_end_datetime) >= DATETIME_ADD((SELECT DATETIME(MAX(visit_end_datetime)) FROM CHCO_DeID_Oct2018.visit_occurrence), INTERVAL -5 YEAR)
callahantiff commented 5 years ago

OK, unless you want me to switch out GROUP BY for DISTINCT, I think this query is good to go. I will close this for now, but please re-open if you disagree!