Closed callahantiff closed 5 years ago
One way of simplifying the Age query using DISTINCT in SELECT to do the same job as the GROUP BY (will be much more efficient):
WITH age_criteria_1 AS ( SELECT DISTINCT co.person_id, cohort.standard_code_set AS code_set FROM {database}.person p JOIN {database}.condition_occurrence co ON (p.person_id = co.person_id) JOIN {database}.ADHD_COHORT_VARS cohort ON (co.condition_concept_id = CAST(cohort.standard_concept_id as int64)) WHERE cohort.phenotype_definition_number = 1 AND cohort.standard_code_set = {code_set_group} AND DATETIME_DIFF(DATETIME(co.condition_start_datetime), DATETIME(p.birth_datetime), DAY) >= 1460 ),
In dx_case_inclusion_criteria1 (and 2), you use admittingSOURCE_concept_id for visit types (the SOURCE_concept_id). Is that because you are not expecting to find the non-standard type codes in the concept_id field? For PEDSnet (but not OHDSI), non-standard codes were allowed in concept_ids. But I understand if you want to stick to OHDSI conventions.
Same simplification as used in Age example above using DISTINCT. I won't show any more of these. GROUP BY is an expensive operation and it has some resource limitations on GBQ for large queries (it ends up having to be executed on a single node. So eliminating the HAVING where at least one instance is acceptable in the criteria is a good thing. You cannot use this simplification if you need to count two or more like you need in your DX visits criteria.
rx_case_inclusion_criteria_1 AS ( SELECT DISTINCT de.person_id, cohort.standard_code_set AS code_set FROM {database}.drug_exposure de JOIN {database}.ADHD_COHORT_VARS cohort ON (de.drug_concept_id = CAST(cohort.standard_concept_id AS int64)) WHERE cohort.phenotype_definition_number = 2 AND cohort.standard_code_set = {code_set_group} ),
I don't see what this clause is buying you: COUNT(DISTINCT co.condition_concept_id) >= 1 You will have at least one condition_concept_id else your JOIN will return NULL and your other COUNT clause will fail as you want.
dx_case_inclusion_criteria_2 AS ( SELECT co.person_id, cohort.standard_code_set AS code_set FROM {database}.condition_occurrence co JOIN {database}.visit_occurrence v ON (co.visit_occurrence_id = v.visit_occurrence_id) JOIN {database}.ADHD_COHORT_VARS cohort ON (co.condition_concept_id = CAST(cohort.standard_concept_id as int64)) WHERE v.admitting_source_concept_id IN (44814672, 8870, 38004279, 8863, 8546) AND cohort.phenotype_definition_number = 1 AND cohort.standard_code_set = {code_set_group} GROUP BY co.person_id, cohort.standard_code_set HAVING COUNT(DISTINCT co.condition_concept_id) >= 1 AND COUNT(DISTINCT v.visit_start_date) > 2 ),
One way of simplifying the Age query using DISTINCT in SELECT to do the same job as the GROUP BY (will be much more efficient):
WITH age_criteria_1 AS ( SELECT DISTINCT co.person_id, cohort.standard_code_set AS code_set FROM {database}.person p JOIN {database}.condition_occurrence co ON (p.person_id = co.person_id) JOIN {database}.ADHD_COHORT_VARS cohort ON (co.condition_concept_id = CAST(cohort.standard_concept_id as int64)) WHERE cohort.phenotype_definition_number = 1 AND cohort.standard_code_set = {code_set_group} AND DATETIME_DIFF(DATETIME(co.condition_start_datetime), DATETIME(p.birth_datetime), DAY) >= 1460 ),
For this comment and this one -- I had originally used DISTINCT
instead of GROUP BY
, but noticed that it was taking forever. In some stackoverflow post, I found someone had benchmarked a comparison to the same queries using each approach. When I switched to using GROUP BY
instead, my queries all ran much faster. Are you suggesting I switch back?
I don't see what this clause is buying you: COUNT(DISTINCT co.condition_concept_id) >= 1 You will have at least one condition_concept_id else your JOIN will return NULL and your other COUNT clause will fail as you want.
dx_case_inclusion_criteria_2 AS ( SELECT co.person_id, cohort.standard_code_set AS code_set FROM {database}.condition_occurrence co JOIN {database}.visit_occurrence v ON (co.visit_occurrence_id = v.visit_occurrence_id) JOIN {database}.ADHD_COHORT_VARS cohort ON (co.condition_concept_id = CAST(cohort.standard_concept_id as int64)) WHERE v.admitting_source_concept_id IN (44814672, 8870, 38004279, 8863, 8546) AND cohort.phenotype_definition_number = 1 AND cohort.standard_code_set = {code_set_group} GROUP BY co.person_id, cohort.standard_code_set HAVING COUNT(DISTINCT co.condition_concept_id) >= 1 AND COUNT(DISTINCT v.visit_start_date) > 2 ),
Very true, I was just trying to be very explicit in the syntax. Do you suggest I remove it?
Simplify visit_criteria_1 as with the others. Also, rather than DATETIME_ADD can't you use DATEDIFF, which may be easier to understand (does DATEDIFF work with DATETIMEs?):
WHERE DATEDIFF(visit_end_datetime,DATE(SELECT DATETIME(MAX(visit_end_datetime)) from {database}.visit_occurrence),YEAR) >=5
visit_criteria_1 AS ( SELECT DISTINCT v.person_id, {code_set_group} AS code_set FROM {database}.visit_occurrence v WHERE DATETIME(visit_end_datetime) >= DATETIME_ADD((SELECT DATETIME(MAX(visit_end_datetime)) FROM {database}.visit_occurrence), INTERVAL -5 YEAR) ),
In dx_case_inclusion_criteria1 (and 2), you use admittingSOURCE_concept_id for visit types (the SOURCE_concept_id). Is that because you are not expecting to find the non-standard type codes in the concept_id field? For PEDSnet (but not OHDSI), non-standard codes were allowed in concept_ids. But I understand if you want to stick to OHDSI conventions.
I was trying as best as possible to stick with OHDSI conventions for these queries. I also thought it would be more consistent when applying the same queries to the OMOP-MIMIC data. Are you OK with me sticking with this approach?
Your final logic statement -- as before you can can the GROUP BY and replace the SELECT * with SELECT DISTINCT ......
My example for your CONTROLS allows you to keep the code_set
How do I add comments to your comments? Does replying to your emails add this to the comment? From: "Tiffany J. Callahan" notifications@github.com Reply-To: callahantiff/PheKnowVec reply@reply.github.com Date: Sunday, August 4, 2019 at 3:05 PM To: callahantiff/PheKnowVec PheKnowVec@noreply.github.com Cc: "Kahn, Michael" Michael.Kahn@ucdenver.edu, Mention mention@noreply.github.com Subject: Re: [callahantiff/PheKnowVec] Query Verification: ADHD Cohort (#100) In dx_case_inclusion_criteria_1 (and 2), you use admitting_SOURCE_concept_id for visit types (the SOURCE_concept_id). Is that because you are not expecting to find the non-standard type codes in the concept_id field? For PEDSnet (but not OHDSI), non-standard codes were allowed in concept_ids. But I understand if you want to stick to OHDSI conventions. I was trying as best as possible to stick with OHDSI conventions for these queries. I also thought it would be more consistent when applying the same queries to the OMOP-MIMIC data. Are you OK with me sticking with this approach? — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub<#100?email_source=notifications&email_token=AA557TS6SYVITPQIFQW6UBLQC477VA5CNFSM4IF3B2XKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD3QJVDI#issuecomment-518036109>, or mute the threadhttps://github.com/notifications/unsubscribe-auth/AA557TQWGTUCFPJAIRDDN4LQC477VANCNFSM4IF3B2XA.
Can either reply via an email or right click on the three dots in the upper-right corner of the comment you want to reply to (see screenshot).
My example for your CONTROLS allows you to keep the code_set
The example sent does not include the code_set_group variable and instead is returning the name of the logic chunk. I also was having a hard time getting it to run. Sorry if I am doing something wrong.
One way of simplifying the Age query using DISTINCT in SELECT to do the same job as the GROUP BY (will be much more efficient): WITH age_criteria_1 AS ( SELECT DISTINCT co.person_id, cohort.standard_code_set AS code_set FROM {database}.person p JOIN {database}.condition_occurrence co ON (p.person_id = co.person_id) JOIN {database}.ADHD_COHORT_VARS cohort ON (co.condition_concept_id = CAST(cohort.standard_concept_id as int64)) WHERE cohort.phenotype_definition_number = 1 AND cohort.standard_code_set = {code_set_group} AND DATETIME_DIFF(DATETIME(co.condition_start_datetime), DATETIME(p.birth_datetime), DAY) >= 1460 ),
For this commnt and this one -- I had originally used
DISTINCT
instead ofGROUP BY
, but noticed that it was taking forever. In some stackoverflow post, I found someone had benchmarked a comparison to the same queries using each approach. When I switched to usingGROUP BY
instead, my queries all ran much faster. Are you suggesting I switch back?
INTERESTING. COUNTER_INTUTIVE and I think HDC has lots of issues with running out of resources when using GROUP BY. Can you re-discover that post? I'd like to show the HDC folks. The two approaches are semantically the same so use whichever one runs best. I thought that the not GROUP BY version would be much faster because it could be parallelized more and I thought HDC had issues with GROUP BY. I KNOW we have issues with ORDER BY requiring the entire query to run on a single GBQ node.
I don't see what this clause is buying you: COUNT(DISTINCT co.condition_concept_id) >= 1 You will have at least one condition_concept_id else your JOIN will return NULL and your other COUNT clause will fail as you want. dx_case_inclusion_criteria_2 AS ( SELECT co.person_id, cohort.standard_code_set AS code_set FROM {database}.condition_occurrence co JOIN {database}.visit_occurrence v ON (co.visit_occurrence_id = v.visit_occurrence_id) JOIN {database}.ADHD_COHORT_VARS cohort ON (co.condition_concept_id = CAST(cohort.standard_concept_id as int64)) WHERE v.admitting_source_concept_id IN (44814672, 8870, 38004279, 8863, 8546) AND cohort.phenotype_definition_number = 1 AND cohort.standard_code_set = {code_set_group} GROUP BY co.person_id, cohort.standard_code_set HAVING COUNT(DISTINCT co.condition_concept_id) >= 1 AND COUNT(DISTINCT v.visit_start_date) > 2 ),
Very true, I was just trying to be very explicit in the syntax. Do you suggest I remove it?
I would have said "yes" but it seems GROUP BY are not as costly as I have believed so I'm agnostic.
In dx_case_inclusion_criteria1 (and 2), you use admittingSOURCE_concept_id for visit types (the SOURCE_concept_id). Is that because you are not expecting to find the non-standard type codes in the concept_id field? For PEDSnet (but not OHDSI), non-standard codes were allowed in concept_ids. But I understand if you want to stick to OHDSI conventions.
I was trying as best as possible to stick with OHDSI conventions for these queries. I also thought it would be more consistent when applying the same queries to the OMOP-MIMIC data. Are you OK with me sticking with this approach?
Sure. Just wanted to confirm reasoning.
One way of simplifying the Age query using DISTINCT in SELECT to do the same job as the GROUP BY (will be much more efficient): WITH age_criteria_1 AS ( SELECT DISTINCT co.person_id, cohort.standard_code_set AS code_set FROM {database}.person p JOIN {database}.condition_occurrence co ON (p.person_id = co.person_id) JOIN {database}.ADHD_COHORT_VARS cohort ON (co.condition_concept_id = CAST(cohort.standard_concept_id as int64)) WHERE cohort.phenotype_definition_number = 1 AND cohort.standard_code_set = {code_set_group} AND DATETIME_DIFF(DATETIME(co.condition_start_datetime), DATETIME(p.birth_datetime), DAY) >= 1460 ),
For this commnt and this one -- I had originally used
DISTINCT
instead ofGROUP BY
, but noticed that it was taking forever. In some stackoverflow post, I found someone had benchmarked a comparison to the same queries using each approach. When I switched to usingGROUP BY
instead, my queries all ran much faster. Are you suggesting I switch back?INTERESTING. COUNTER_INTUTIVE and I think HDC has lots of issues with running out of resources when using GROUP BY. Can you re-discover that post? I'd like to show the HDC folks. The two approaches are semantically the same so use whichever one runs best. I thought that the not GROUP BY version would be much faster because it could be parallelized more and I thought HDC had issues with GROUP BY. I KNOW we have issues with ORDER BY requiring the entire query to run on a single GBQ node.
OK, I believe this is the link that I found. I acknowledge it's not GBQ-specific, but I do know that I was unable to get my query to complete using DISTINCT
and when I switch to GROUP BY
it went WAY faster.
Simplify visit_criteria_1 as with the others. Also, rather than DATETIME_ADD can't you use DATEDIFF, which may be easier to understand (does DATEDIFF work with DATETIMEs?):
WHERE DATEDIFF(visit_end_datetime,DATE(SELECT DATETIME(MAX(visit_end_datetime)) from {database}.visit_occurrence),YEAR) >=5
visit_criteria_1 AS ( SELECT DISTINCT v.person_id, {code_set_group} AS code_set FROM {database}.visit_occurrence v WHERE DATETIME(visit_end_datetime) >= DATETIME_ADD((SELECT DATETIME(MAX(visit_end_datetime)) FROM {database}.visit_occurrence), INTERVAL -5 YEAR) ),
I am not finding a better option using DATEDIFF
. I wanted to share the results generated using DISTINCT
(Query 1) versus GROUP BY
(QUERY 2):
QUERY 1: 22.5 seconds returning 5406779 rows
SELECT DISTINCT v.person_id, 'fuzzy_syn_desc_stand_none_desc' AS code_set, visit_end_datetime
FROM
CHCO_DeID_Oct2018.visit_occurrence v
WHERE
DATETIME(visit_end_datetime) >= DATETIME_ADD((SELECT DATETIME(MAX(visit_end_datetime)) FROM CHCO_DeID_Oct2018.visit_occurrence), INTERVAL -5 YEAR)
QUERY 2: 21.3 seconds returning 5406779 rows
SELECT v.person_id, 'fuzzy_syn_desc_stand_none_desc' AS code_set
FROM
CHCO_DeID_Oct2018.visit_occurrence v
GROUP BY
v.person_id, v.visit_end_datetime
HAVING
DATETIME(visit_end_datetime) >= DATETIME_ADD((SELECT DATETIME(MAX(visit_end_datetime)) FROM CHCO_DeID_Oct2018.visit_occurrence), INTERVAL -5 YEAR)
OK, unless you want me to switch out GROUP BY
for DISTINCT
, I think this query is good to go. I will close this for now, but please re-open if you disagree!
@mgkahn - Can you please help me verify the query to select ADHD patients?
COHORT CRITERIA Case Criteria:
CASE TYPE 1
visit_admitting_source_concept_id
):44814672
)8870
)38004279
)8863
)8546
)CASE TYPE 2
visit_admitting_source_concept_id
):44814672
)8870
)38004279
)8863
)8546
)Control Criteria:
visit_end_date
in EHR)Cohort Logic Table
NOTE.
{database}
withCHCO_DeID_Oct2018
{code_set_group}
Query can be found here and is also included below: