callahantiff / PheKnowVec

Translational Computational Phenotyping
2 stars 0 forks source link

Query Verification: Crohn's Disease Cohort #102

Open callahantiff opened 5 years ago

callahantiff commented 5 years ago

@mgkahn - Can you please help me verify the query to select Crohn's Disease patients?


COHORT CRITERIA Case Criteria:
CASE TYPE 1: Crohn’s and medications only

CASE TYPE 2: Crohn’s and medications with Ulcerative Colitis Diagnosis


Control Criteria:



Cohort Logic Table

COHORT CHUNK LOGICAL OPERATOR
CASE TYPE 1 dx_case_inclusion_criteria_1 AND
CASE TYPE 1 rx_case_inclusion_criteria_1 AND
CASE TYPE 1 dx_case_exclusion_criteria_1 ---
CASE TYPE 2 dx_case_inclusion_criteria_1 AND
CASE TYPE 2 rx_case_inclusion_criteria_1 AND
CASE TYPE 2 dx_case_inclusion_criteria_2 ---
CONTROL dx_control_exclusion_criteria_1 AND
CONTROL rx_control_exclusion_criteria_1 AND
CONTROL dx_control_exclusion_criteria_2 AND
CONTROL dx_control_exclusion_criteria_3 AND
CONTROL dx_control_exclusion_criteria_4 ---

NOTE.

  1. Replace all instances of {database} with CHCO_DeID_Oct2018
  2. Comment out all instances of {code_set_group}

Query can be found here and is also included below:

WITH dx_case_inclusion_criteria_1 AS (
SELECT co.person_id, cohort.standard_code_set AS code_set 
FROM 
  {database}.condition_occurrence co, 
  {database}.CROHNSDISEASE_COHORT_VARS cohort 
WHERE 
  co.condition_concept_id = CAST(cohort.standard_concept_id AS int64) 
  AND cohort.phenotype_definition_number = 1
  AND cohort.standard_code_set = {code_set_group}
GROUP BY 
  co.person_id, cohort.standard_code_set 
HAVING 
  COUNT(DISTINCT co.condition_concept_id) > 2
),

rx_case_inclusion_criteria_1 AS (
SELECT de.person_id, cohort.standard_code_set AS code_set 
FROM 
  {database}.drug_exposure de, 
  {database}.CROHNSDISEASE_COHORT_VARS cohort 
WHERE 
  de.drug_concept_id = CAST(cohort.standard_concept_id AS int64) 
  AND cohort.phenotype_definition_number = 2 
  AND cohort.standard_code_set = {code_set_group}
GROUP BY 
  de.person_id, cohort.standard_code_set 
HAVING 
  COUNT(DISTINCT de.drug_concept_id) >= 1
),

dx_case_exclusion_criteria_1 AS (
SELECT co.person_id, cohort.standard_code_set AS code_set
FROM 
  {database}.condition_occurrence co,
  {database}.CROHNSDISEASE_COHORT_VARS cohort
WHERE 
  co.condition_concept_id = CAST(cohort.standard_concept_id AS int64)
  AND cohort.phenotype_definition_number = 3
  AND cohort.standard_code_set = {code_set_group}
GROUP BY co.person_id, cohort.standard_code_set
HAVING 
  COUNT(DISTINCT co.condition_concept_id) >= 1
),

dx_case_inclusion_criteria_2 AS (
SELECT co.person_id, cohort.standard_code_set AS code_set 
FROM 
  {database}.condition_occurrence co, 
  {database}.CROHNSDISEASE_COHORT_VARS cohort 
WHERE 
  co.condition_concept_id = CAST(cohort.standard_concept_id AS int64) 
  AND cohort.phenotype_definition_number = 4
  AND cohort.standard_code_set = {code_set_group}
GROUP BY 
  co.person_id, cohort.standard_code_set 
HAVING 
  COUNT(DISTINCT co.condition_concept_id) > 2
),

dx_control_exclusion_criteria_1 AS (
SELECT co.person_id, cohort.standard_code_set AS code_set 
FROM 
  {database}.condition_occurrence co, 
  {database}.CROHNSDISEASE_COHORT_VARS cohort 
WHERE 
  co.condition_concept_id = CAST(cohort.standard_concept_id AS int64) 
  AND cohort.phenotype_definition_number = 5
  AND cohort.standard_code_set = {code_set_group}
GROUP BY 
  co.person_id, cohort.standard_code_set 
HAVING 
  COUNT(DISTINCT co.condition_concept_id) > 2
),

rx_control_exclusion_criteria_1 AS (
SELECT de.person_id, cohort.standard_code_set AS code_set 
FROM 
  {database}.drug_exposure de, 
  {database}.CROHNSDISEASE_COHORT_VARS cohort 
WHERE 
  de.drug_concept_id = CAST(cohort.standard_concept_id AS int64) 
  AND cohort.phenotype_definition_number = 6
  AND cohort.standard_code_set = {code_set_group}
GROUP BY 
  de.person_id, cohort.standard_code_set 
HAVING 
  COUNT(DISTINCT de.drug_concept_id) >= 1
),

dx_control_exclusion_criteria_2 AS (
SELECT co.person_id, cohort.standard_code_set AS code_set 
FROM 
  {database}.condition_occurrence co, 
  {database}.CROHNSDISEASE_COHORT_VARS cohort 
WHERE 
  co.condition_concept_id = CAST(cohort.standard_concept_id AS int64) 
  AND cohort.phenotype_definition_number = 7
  AND cohort.standard_code_set = {code_set_group}
GROUP BY 
  co.person_id, cohort.standard_code_set 
HAVING 
  COUNT(DISTINCT co.condition_concept_id) > 2
),

dx_control_exclusion_criteria_3 AS (
SELECT co.person_id, cohort.standard_code_set AS code_set 
FROM 
  {database}.condition_occurrence co, 
  {database}.CROHNSDISEASE_COHORT_VARS cohort 
WHERE 
  co.condition_concept_id = CAST(cohort.standard_concept_id AS int64) 
  AND cohort.phenotype_definition_number = 8
  AND cohort.standard_code_set = {code_set_group}
GROUP BY 
  co.person_id, cohort.standard_code_set 
HAVING 
  COUNT(DISTINCT co.condition_concept_id) >= 1
),

dx_control_exclusion_criteria_4 AS (
SELECT co.person_id, cohort.standard_code_set AS code_set 
FROM 
  {database}.condition_occurrence co, 
  {database}.CROHNSDISEASE_COHORT_VARS cohort 
WHERE 
  co.condition_concept_id = CAST(cohort.standard_concept_id AS int64) 
  AND cohort.phenotype_definition_number = 9
  AND cohort.standard_code_set = {code_set_group}
GROUP BY 
  co.person_id, cohort.standard_code_set 
HAVING 
  COUNT(DISTINCT co.condition_concept_id) >= 1
)

SELECT * FROM
  (SELECT person_id, code_set, 'CROHNSDISEASE_CASE_TYPE1' AS cohort_type
    FROM (
    (SELECT person_id, code_set FROM dx_case_inclusion_criteria_1
      INTERSECT DISTINCT 
     SELECT person_id, code_set FROM rx_case_inclusion_criteria_1)
      EXCEPT DISTINCT
     SELECT person_id, code_set FROM dx_case_exclusion_criteria_1)
   GROUP BY person_id, code_set, cohort_type)

  UNION ALL

  (SELECT person_id, code_set, 'CROHNSDISEASE_CASE_TYPE2' AS cohort_type
    FROM (
     SELECT person_id, code_set FROM dx_case_inclusion_criteria_1
      INTERSECT DISTINCT 
     SELECT person_id, code_set FROM rx_case_inclusion_criteria_1
      INTERSECT DISTINCT 
     SELECT person_id, code_set FROM dx_case_inclusion_criteria_2)
   GROUP BY person_id, code_set, cohort_type)

  UNION ALL
  (SELECT p.person_id, {code_set_group} AS code_set, 'CROHNSDISEASE_CONTROL' AS cohort_type
    FROM {database}.person p
    WHERE p.person_id NOT IN (
      SELECT person_id FROM dx_control_exclusion_criteria_1
        INTERSECT DISTINCT
       SELECT person_id FROM rx_control_exclusion_criteria_1
        INTERSECT DISTINCT
       SELECT person_id FROM dx_control_exclusion_criteria_2)
        INTERSECT DISTINCT
       SELECT person_id FROM dx_control_exclusion_criteria_3
        INTERSECT DISTINCT
       SELECT person_id FROM dx_control_exclusion_criteria_4)
  GROUP BY p.person_id, code_set, cohort_type)
;
mgkahn commented 5 years ago

Edited to reflect "three or more" (>2) not 2.

dx_case_inclusion_criteria_1: The SQL is written as if the criteria required the 3 or more occurrences to be different inclusion diagnoses. Another interpretation of the same English is that the same/different inclusion diagnosis appears in three different encounters. If the second sentence is the intended interpretation, then you need to bring in three visit_occurrences (v1, v2,v3), make sure then are not equal (e.g. V1< V2, V2 < V3) and look at their linked diagnoses independently as I sketch out with a previous phenotype that needed two visits 30 days apart.

mgkahn commented 5 years ago

dx_case_exclusion_criteria_1: I do not see justification for HAVING COUNT(DISTINCT co.condition_concept_id) > 1

The English ("any dx") implies that any instant of an exclusion criteria is qualifying. If so, a count >= 1 (including the equals) is correct.

mgkahn commented 5 years ago

dx_case_inclusion_criteria_2 Same question as previous. The current query looks for two distinct UC codes rather than the same UC code at two different visits. Make sure the query is the intended semantics of the English.

CASE TYPE 2 is a confusing case definition. CASE TYPE 1 explicitly removes patients with any UC diagnosis yet CASE TYPE 2 explicitly puts them back in. Between these two definitions, the only UC cases not included have only 1 UC Dx.

Is this the intended behavior of the combination of the two inclusion types? Seems so by the text after "CASE TYPE 2" but weird.

mgkahn commented 5 years ago

If I followed the logic for controls correctly, the query would allow a person to be declared a control if they had 0-1 relevant codes, 1 relevant medication, 0-1 ulcerative colitis codes and none of the exclusion_criteria_3 and _4 codes. Without looking at what is in exclusion_criteria_3 and _4 codes, this logic would allow somebody who had some "hints" about Crohn's and UC to be included.

mgkahn commented 5 years ago

End of comments for Crohn's

callahantiff commented 5 years ago

Edited to reflect "three or more" (>2) not 2.

dx_case_inclusion_criteria_1: The SQL is written as if the criteria required the 3 or more occurrences to be different inclusion diagnoses. Another interpretation of the same English is that the same/different inclusion diagnosis appears in three different encounters. If the second sentence is the intended interpretation, then you need to bring in three visit_occurrences (v1, v2,v3), make sure then are not equal (e.g. V1< V2, V2 < V3) and look at their linked diagnoses independently as I sketch out with a previous phenotype that needed two visits 30 days apart.

This is the only information we are given:

image

Which, means my initial line of >2 was OK, right?

callahantiff commented 5 years ago

dx_case_exclusion_criteria_1: I do not see justification for HAVING COUNT(DISTINCT co.condition_concept_id) > 1

The English ("any dx") implies that any instant of an exclusion criteria is qualifying. If so, a count >= 1 (including the equals) is correct.

OK, I agree with that. In that case, I think I need to verify all exclusion criteria for "any occurrence" and make sure it matches this.

callahantiff commented 5 years ago

dx_case_inclusion_criteria_2 Same question as previous. The current query looks for two distinct UC codes rather than the same UC code at two different visits. Make sure the query is the intended semantics of the English.

CASE TYPE 2 is a confusing case definition. CASE TYPE 1 explicitly removes patients with any UC diagnosis yet CASE TYPE 2 explicitly puts them back in. Between these two definitions, the only UC cases not included have only 1 UC Dx.

Is this the intended behavior of the combination of the two inclusion types? Seems so by the text after "CASE TYPE 2" but weird.

Here is the information I am provided:

Screen Shot 2019-08-06 at 23 06 49

Going off of this, like the first example, I believe it is represented correctly. If you still disagree, I'm happy to discuss further. I tried hard to only go off of what the text indicated and not include additional knowledge as that should represent what an "average" person might interpret. Or, at least that was my goal in doing this in the most reproducible manner.

callahantiff commented 5 years ago

If I followed the logic for controls correctly, the query would allow a person to be declared a control if they had 0-1 relevant codes, 1 relevant medication, 0-1 ulcerative colitis codes and none of the exclusion_criteria_3 and _4 codes. Without looking at what is in exclusion_criteria_3 and _4 codes, this logic would allow somebody who had some "hints" about Crohn's and UC to be included.

So, for the control criteria, the definition doc states:

image

Where "none of the above" refers to the case criteria and the "also excludes" is an additional list of ICD codes mentioned in dx_control_exclusion_criteria_3 and keywords I have included in dx_control_exclusion_criteria_4. Given this, do you agree with the way I have written the control query?

The full doc is here if you want more than than what I have included in this and the other issues.

mgkahn commented 5 years ago

So weird..... Do they provide any examples of implementations of this phenotype on the web site so we can see how others have implemented these criteria?

callahantiff commented 5 years ago

So weird..... Do they provide any examples of implementations of this phenotype on the web site so we can see how others have implemented these criteria?

This goes back to a fun conversation we had at very beginning of this project 😄. In general, most of the phenotypes we are exploring have been implemented by several universities, but the results of these implantations are not publicly available, although you can gain access to them if you create an account.

All that said, the point of this project (in my opinion) is to demonstrate what an average, unbiased approach built on top of OMOP looks like. While I understand wanting to go see how others did it, we also know others have made mistakes and how are we to know you is "right". Further, if we peek for this phenotype, then we would need to do that for the others too otherwise we are introducing more bias. I'd prefer to write a paper that shows a realistic perspective on what it's like to implement eMERGE phenotypes. I strongly believe that it's the only way we will be able to understand how these things can be automated in the future.

mgkahn commented 5 years ago

NEW COMMENT: Logic for exclusion criteria is wrong. You need an OR (UNION) rather than an AND. Current logics says that an exclusion needs to meet all exclusion criteria. My understanding is that you are excluded if you meet ANY exclusion criteria. If so, then an OR (UNION) is the right logic in your NOT IN subquery.

callahantiff commented 5 years ago

NEW COMMENT: Logic for exclusion criteria is wrong. You need an OR (UNION) rather than an AND. Current logics says that an exclusion needs to meet all exclusion criteria. My understanding is that you are excluded if you meet ANY exclusion criteria. If so, then an OR (UNION) is the right logic in your NOT IN subquery.

Thanks for bringing this up, this was something that I went back and forth on too. The criteria that we were provided says that the control group can contain "none" of the case group criteria as well as some other criteria. I interpreted that as meaning they wanted to identify people that were the opposite of the case group in that they should have none of the criteria that was used to select cases. Do you disagree? I am happy to do whatever you think is best, but wanted to explain why I chose "AND" and not "OR".