callahantiff / OMOP2OBO

OMOP2OBO: A Python Library for mapping OMOP standardized clinical terminologies to Open Biomedical Ontologies
http://tiffanycallahan.com/OMOP2OBO_Dashboard
MIT License
83 stars 12 forks source link

SQL Verification - OMOP Coverage V1 #26

Closed callahantiff closed 4 years ago

callahantiff commented 4 years ago

PURPOSE: This query is designed to query an OMOP instance and return the 6 columns listed below. This query makes the assumption that the other shops would be willing to return some results to us, rather than calculating coverage statistics locally (that version will be coming next).

QUERY TYPE: OMOP Coverage V1 RUNTIME: 19.4 seconds
RESULTS: 39,910 rows (i.e. unique CONCEPT_IDs)

TASK: @mgkahn - What do you think about this?

WITH 
  condition_concepts
  AS (SELECT
        c.condition_concept_id AS CONCEPT_ID,
        v.vocabulary_version AS VOCABULARY_VERSION,
        COUNT(DISTINCT c.visit_occurrence_id) AS VISIT_COUNT,
        COUNT(DISTINCT c.person_id) AS PATIENT_COUNT
      FROM 
        CHCO_DeID_Oct2018.condition_occurrence c 
        JOIN CHCO_DeID_Oct2018.concept c1 ON c.condition_concept_id = c1.concept_id
        JOIN CHCO_DeID_Oct2018.vocabulary v ON c1.vocabulary_id = v.vocabulary_id
      WHERE 
        c1.concept_name != "No matching concept" 
        AND c1.domain_id = "Condition"
      GROUP BY CONCEPT_ID, VOCABULARY_VERSION),

  measurement_concepts
  AS (SELECT
        m.measurement_concept_id AS CONCEPT_ID,
        v.vocabulary_version AS VOCABULARY_VERSION,
        COUNT(DISTINCT m.visit_occurrence_id) AS VISIT_COUNT,
        COUNT(DISTINCT m.person_id) AS PATIENT_COUNT
      FROM 
        CHCO_DeID_Oct2018.measurement m 
        JOIN CHCO_DeID_Oct2018.concept c ON m.measurement_concept_id = c.concept_id
        JOIN CHCO_DeID_Oct2018.vocabulary v ON c.vocabulary_id = v.vocabulary_id
      WHERE 
        c.concept_name != "No matching concept" 
        AND c.domain_id = "Measurement"
      GROUP BY CONCEPT_ID, VOCABULARY_VERSION),

  drug_concepts
  AS (SELECT
        d.drug_concept_id AS CONCEPT_ID,
        v.vocabulary_version AS VOCABULARY_VERSION,
        COUNT(DISTINCT d.visit_occurrence_id) AS VISIT_COUNT,
        COUNT(DISTINCT d.person_id) AS PATIENT_COUNT
      FROM 
        CHCO_DeID_Oct2018.drug_exposure d 
        JOIN CHCO_DeID_Oct2018.concept c ON d.drug_concept_id = c.concept_id
        JOIN CHCO_DeID_Oct2018.vocabulary v ON c.vocabulary_id = v.vocabulary_id
      WHERE 
        c.concept_name != "No matching concept" 
        AND c.domain_id = "Drug"
      GROUP BY CONCEPT_ID, VOCABULARY_VERSION)

SELECT *
FROM
  (SELECT CONCEPT_ID, VOCABULARY_VERSION, VISIT_COUNT, PATIENT_COUNT FROM condition_concepts
    UNION DISTINCT
   SELECT CONCEPT_ID, VOCABULARY_VERSION, VISIT_COUNT, PATIENT_COUNT FROM measurement_concepts
    UNION DISTINCT
   SELECT CONCEPT_ID, VOCABULARY_VERSION, VISIT_COUNT, PATIENT_COUNT FROM drug_concepts);


QUERY FILES:

SQLRender Query Templates:

callahantiff commented 4 years ago

@mgkahn - I'm not entirely sure I understand what's need to utilize SQLRender after reading the documentation.

I see how the library can be used to create generalizable SQL templates, but I'm not sure how this is applied to our query, which does not need to be templated since we are using OMOP. It could also be that the functions that we need are not available in their list of translatable functions (here).

I'm hoping we can talk about this during our meeting tomorrow.

mgkahn commented 4 years ago

SQL Render: At the very least, you will have parameters for the local names of the database and the database schema that will varying by site. Also, I suspect "UNION DISTINCT" is GBQ specific syntax that the SQL Render syntax will expand to the correct syntax. The rest of the syntax looks pretty vanilla to me.

mgkahn commented 4 years ago

Also. To minimize resources/file size -- You may want to remove fields in your final SQL that you can recreate when you get the site file back. So only concept_id. You don't need source_code, label, domain as you can get them back with just the concept_id.

callahantiff commented 4 years ago

SQL Render: At the very least, you will have parameters for the local names of the database and the database schema that will varying by site. Also, I suspect "UNION DISTINCT" is GBQ specific syntax that the SQL Render syntax will expand to the correct syntax. The rest of the syntax looks pretty vanilla to me.

Good call. I didn't think about different DB names. I will work on this script.

callahantiff commented 4 years ago

Also. To minimize resources/file size -- You may want to remove fields in your final SQL that you can recreate when you get the site file back. So only concept_id. You don't need source_code, label, domain as you can get them back with just the concept_id.

Great suggestion!

callahantiff commented 4 years ago

OK, the updated SQL is shown above. OK with this? I added an additional column to return the version

mgkahn commented 4 years ago

I had intended to ask this question before: Do you want to know if folks are putting non-standard codes into their concept_ids? I know this isn't OHDSI "kosher" but do you want to know if this is happening "in the real world"? If so, you would remove all of the standard_concept = 'S' clauses. Else if you only wanted to focus on the standard concepts, leave it in. My recommendation is to take standard_concept = 'S' out. You can alway impose that restriction on the set of concept_ids you get back on your side. And, it would give you insights into if people are "cheating" in actual practice.

callahantiff commented 4 years ago

I had intended to ask this question before: Do you want to know if folks are putting non-standard codes into their concept_ids? I know this isn't OHDSI "kosher" but do you want to know if this is happening "in the real world"? If so, you would remove all of the standard_concept = 'S' clauses. Else if you only wanted to focus on the standard concepts, leave it in. My recommendation is to take standard_concept = 'S' out. You can alway impose that restriction on the set of concept_ids you get back on your side. And, it would give you insights into if people are "cheating" in actual practice.

OK, I agree with removing the c1.standard_concept = "S" logic. I think getting a wider set and having the option to restrict it after the fact, like you mention, is a great idea. I will update all of the queries to match. The updated script for this issue is shown above. Any other changes you'd would like made or should I contact Andrew?

callahantiff commented 4 years ago

Good morning! @mgkahn - just circling back. Are you OK with this? Can I close this issue and reach out to our first contact to begin the coverage studies?

mgkahn commented 4 years ago

Yes. Remember to send out the SQLRender version, not the one here with the CU-AMC specific project information.

callahantiff commented 4 years ago

Absolutely. Closing this now. Thank you!