OHDSI / FeatureExtraction

An R package for generating features (covariates) for a cohort using data in the Common Data Model.
http://ohdsi.github.io/FeatureExtraction/
60 stars 58 forks source link

FeatureExtraction2 - Support place_of_service_concept_id #33

Closed gowthamrao closed 4 years ago

gowthamrao commented 6 years ago

@schuemie

place_of_service_concept_id is on the care_site table. Visit_occurrence and visit_detail may be joined to care_site using FK care_site_id. Condition_occurrence, procedure_occurrence, drug_occurrence may be joined to care_site thru visit_occurrence table.

Use-case: counts at home, emergency room, in-patient, out-patient, doctor's office. These are either concepts or concept-sets/covariates of place_of_service_concept_id.

I think using DomainConcept.sql would make it complex - e.g. in DomainConcept.sql we currently have

    FROM @cohort_table cohort
    INNER JOIN @cdm_database_schema.@domain_table
        ON cohort.subject_id = @domain_table.person_id

when @domain_table is visit_occurrence it needs to become something like

    FROM @cohort_table cohort
    INNER JOIN @cdm_database_schema.@domain_table
        ON cohort.subject_id = @domain_table.person_id
        INNER JOIN @cdm_database_schema.care_site
               ON @domain_table.care_site_id = care_site.care_site_id

when @domain_table is condition_occurrence it needs to become something like

    FROM @cohort_table cohort
    INNER JOIN @cdm_database_schema.@domain_table
        ON cohort.subject_id = @domain_table.person_id
        INNER JOIN @cdm_database_schema.visit_occurrence
               ON @domain_table.visit_occurrence_id = visit_occurrence.visit_occurrence_id
        INNER JOIN @cdm_database_schema.care_site
               ON visit_occurrence.care_site_id = care_site.care_site_id

What would be the better way of handling place_of_service_concept_id?

schuemie commented 6 years ago

Hi @gowthamrao ,

Just trying to understand what the feature is that you'd like to construct. Do you want to know the unique list of places of service where a person has received care? Or do you want one list where they got their drugs, another where they were diagnosed? Or do you want a list for every drug-place combination, every visit-place combination, etc.?

gowthamrao commented 6 years ago

@schuemie

Do you want to know the unique list of places of service where a person has received care?

we want count of visit records per place of service e.g. number of unique visits per person to doctor's office

Or do you want one list where they got their drugs, another where they were diagnosed?

yes - number of records in drug_occurrence at doctor's office. e.g. number of unique visits to doctors office per person where chemotherapy was administered or number of records in visit_occurrence where primary diagnosis was 'influenza' or number of unique visits to with inpatient place of service per person where primary diagnosis was 'chest pain' and procedure was 'cardiac catheterization'

Or do you want a list for every drug-place combination, every visit-place combination, etc.?

no

gowthamrao commented 6 years ago

@schuemie - i was thinking of something like below, where @place_of_service will be a new optional parameter in R-function call of FeatureExtraction2.

ConceptCount.sql

SELECT subject_id,
    cohort_start_date,
{@temporal} ? {
    time_id,
}
    concept_count
INTO #raw_data
} : {
SELECT 1000 + @analysis_id AS covariate_id,
{@temporal} ? {
    time_id,
}
    row_id,
    concept_count AS covariate_value
INTO @covariate_table
}
FROM (
    SELECT
{@temporal} ? {
        time_id,
}
{@aggregated} ? {
        subject_id,
        cohort_start_date,
} : {
        cohort.@row_id_field AS row_id,
}
{@sub_type == 'distinct'} ? {
        COUNT(DISTINCT @domain_concept_id) AS concept_count
} : {
        COUNT(*) AS concept_count
}
    FROM @cohort_table cohort
    INNER JOIN @cdm_database_schema.@domain_table
        ON cohort.subject_id = @domain_table.person_id
{@place_of_service} ? {
    {@domain_table == 'visit_occurrence'} ? {
        INNER JOIN @cdm_database_schema.care_site care_site
             ON @domain_table.care_site_id = care_site.care_site_id
    }
    {@domain_table %in% ('condition_occurrence', 'device_exposure', 'drug_exposure', 'measurement','observation','procedure_occurrence')} ? {
        INNER JOIN @cdm_database_schema.visit_occurrence visit_occurrence
             ON @domain_table.visit_occurrence_id = visit_occurrence.visit_occurrence_id
        INNER JOIN @cdm_database_schema.care_site care_site
             ON visit_occurrence.care_site_id = care_site.care_site_id
    }
}
{@temporal} ? {
    INNER JOIN #time_period time_period
        ON @domain_start_date <= DATEADD(DAY, time_period.end_day, cohort.cohort_start_date)
        AND @domain_end_date >= DATEADD(DAY, time_period.start_day, cohort.cohort_start_date)
    WHERE @domain_concept_id != 0
} : {
    WHERE @domain_start_date <= DATEADD(DAY, @end_day, cohort.cohort_start_date)
        AND @domain_end_date >= DATEADD(DAY, @start_day, cohort.cohort_start_date)
        AND @domain_concept_id != 0
}
{@excluded_concept_table != ''} ? {
        AND @domain_concept_id NOT IN (SELECT id FROM @excluded_concept_table)
        {@place_of_service} ? {
            AND care_site.place_of_service_concept_id NOT IN (SELECT id from @excluded_concept_table)
        }
}
{@included_concept_table != ''} ? {
        AND @domain_concept_id IN (SELECT id FROM @included_concept_table)
        {@place_of_service} ? {
            AND care_site.place_of_service_concept_id IN (SELECT id from @included_concept_table)
        }
}
{@cohort_definition_id != -1} ? {       AND cohort.cohort_definition_id = @cohort_definition_id}
    GROUP BY
{@temporal} ? {
        time_id,
}
{@aggregated} ? {
        subject_id,
        cohort_start_date
} : {
        cohort.@row_id_field
}
    ) raw_data;
schuemie commented 6 years ago

Just checking: do you really mean place_of_service_concept_id, or are you actually thinking of visit_concept_id? The first can have many values, like:

concept_id concept_name
8782 Urgent Care Facility
8809 Independent Laboratory
8827 Custodial Care Facility
8844 Other Place of Service
8850 Ambulance - Air or Water
8851 Group Home
8858 Mass Immunization Center

The second can have only five different values:

concept_id concept_name
9201 Inpatient Visit
42898160 Long Term Care Visit
9203 Emergency Room Visit
9202 Outpatient Visit
262 Emergency Room and Inpatient Visit
gowthamrao commented 6 years ago

Definitely place_of_service_concept_id. visit_concept_id is a derived field that is calculated during the ETL process, most source data have place_of_service_concept_id. http://www.ohdsi.org/web/atlas/#/conceptset/1655609/details

e.g. visit_concept_id has 'Outpatient Visit', but it includes many place_of_service_concept_id's like

Concept ID Concept Code Concept Name Standard Concept
8883 24 Ambulatory Surgical Center S
8716 49 Independent Clinic S
8940 11 Office S
8977 71 Public Health Clinic S
8761 72 Rural Health Clinic S
8782 20 Urgent Care Facility S
38003620 17 Walk---in Retail Health Clinic S

using just visit_concept_id is lossy. e.g. we want to separately calculate 8883, and 8940, and combination of 8863 (Skilled Nursing Facility) & 8676 (Nursing Facility). We want to calculate at both concept_id level and combination of concept_id's

gowthamrao commented 6 years ago

Use case -- if we are predicting admission to hospital outcome for a cohort of diabetics, we would like to use the feature - number of 8940 place of service visit occurrence in long-term period prior to cohort entry.

davekern commented 6 years ago

I'll second @gowthamrao suggestion. Definitely would be useful to have by specific place_of_service_concept_id, especially for the outpatient category in visit_concept_id. And, if possible, to create an "all other outpatient" category based on those that haven't been broken out on their own. For example if I want to look at Office visits (8940) and Nursing facility stays (8863 & 8676) I'd also want to see all other outpatient services not included in either of those buckets.

schuemie commented 6 years ago

@gowthamrao : do you want to take a stab at implementing the proposed changes? I've just brought the develop branch up-to-date, we can land your pull-requests there and work on debugging the code.

schuemie commented 6 years ago

Hi @gowthamrao . Sorry, took me a while to make time to take a thorough look at your code. I have some questions and comments:

In general:

I thought the idea was to have counts per place_of_service_concept_id? Currently this code is just counting visits that have some (combination of) type of records associated with them.

Also, there are currently 128 new analyses which each generate exactly 1 covariate ID. Our analysis ID space is limited to 3 digits while our covariate ID space is virtually limitless. Maybe we could group some of these analyses?

More specific:

What is the difference between analysis 120 (VisitOccurrenceRecordCountsLongTerm) and analysis 917 (VisitCountLongTerm)?

I'm having trouble understand what the purpose of analysis 121 (VisitOccurrenceRecordCountsByPosLongTerm) is. What the code does is count the number of visits in the window that are associated with care sites that do not have care_site_id = 0. What is the significance of care_site_id = 0?

Similarly, what is the purpose of analysis 122 (VisitOccurrenceRecordCountsByCoLongTerm)? This counts the number of visits in the window that are associated with a condition_occurrence, where the visit_occurrence_id is not 0. visit_occurrence_id is an auto-increment number. What is the significance of visit_occurrence_id = 0?

Comments and questions specific to RecordCounts.sql:

line 141: please don't hardcode analysis IDs. What if someone want to use different analysis IDs? This is precisely what the sub_type field is for (which you didn't use).

line 142: @domain_table_id only works because @domain_table is replaced with the table name. This is a bit confusing, because it suggests there is a variable called @domain_table_id. Why not simply count the number of records using COUNT(*)?

line 240: why would @domain_table_id be 0? (It is an auto-increment number). What is the significance of @domain_table_id = 0? Do you mean @domain_concept_id?

line 246: Here's the only difference I see between analysis 120 and 917: analysis 917 includes visits where the end date is after the window start, even though the visit start might be before the window start. In contrast, analysis 120 only considered visits that start in the window. Is the distinction important? If it is, this is something to explicitly mention in the descriptions.

Line 369-382: Having just 2 different covariate names for 128 different covariates might not be very informative.

gowthamrao commented 6 years ago

Can you help us understand the concept of covariates?

schuemie commented 6 years ago

Ok, let me give it a shot:

First, note that I tend to use the words 'covariate', 'feature', and 'baseline characteristic' interchangeably.

An analysis is defined as a process that creates one or more similar covariates. For example, prespecified analysis 102 creates a separate covariate for each condition concept encountered in the long-term window (default is past 365 days).

A covariate is a variable that will have one value for each entry in the cohort table, and that value can be different for each cohort entry. One such covariate is whether the concept 'fever' (concept ID 437663) has been encountered in the long term window. In analysis 102, we construct a unique covariate ID by concatenating the concept ID to the analysis ID, so this covariate will have ID 437663102. In this case, this covariate is binary; it can only have the value 0 (concept not found in window) or 1 (concept found in window).

So if I think about your analysis 121 (VisitOccurrenceRecordCountsByPosLongTerm), I would expect it to generate one covariate per unique place_of_service_concept_id.