TuftsCTSI / TRDW.jl

This is a Pluto environment with FunSQL for use in Research Requests
Other
2 stars 0 forks source link

refactor how `table()` functions work, e.g. `drug_exposure()` #31

Closed clarkevans closed 7 months ago

clarkevans commented 9 months ago

When I started this project, I had the idea that I'd try to stick with OMOP columns. This was a bad decision, as you can see from the complexity of join_via_cohort found in linking.jl -- the bulk of this complexity is simply dealing with the variance of column naming.

I started a refactor by adding an event self-join to each table, e.g. drug_exposure().

drug_exposure(match...) = begin
    from(drug_exposure)
    $(length(match) == 0 ? @funsql(define()) : @funsql(filter(drug_matches($match))))
    left_join(concept => concept(),
              drug_concept_id == concept.concept_id, optional=true)
    left_join(type_concept => concept(),
              drug_type_concept_id == type_concept.concept_id, optional=true)
    left_join(route_concept => concept(),
              route_concept_id == route_concept.concept_id, optional=true)
    left_join(source_concept => concept(),
              drug_source_concept_id == source_concept.concept_id, optional=true)
    left_join(visit => visit_occurrence(),
        visit_occurrence_id == visit_occurrence.visit_occurrence_id, optional = true)
    left_join(provider => provider(),
              provider.provider_id == provider.provider_id, optional=true)
    join(event => begin
        from(drug_exposure)
        define(
            table_name => "drug_exposure",
            concept_id => drug_concept_id,
            end_date => drug_exposure_end_date,
            is_historical => drug_exposure_id > 1500000000,
            start_date => drug_exposure_start_date,
            source_concept_id => drug_source_concept_id)
    end, drug_exposure_id == event.drug_exposure_id, optional = true)
end

This would let me write functions generically, using event.start_date and such. However, this sucks, it's got a unnecessary join. So, I was wondering how to do this better. I tried the following...

@funsql drug_exposure() = begin
    from(drug_exposure)
    define(event.id => drug_exposure_id,
           event.kind => drug_exposure,
           event.start_date => drug_exposure_start_date)
end

or, perhaps

@funsql drug_exposure() = begin
    from(drug_exposure)
    define(event =>
       define(id => drug_exposure_id,
              kind => drug_exposure,
              start_date => drug_exposure_start_date))
end

Then I could write

@query drug_exposure().limit(1).select(event.id, event.start_date)

However, these don't seem to work and they create error looking up event. The following works, but it's tedious...

@funsql test_drug() = begin
    from(drug_exposure)
    define(id => drug_exposure_id,
           kind => "drug_exposure",
           start_date => drug_exposure_start_date)
    as(event)
    define(
        drug_exposure_id => base.drug_exposure_id,
        person_id => base.person_id,
        drug_concept_id => base.drug_concept_id,
        drug_exposure_start_date => base.drug_exposure_start_date,
        drug_exposure_start_datetime => base.drug_exposure_start_datetime,
        drug_exposure_end_date => base.drug_exposure_end_date,
        drug_exposure_end_datetime => base.drug_exposure_end_datetime,
        drug_type_concept_id => base.drug_type_concept_id,
        stop_reason => base.stop_reason,
        refills => base.refills,
        quantity => base.quantity,
        days_supply => base.days_supply,
        sig => base.sig,
        route_concept_id => base.route_concept_id,
        lot_number => base.lot_number,
        provider_id => base.provider_id,
        visit_occurrence_id => base.visit_occurrence_id,
        visit_detail_id => base.visit_detail_id,
        drug_source_value => base.drug_source_value,
        drug_source_concept_id => base.drug_source_concept_id,
        route_source_value => base.route_source_value,
        dose_unit_source_value => base.dose_unit_source_value)
end

So this ticket is either to improve FunSQL to make nested definitions work, or to redo everything using the second form (but it's more brittle).

xitology commented 9 months ago

This could be done with a lateral join, like this:

@funsql begin
    from(drug_exposure)
    cross_join(
        event => begin
            define(
                id => :ID,
                kind => "drug_exposure",
                start_date => :START_DATE)
            bind(
                :ID => drug_exposure_id,
                :START_DATE => drug_exposure_start_date)
        end)
end

Databricks should be able to optimize this join away.

This join could be wrapped in a definition

@funsql create_event(; id, kind, start_date) =
    cross_join(
        event => begin
            define(
                id => :ID,
                kind => $kind,
                start_date => :START_DATE)
            bind(
                :ID => $id,
                :START_DATE => $start_date)
        end)

so that the query could be written simply as

@funsql begin
    from(drug_exposure)
    create_event(id = drug_exposure_id, kind = "drug_exposure", start_date = drug_exposure_start_date)
end
clarkevans commented 9 months ago

Would you be able to make this change? We could then start migrating some of the functions to use the more generic approach rather than using prefixes.

xitology commented 9 months ago

Have you considered simply renaming the columns, stripping the table name from the names of columns?

clarkevans commented 9 months ago

It's tempting. However, when displaying tables it would no longer match the OMOP schema. Then I'd have to guess what changed and what is different. Initially I even started with an "condition()" function but ended up going with the more verbose "condition_occurrence()" just to stick with the standard. I was thinking that an "event" define that is silent unless it is used would be a very good compromise, it wouldn't change the names but would permit us to write generic functions.

xitology commented 9 months ago

Speaking of conditions and condition occurrences, there is also a material difference between the two, and both could be employed in the same query. In fact, instead of calling the link corresponding to condition_concept_id a concept, it could be called condition:

from(condition_occurrence)
join(person => from(person), person_id == person.person_id)
join(condition => from(concept), condition_concept_id == condition.concept_id)
...
xitology commented 9 months ago

There is one other limitation of OMOP names: there is no good way to represent a UNION of different types of events. Such unions are occasionally appear in eCQMs, although I don't know if we will need them. Say, for instance, smoking can be represented both as a observation and a condition, how would you show a list of all smoking related events to the researcher?

xitology commented 9 months ago

If we go with event, what fields should it contain?