coderxio / sagerx

Open drug data pipelines curated by pharmacists.
https://coderx.io/sagerx
Other
41 stars 11 forks source link

Rxnorm dbt update and intermediate tables #280

Open lprzychodzien opened 2 months ago

lprzychodzien commented 2 months ago

Resolves #270

Explanation

Changes rxnorm staging and intermediate queries to using dbt's jinja table references.

Most important is that it sets intermediate models to materialize as tables (vs views).

Rationale

dbt mart models are made up of complex logic that should be captured in the intermediate models. The issues already that we ran into (issue #270 ) was that performing these aggregations for each query took 10+ minutes. Therefore, bringing this aggregation into a separated intermediate model to be materialized as a table is a good solution. The materialization of these intermediate tables can take awhile but will significantly speed up queries.

Additional work can be done to optimize the queries that build out intermediate models in the future.

Tests

  1. What testing did you do? dbt run --full-refresh
  2. Attach testing logs inside a summary block:
testing logs ``` ```
jrlegrand commented 2 months ago

Man you did tons of work here. Great job. Fundamental question themes:

  1. Should we title ctes specifically to what they represent even if we are just pulling in rxnconso multiple times? Or should we have one rxnconso cte per model and use aliases and leave everything else the same? Or should we name them what they represent and pull the where clause into the cte? Or should we make like other "base" staging tables with them? Or should we just look for opportunities to reference these staging models from other int/stg models instead of going back to source all the time?
  2. I'm def referencing source tables from a lot of intermediate models - shame on me. I need to fix this.
  3. Minor formatting changes still needed to satisfy my OCD (lowercase SQL commands and tab indent in some places).
  4. Some ref's still in select from instead of within a cte first.
  5. Maybe we should build a dbt macro for the active / prescribable columns - prob in a separate issue.

Another major thing I will look at is taking a 50,000 foot view of the entire RxNorm data model - it's hard to see when zoomed in on a given intermediate model, but I think I re-wrote a lot of the code in other intermediate models inside of intermediate models. So I think there's an opportunity to ref intermediate models instead of re-writing that SQL.