coderxio / sagerx

Open drug data pipelines curated by pharmacists.
https://coderx.io/sagerx

Optimize RxNorm #270

Closed jrlegrand closed 2 months ago

jrlegrand commented 3 months ago

Problem Statement

Querying RxNorm sometimes crashes Postgres because of how inefficiently this DAG (one of our first) was implemented; it was never updated.

This is the error:

ERROR:  could not resize shared memory segment "/PostgreSQL.1233899772" to 2097152 bytes: No space left on device
SQL state: 53100

Criteria for Success

We should be able to run queries like the one below on the production server without crashing it. Honestly, you might try running a more complex query, but even simple ones like the one below can crash it.

I think there are at least 2 things that can make a big difference

  1. Add indexes to the raw RxNorm data tables. See the file in the RxNorm files scripts folder for the indexes RxNorm adds when creating a MySQL database. There's also an example for Oracle.

This is what's inside of that indexes_mysql_rxn file:

CREATE INDEX X_RXNCONSO_STR ON RXNCONSO(STR);
CREATE INDEX X_RXNCONSO_RXCUI ON RXNCONSO(RXCUI);
CREATE INDEX X_RXNCONSO_TTY ON RXNCONSO(TTY);
CREATE INDEX X_RXNCONSO_CODE ON RXNCONSO(CODE);

CREATE INDEX X_RXNSAT_RXCUI ON RXNSAT(RXCUI);
CREATE INDEX X_RXNSAT_ATV ON RXNSAT(ATV);
CREATE INDEX X_RXNSAT_ATN ON RXNSAT(ATN);

CREATE INDEX X_RXNREL_RXCUI1 ON RXNREL(RXCUI1);
CREATE INDEX X_RXNREL_RXCUI2 ON RXNREL(RXCUI2);
CREATE INDEX X_RXNREL_RELA ON RXNREL(RELA);

CREATE INDEX X_RXNATOMARCHIVE_RXAUI ON RXNATOMARCHIVE(RXAUI);
CREATE INDEX X_RXNATOMARCHIVE_RXCUI ON RXNATOMARCHIVE(RXCUI);
CREATE INDEX X_RXNATOMARCHIVE_MERGED_TO ON RXNATOMARCHIVE(MERGED_TO_RXCUI);
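
Translating that MySQL file to Postgres is mostly mechanical. A sketch, assuming the raw tables are loaded with lowercase names into whatever schema the DAG already targets (adjust schema/table names to match the actual load step):

```sql
-- Sketch: Postgres equivalents of indexes_mysql_rxn.
-- Lowercase table names are an assumption; adjust to match the raw load.
CREATE INDEX IF NOT EXISTS x_rxnconso_str   ON rxnconso (str);
CREATE INDEX IF NOT EXISTS x_rxnconso_rxcui ON rxnconso (rxcui);
CREATE INDEX IF NOT EXISTS x_rxnconso_tty   ON rxnconso (tty);
CREATE INDEX IF NOT EXISTS x_rxnconso_code  ON rxnconso (code);

CREATE INDEX IF NOT EXISTS x_rxnsat_rxcui ON rxnsat (rxcui);
CREATE INDEX IF NOT EXISTS x_rxnsat_atv   ON rxnsat (atv);
CREATE INDEX IF NOT EXISTS x_rxnsat_atn   ON rxnsat (atn);

CREATE INDEX IF NOT EXISTS x_rxnrel_rxcui1 ON rxnrel (rxcui1);
CREATE INDEX IF NOT EXISTS x_rxnrel_rxcui2 ON rxnrel (rxcui2);
CREATE INDEX IF NOT EXISTS x_rxnrel_rela   ON rxnrel (rela);

CREATE INDEX IF NOT EXISTS x_rxnatomarchive_rxaui     ON rxnatomarchive (rxaui);
CREATE INDEX IF NOT EXISTS x_rxnatomarchive_rxcui     ON rxnatomarchive (rxcui);
CREATE INDEX IF NOT EXISTS x_rxnatomarchive_merged_to ON rxnatomarchive (merged_to_rxcui);

ANALYZE;  -- refresh planner statistics so the new indexes get used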
  2. Re-write the dbt DAG SQL to actually use dbt. We re-write a lot of SQL that should instead be a model that other tables reference. Also, I doubt we are referencing the data source files correctly (the dbt way, with source()) - I don't think this will improve efficiency, but it might.
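
Referencing the raw files "the dbt way" looks roughly like this (a sketch; the source, schema, and file names below are illustrative, not necessarily what the repo uses):

```yaml
# models/staging/rxnorm/_rxnorm__sources.yml (illustrative names)
version: 2
sources:
  - name: rxnorm
    schema: rxnorm        # wherever the DAG loads the raw RRF files
    tables:
      - name: rxnconso
      - name: rxnrel
      - name: rxnsat
```

```sql
-- models/staging/rxnorm/stg_rxnorm__rxnconso.sql (illustrative)
select rxcui, rxaui, tty, str, code
from {{ source('rxnorm', 'rxnconso') }}
```

With source() in place, dbt builds a correct lineage graph, and downstream models can ref() the staging model instead of re-writing the same SQL.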

Additional Information

This is the innocuous query that either breaks Postgres or takes 11+ minutes to run before I give up on it:

select
*
from intermediate.int_rxnorm_clinical_products_to_ingredient_strengths
where clinical_product_rxcui = '996520'
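
When the query does complete, EXPLAIN can show where the time goes (e.g. sequential scans over the raw tables that the proposed indexes would eliminate):

```sql
EXPLAIN (ANALYZE, BUFFERS)
select *
from intermediate.int_rxnorm_clinical_products_to_ingredient_strengths
where clinical_product_rxcui = '996520';
```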


lprzychodzien commented 2 months ago

Materializing the intermediate steps as tables makes the query run in milliseconds.
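
In dbt that is a one-line config at the top of the model file (a sketch; only the config line is new, the model SQL stays as-is):

```sql
-- int_rxnorm_clinical_products_to_ingredient_strengths.sql
{{ config(materialized='table') }}

-- ...existing model SQL unchanged...
```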

lprzychodzien commented 2 months ago

Based on some reading, it feels like having these intermediates materialized as tables makes sense. Reference

> Marts level models will often require you to perform aggregation on large quantities of data. Those aggregation queries often scale out of proportion as your project grows since more data needs to be loaded in memory to perform those aggregations each time new data is added to a source.
>
> Bringing this aggregation into a separated intermediate model is a good solution, since you will be able to process the latest data incrementally, something you wouldn't have been able to do as part of a CTE.

leemlb06pmi commented 2 months ago

Materializing that view as a table is definitely a good solution, but it doesn't have any broader impact beyond this one table. I think we should implement this as well as the indexes recommended in the DB outline that Joey included above. With the indexes in place, we would improve many of the upstream views and tables.

With just the indexes, the query runs in 24 seconds. With the indexes AND the model materialized as a table, it runs in 89 milliseconds.

Also, do we want to materialize ALL of the intermediate tables, or just this one? Everything will be improved by the indexes, but materializing the tables will make the DAGs slower, since building a table is slower than creating a view.
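
If we do decide to materialize all of the intermediate models, that can be set once at the folder level instead of per file (a sketch; the project and folder names are assumptions):

```yaml
# dbt_project.yml (project/folder names are assumptions)
models:
  sagerx:
    intermediate:
      +materialized: table   # individual models can still override via {{ config(...) }}
```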

leemlb06pmi commented 2 months ago

We've finalized the changes required to implement the Postgres indexes - I'm going to create a branch in a second and associate it here.

There were more changes than anticipated, including adding our own airflow.cfg file (recommended by Airflow for production setups), where we set use_row_level_locking to False. This allows increased task concurrency but could introduce task conflicts down the line; it was necessary to circumvent the database deadlocks that were occurring previously.
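
For reference, the relevant setting lives under the scheduler section of airflow.cfg (in current Airflow versions the option is use_row_level_locking, which defaults to True):

```ini
[scheduler]
# Disabling row-level locking increases task concurrency, but removes a
# safeguard against concurrent scheduler updates to the same rows.
use_row_level_locking = False
```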

leemlb06pmi commented 2 months ago

Branch/PR created ^^ - I recommend we test this branch out for a bit before merging, due to the change in the Airflow config (if we decide to merge at all).

lprzychodzien commented 2 months ago

> broader impact beyond this one table

That is incorrect: this change allows all intermediate models to be built as tables, so it has the widest-ranging impact.

> materializing the tables will make the DAGs slower since that process is slower than doing the views.

Correct, but I don't think that is an issue, since we are moving heavy computation upstream. Ideally we would have RxNorm marts that capture all computation upfront.

> everything will be improved by the indexes

Yes, I agree that we should index where possible to improve performance. Indexes need to be thought out carefully to ensure the correct columns are used.
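
One way to sanity-check those choices after the fact is Postgres's statistics view, which shows whether each index is actually being scanned:

```sql
-- Indexes that still show idx_scan = 0 after real usage are candidates
-- for removal; table names here assume a lowercase raw load.
SELECT relname, indexrelname, idx_scan
FROM pg_stat_user_indexes
WHERE relname IN ('rxnconso', 'rxnsat', 'rxnrel', 'rxnatomarchive')
ORDER BY idx_scan DESC;
```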