Closed jrlegrand closed 2 months ago
Making the intermediate steps materialize as tables makes the query run within milliseconds.
Based on some reading, it feels like having these intermediates materialized as tables makes sense. Reference
`Marts level models will often require you to perform aggregation on large quantities of data. Those aggregation queries often scale out of proportion as your project grows since more data needs to be loaded in memory to perform those aggregations each time new data is added to a source.
Bringing this aggregation into a separated intermediate model is a good solution, since you will be able to process the latest data incrementally, something you wouldn’t have been able to do as part of a CTE.`
Materializing that view as a table is definitely a good solution, but it doesn't really have any broader impact beyond this one table.. I think we should consider implementing this as well as the indexes that were recommended in the DB outline that Joey included above. With the indexes in place, we would have a positive impact on many of the upstream views and tables.
With just the indexes, the query runs in 24 seconds. With the indexes AND the model materialized as a table, the query runs in 89 miliseconds.
Also, do we want to materialize ALL the INT tables? or just this one..? everything will be improved by the indexes, but materializing the tables will make the DAGs slower since that process is slower than doing the views.
We've finalized the changes required to implement the postgres indexes - I'm going to create a branch here in a second and associate it here
There were more changes than anticipated, including implementing our own airflow.cfg file (which is recommended by airflow for production setups) where we change a parameter row_level_locking to FALSE which allows for increased task concurrency, but could possibly introduce some task conflicts down the line. This was necessary to circumvent database deadlocks that were occurring previously.
branch/PR created ^^ recommend we test this branch out for a bit before merging due to the change in the airflow config (if we decide to merge at all)
broader impact beyond this one table
that is incorrect, this change will allow all intermediate model to be built as tables, so it has a widest ranging impact.
materializing the tables will make the DAGs slower since that process is slower than doing the views.
Correct, but I dont think that is an issue since we are moving heavy computation upstream. Ideally we would have rxnorm marts that completely capture all computation upfront.
everything will be improved by the indexes
Yes i agree that we should index where possible to improve performance. Indexes need to be thought out carefully to ensure the correct columns are used.
Problem Statement
Querying RxNorm sometimes crashes postgres b/c of how inefficiently this (one of our first) DAGs was implemented and never updated.
This is the error:
Criteria for Success
Should be able to run queries like the one in the screenshot below on the production server without it crashing. Honestly, you might try running a more complex query - but even simple ones like below can crash it.
I think there are at least 2 things that can make a big difference
This is what's inside of that
indexes_mysql_rxn
file:source
) - I don't think this will improve efficiency but it might?Additional Information
This is the innocuous query that either breaks Postgres or takes 11+ minutes to run and then I give up on it: