ccao-data / data-architecture

Codebase for CCAO data infrastructure construction and management
https://ccao-data.github.io/data-architecture/
5 stars 3 forks source link

Refactor `ratio_stats` job to use Pyspark #436

Open dfsnow opened 1 month ago

dfsnow commented 1 month ago

The recently merged #422 has a Python dbt model (ratio_stats.py) that runs on Athena's Spark backend. The model almost exclusively uses pandas for data munging and processing. This works well and is simple, but misses out on some of the benefits of using Spark (parallelization). We should try a quick refactor of the ratio_stats model using PySpark code to see if we can gain some of the benefits of Spark. Mainly, the current Pandas job takes 1 hour to finish, while the Spark version is likely to be much faster.

We can also make a few other enhancements here at the same time. Namely:

These will need input from @ccao-jardine and @wrridgeway.

dfsnow commented 3 weeks ago

@wagnerlmichael This one is yours now. Let's use it to pilot use of Spark models within dbt, since we may want to convert sales val, source-of-truth, etc to Spark. Let's also take this opportunity to clean up the ratio_stats table a little bit (get the dtypes corrected, drop extraneous columns, etc.).

ccao-jardine commented 2 weeks ago

ratio_stats is used in production for our public-facing ratio study dashboards, which are published for mailed stage each reassessed township when it mails.

This is one dashboard serving all townships, with an extract of the ratio_stats table that is refreshed with each 2024 township mailing. Because of that I'd very strongly prefer to not make any changes to the production table until after we have mailed the last tri town this year.

If changes must be made now because it's blocking other work, please sequence with me on schedule so that changes aren't pushed close to a town mail date.

If it helps, the current structure of the reporting depends on no changes (data type, etc.) to the following columns in the production table:

This table is filtered to geography_type = "Town", so if other types are added, it should be robust to those changes.

Which extraneous columns are you thinking of getting rid of?

dfsnow commented 2 weeks ago

Got it. @wagnerlmichael don't mess with any of the column dtypes. We'll move the cleanup stuff to a separate issue.