Refactor ratio stats Glue job to dbt python model

dfsnow commented 2 months ago

This PR migrates the ratio_stats Glue job to a dbt Python model that uses Athena's Spark integration as the Python runtime. There are some tradeoffs here:

Pros

Vastly simplifies the Glue job code (no need for polling, context setup, etc)
Integrates the ratio_stats table with the DAG, meaning it will get rebuild on upstream changes AND it works with dbt profiles (so changes can first be built in CI buckets)
Very easy to add further Python models now

Cons

Needs external packages. We'll have to setup zip files of packages in order to use third-party stuff.
Cannot source from views. Athena Spark pulls from the files rather than SQL, so it doesn't work with views at all.
Very new integration. Support was only added in the most recent version of dbt-athena.
Hard to debug
Ugly linting

IMO the pros are worth it here, but it's probably worth it to merge this and keep it up as a test for a month or so before integrating it more deeply.

dfsnow commented 2 months ago

@jeancochrane Hit another small roadblock on this after a bunch of permissions debugging. Seems like spark can't write tables with a dash in the name. Not an issue in prod, but doesn't work with our kebab-case name spacing schema. Do you think it's worth changing the macro to force everything to snake_case?

jeancochrane commented 1 month ago

@jeancochrane Hit another small roadblock on this after a bunch of permissions debugging. Seems like spark can't write tables with a dash in the name. Not an issue in prod, but doesn't work with our kebab-case name spacing schema. Do you think it's worth changing the macro to force everything to snake_case?

@dfsnow It's a little bit annoying to have to change our naming scheme, but there's not actually a huge difference between the hyphenated vs. underscored versions so I don't mind. Were you able to find documentation confirming this limitation with Athena PySpark? I couldn't find anything so I'm wondering if the root cause is actually an error in the underlying plugin implementation (e.g. not using backticks to escape special characters).

dfsnow commented 1 month ago

@jeancochrane Hit another small roadblock on this after a bunch of permissions debugging. Seems like spark can't write tables with a dash in the name. Not an issue in prod, but doesn't work with our kebab-case name spacing schema. Do you think it's worth changing the macro to force everything to snake_case?

@dfsnow It's a little bit annoying to have to change our naming scheme, but there's not actually a huge difference between the hyphenated vs. underscored versions so I don't mind. Were you able to find documentation confirming this limitation with Athena PySpark? I couldn't find anything so I'm wondering if the root cause is actually an error in the underlying plugin implementation (e.g. not using backticks to escape special characters).

Sadly, there are official docs supporting this. I also forked dbt-athena and tested various backtick configurations with the fork, none of them worked. I think basically only the z_ci_ DBs ever even use hyphens, so it's not a major change. But yes, annoying.

dfsnow commented 1 month ago

@jeancochrane @wrridgeway This could probably use one more look before merging. I'll post the diff between the current reporting.ratio_stats and the new version tomorrow.

dfsnow commented 1 month ago

A quick diff shows that the old ratio_stats and new one are basically identical but for the following differences:

Bootstrapping is not deterministic and therefore all _ci columns have different values
The mki column has different values due to https://github.com/ccao-data/assesspy/pull/19
detect_chasing is removed in the new version (replaced with False), due to reasons I outlined earlier. It's not a reliable measurement of chasing and it doesn't work in Spark for some reason
The float columns have a slightly different precision between versions. The new version records more digits

All that said, I think this is ready to merge. I've kept it as close as I can to the original ratio_stats table. FYI @ccao-jardine

wrridgeway commented 1 month ago

Assuming testing comes back fine, looks good to me, though this all begs the question: is there a point to having the input table exist rather than just being a pull in the python script to generate stats any longer if both processes are handled by dbt on a daily basis and the input table serves no function outside feeding the script? It probably didn't need to exist in the first place.

This seems similar to EI issues where we have a separate SQL script and read it in as a pull in the R script.

dfsnow commented 1 month ago

Assuming testing comes back fine, looks good to me, though this all begs the question: is there a point to having the input table exist rather than just being a pull in the python script to generate stats any longer if both processes are handled by dbt on a daily basis and the input table serves no function outside feeding the script? It probably didn't need to exist in the first place. Just a thought.

Probably not tbh, we can likely ditch the input table. However, I'm going to punt that to a follow-up issue since I want to get this merged.

ccao-data / data-architecture

Refactor ratio stats Glue job to dbt python model #422

Pros

Cons