ccao-data / data-architecture

Codebase for CCAO data infrastructure construction and management
https://ccao-data.github.io/data-architecture/
5 stars 3 forks source link

Refactor ratio stats Glue job to dbt python model #422

Closed dfsnow closed 1 month ago

dfsnow commented 2 months ago

This PR migrates the ratio_stats Glue job to a dbt Python model that uses Athena's Spark integration as the Python runtime. There are some tradeoffs here:

Pros

Cons

IMO the pros are worth it here, but it's probably worth it to merge this and keep it up as a test for a month or so before integrating it more deeply.

dfsnow commented 2 months ago

@jeancochrane Hit another small roadblock on this after a bunch of permissions debugging. Seems like spark can't write tables with a dash in the name. Not an issue in prod, but doesn't work with our kebab-case name spacing schema. Do you think it's worth changing the macro to force everything to snake_case?

jeancochrane commented 1 month ago

@jeancochrane Hit another small roadblock on this after a bunch of permissions debugging. Seems like spark can't write tables with a dash in the name. Not an issue in prod, but doesn't work with our kebab-case name spacing schema. Do you think it's worth changing the macro to force everything to snake_case?

@dfsnow It's a little bit annoying to have to change our naming scheme, but there's not actually a huge difference between the hyphenated vs. underscored versions so I don't mind. Were you able to find documentation confirming this limitation with Athena PySpark? I couldn't find anything so I'm wondering if the root cause is actually an error in the underlying plugin implementation (e.g. not using backticks to escape special characters).

dfsnow commented 1 month ago

@jeancochrane Hit another small roadblock on this after a bunch of permissions debugging. Seems like spark can't write tables with a dash in the name. Not an issue in prod, but doesn't work with our kebab-case name spacing schema. Do you think it's worth changing the macro to force everything to snake_case?

@dfsnow It's a little bit annoying to have to change our naming scheme, but there's not actually a huge difference between the hyphenated vs. underscored versions so I don't mind. Were you able to find documentation confirming this limitation with Athena PySpark? I couldn't find anything so I'm wondering if the root cause is actually an error in the underlying plugin implementation (e.g. not using backticks to escape special characters).

Sadly, there are official docs supporting this. I also forked dbt-athena and tested various backtick configurations with the fork, none of them worked. I think basically only the z_ci_ DBs ever even use hyphens, so it's not a major change. But yes, annoying.

dfsnow commented 1 month ago

@jeancochrane @wrridgeway This could probably use one more look before merging. I'll post the diff between the current reporting.ratio_stats and the new version tomorrow.

dfsnow commented 1 month ago

A quick diff shows that the old ratio_stats and new one are basically identical but for the following differences:

All that said, I think this is ready to merge. I've kept it as close as I can to the original ratio_stats table. FYI @ccao-jardine

wrridgeway commented 1 month ago

Assuming testing comes back fine, looks good to me, though this all begs the question: is there a point to having the input table exist rather than just being a pull in the python script to generate stats any longer if both processes are handled by dbt on a daily basis and the input table serves no function outside feeding the script? It probably didn't need to exist in the first place.

This seems similar to EI issues where we have a separate SQL script and read it in as a pull in the R script.

dfsnow commented 1 month ago

Assuming testing comes back fine, looks good to me, though this all begs the question: is there a point to having the input table exist rather than just being a pull in the python script to generate stats any longer if both processes are handled by dbt on a daily basis and the input table serves no function outside feeding the script? It probably didn't need to exist in the first place. Just a thought.

Probably not tbh, we can likely ditch the input table. However, I'm going to punt that to a follow-up issue since I want to get this merged.