Closed dfsnow closed 1 month ago
@jeancochrane Hit another small roadblock on this after a bunch of permissions debugging. Seems like spark can't write tables with a dash in the name. Not an issue in prod, but doesn't work with our kebab-case name spacing schema. Do you think it's worth changing the macro to force everything to snake_case?
@jeancochrane Hit another small roadblock on this after a bunch of permissions debugging. Seems like spark can't write tables with a dash in the name. Not an issue in prod, but doesn't work with our kebab-case name spacing schema. Do you think it's worth changing the macro to force everything to snake_case?
@dfsnow It's a little bit annoying to have to change our naming scheme, but there's not actually a huge difference between the hyphenated vs. underscored versions so I don't mind. Were you able to find documentation confirming this limitation with Athena PySpark? I couldn't find anything so I'm wondering if the root cause is actually an error in the underlying plugin implementation (e.g. not using backticks to escape special characters).
@jeancochrane Hit another small roadblock on this after a bunch of permissions debugging. Seems like spark can't write tables with a dash in the name. Not an issue in prod, but doesn't work with our kebab-case name spacing schema. Do you think it's worth changing the macro to force everything to snake_case?
@dfsnow It's a little bit annoying to have to change our naming scheme, but there's not actually a huge difference between the hyphenated vs. underscored versions so I don't mind. Were you able to find documentation confirming this limitation with Athena PySpark? I couldn't find anything so I'm wondering if the root cause is actually an error in the underlying plugin implementation (e.g. not using backticks to escape special characters).
Sadly, there are official docs supporting this. I also forked dbt-athena
and tested various backtick configurations with the fork, none of them worked. I think basically only the z_ci_
DBs ever even use hyphens, so it's not a major change. But yes, annoying.
@jeancochrane @wrridgeway This could probably use one more look before merging. I'll post the diff between the current reporting.ratio_stats
and the new version tomorrow.
A quick diff shows that the old ratio_stats
and new one are basically identical but for the following differences:
_ci
columns have different valuesmki
column has different values due to https://github.com/ccao-data/assesspy/pull/19detect_chasing
is removed in the new version (replaced with False
), due to reasons I outlined earlier. It's not a reliable measurement of chasing and it doesn't work in Spark for some reasonAll that said, I think this is ready to merge. I've kept it as close as I can to the original ratio_stats
table. FYI @ccao-jardine
Assuming testing comes back fine, looks good to me, though this all begs the question: is there a point to having the input table exist rather than just being a pull in the python script to generate stats any longer if both processes are handled by dbt on a daily basis and the input table serves no function outside feeding the script? It probably didn't need to exist in the first place.
This seems similar to EI issues where we have a separate SQL script and read it in as a pull in the R script.
Assuming testing comes back fine, looks good to me, though this all begs the question: is there a point to having the input table exist rather than just being a pull in the python script to generate stats any longer if both processes are handled by dbt on a daily basis and the input table serves no function outside feeding the script? It probably didn't need to exist in the first place. Just a thought.
Probably not tbh, we can likely ditch the input table. However, I'm going to punt that to a follow-up issue since I want to get this merged.
This PR migrates the
ratio_stats
Glue job to a dbt Python model that uses Athena's Spark integration as the Python runtime. There are some tradeoffs here:Pros
ratio_stats
table with the DAG, meaning it will get rebuild on upstream changes AND it works with dbt profiles (so changes can first be built in CI buckets)Cons
dbt-athena
.IMO the pros are worth it here, but it's probably worth it to merge this and keep it up as a test for a month or so before integrating it more deeply.