ccao-data / service-spark-iasworld

Service for extracting tables from the CCAO system-of-record and uploading them to the Data Department's data warehouse
GNU Affero General Public License v3.0
0 stars 0 forks source link

Add initial Spark refactor #1

Closed dfsnow closed 2 months ago

dfsnow commented 2 months ago

This PR represents an initial attempt to refactor the increasingly slow and unwieldy service-sqoop-iasworld to Spark. This refactor has numerous benefits:

Note that this PR only includes the extract and partitioning parts of the refactor. I figured it would be worthwhile to break the review up into more manageable chunks. I still need to add:

dfsnow commented 2 months ago

Thanks for the great feedback @jeancochrane! I managed to simplify things quite a bit I think. Ready for one more round of review.

From a big-picture design perspective, one thing I notice is that the SparkJob class is currently designed around the assumption that cur, parid, and taxyr are the only options for our filters/predicates/partitions. If we ever need to change that assumption, I expect we'll need to rewrite most of the SparkJob methods to accommodate new or changed options. I think this is probably fine, since we can always refactor the class to make the filter/predicate/partition options more general, but it's one way in which the abstraction as it's currently designed isn't maximally flexible.

I'm okay with this. I think it's really unlikely that we'll need to change it, and if we do then it shouldn't be too hard to rewrite. It's really simple (now) and we should keep it that way. That said, I rewrote the predicates part to be column agnostic, see below.

I also found the distinction between filters, predicates, and partitions to be a bit confusing, since they all seem to be different kinds of logic conditions, but I think I finally grok it after reading the code.

Your understanding of these items is mostly correct, but I agree that they were really confusing. So, I scrapped the existing logic for something much simpler:

There are probably some opportunities for simplification here, even outside of an effort to make these conditions more flexible -- for example, the fact that the filters are enabled by the use_predicates job config option is a bit confusing, as is the way that the script uses taxyr as both a predicate and a filter. Still, I don't think any of this is super important to worry about right now!

Done! I think I've simplified quite a bit. Let me know what you think!