Closed — dfsnow closed this 2 months ago
Thanks for the great feedback @jeancochrane! I managed to simplify things quite a bit I think. Ready for one more round of review.
> From a big-picture design perspective, one thing I notice is that the `SparkJob` class is currently designed around the assumption that `cur`, `parid`, and `taxyr` are the only options for our filters/predicates/partitions. If we ever need to change that assumption, I expect we'll need to rewrite most of the `SparkJob` methods to accommodate new or changed options. I think this is probably fine, since we can always refactor the class to make the filter/predicate/partition options more general, but it's one way in which the abstraction as it's currently designed isn't maximally flexible.
I'm okay with this. I think it's really unlikely that we'll need to change it, and if we do then it shouldn't be too hard to rewrite. It's really simple (now) and we should keep it that way. That said, I rewrote the predicates part to be column agnostic, see below.
> I also found the distinction between filters, predicates, and partitions to be a bit confusing, since they all seem to be different kinds of logic conditions, but I think I finally grok it after reading the code.
Your understanding of these items is mostly correct, but I agree that they were really confusing. So, I scrapped the existing logic for something much simpler:

- Predicates no longer depend on `parid` and `taxyr`. They're just arbitrary SQL BETWEEN expressions loaded from a file for each job. This means we could load an arbitrary set of predicates per table. For example, `SALES` could be divided up by `saledt` instead of `parid`.
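As a rough sketch of this approach (the file path, format, and function name here are assumptions for illustration, not the repo's actual layout), a job could read one BETWEEN expression per line and hand the resulting list to Spark's JDBC reader, which accepts predicate strings and creates one read partition per predicate:

```python
# Hypothetical sketch: load arbitrary SQL BETWEEN predicates from a
# plain-text file, one expression per line. Spark's JDBC reader can take
# such a list via its `predicates` argument, e.g.
# spark.read.jdbc(url, table, predicates=load_predicates(path), properties=props)

def load_predicates(path: str) -> list[str]:
    """Return the non-empty, stripped lines of a predicates file."""
    with open(path) as f:
        return [line.strip() for line in f if line.strip()]

# Example contents a job might use, chunking SALES by saledt instead of parid:
# saledt BETWEEN '2020-01-01' AND '2020-12-31'
# saledt BETWEEN '2021-01-01' AND '2021-12-31'
```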
- Filters and partitions are now constructed from `taxyr` and/or `cur` whenever either is not null. For example, `min_year=2021`, `max_year=2023`, and `cur='Y'` will yield all data between 2021 and 2023 (inclusive), where `cur='Y'`, output to files partitioned by both `taxyr` and `cur`. If both the year values and `cur` are falsey, no filtering or partitioning is performed (we just grab the whole table).

> There are probably some opportunities for simplification here, even outside of an effort to make these conditions more flexible -- for example, the fact that the filters are enabled by the `use_predicates` job config option is a bit confusing, as is the way that the script uses `taxyr` as both a predicate and a filter. Still, I don't think any of this is super important to worry about right now!
Done! I think I've simplified quite a bit. Let me know what you think!
This PR represents an initial attempt to refactor the increasingly slow and unwieldy service-sqoop-iasworld to Spark. This refactor has numerous benefits:

- `ASMT_ALL WHERE taxyr IN (2023, 2024)` via sqoop = 1H30M
- `ASMT_ALL WHERE taxyr IN (2023, 2024)` via Spark = 14M
- `cur = 'Y'` values for 2023

Note that this PR only includes the extract and partitioning parts of the refactor. I figured it would be worthwhile to break the review up into more manageable chunks. I still need to add: