Closed — dfsnow closed this 2 months ago
Thanks for the great feedback @jeancochrane! I managed to simplify things quite a bit I think. Ready for one more round of review.
> From a big-picture design perspective, one thing I notice is that the `SparkJob` class is currently designed around the assumption that `cur`, `parid`, and `taxyr` are the only options for our filters/predicates/partitions. If we ever need to change that assumption, I expect we'll need to rewrite most of the `SparkJob` methods to accommodate new or changed options. I think this is probably fine, since we can always refactor the class to make the filter/predicate/partition options more general, but it's one way in which the abstraction as it's currently designed isn't maximally flexible.
I'm okay with this. I think it's really unlikely that we'll need to change it, and if we do then it shouldn't be too hard to rewrite. It's really simple (now) and we should keep it that way. That said, I rewrote the predicates part to be column agnostic, see below.
> I also found the distinction between filters, predicates, and partitions to be a bit confusing, since they all seem to be different kinds of logic conditions, but I think I finally grok it after reading the code.
Your understanding of these items is mostly correct, but I agree that they were really confusing. So, I scrapped the existing logic for something much simpler:

- Predicates no longer depend on `parid` and `taxyr`. They're just arbitrary SQL BETWEEN expressions loaded from a file for each job. This means we could load an arbitrary set of predicates per table. For example, `SALES` could be divided up by `saledt` instead of `parid`.
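As a rough sketch of this approach (the file path, format, and function name here are assumptions for illustration, not the repo's actual layout), a job could read one BETWEEN expression per line and hand the resulting list to Spark's JDBC reader, which accepts predicate strings and creates one read partition per predicate:

```python
# Hypothetical sketch: load arbitrary SQL BETWEEN predicates from a
# plain-text file, one expression per line. Spark's JDBC reader can take
# such a list via its `predicates` argument, e.g.
# spark.read.jdbc(url, table, predicates=load_predicates(path), properties=props)

def load_predicates(path: str) -> list[str]:
    """Return the non-empty, stripped lines of a predicates file."""
    with open(path) as f:
        return [line.strip() for line in f if line.strip()]

# Example contents a job might use, chunking SALES by saledt instead of parid:
# saledt BETWEEN '2020-01-01' AND '2020-12-31'
# saledt BETWEEN '2021-01-01' AND '2021-12-31'
```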
- Filters and partitions are now constructed from `taxyr` and/or `cur` whenever either is not null. For example, `min_year=2021`, `max_year=2023`, and `cur='Y'` will yield all data between 2021 and 2023 (inclusive), where `cur='Y'`, output to files partitioned by both `taxyr` and `cur`. If both the year values and `cur` are falsey, no filtering or partitioning is performed (we just grab the whole table).

> There are probably some opportunities for simplification here, even outside of an effort to make these conditions more flexible -- for example, the fact that the filters are enabled by the `use_predicates` job config option is a bit confusing, as is the way that the script uses `taxyr` as both a predicate and a filter. Still, I don't think any of this is super important to worry about right now!
Done! I think I've simplified quite a bit. Let me know what you think!
This PR represents an initial attempt to refactor the increasingly slow and unwieldy service-sqoop-iasworld to Spark. This refactor has numerous benefits:

- `ASMT_ALL WHERE taxyr IN (2023, 2024)` via sqoop = 1H30M
- `ASMT_ALL WHERE taxyr IN (2023, 2024)` via Spark = 14M
- `cur = 'Y'` values for 2023

Note that this PR only includes the extract and partitioning parts of the refactor. I figured it would be worthwhile to break the review up into more manageable chunks. I still need to add: