Currently, all operators are based on an EWAHBaseOperator that contains all necessary functionality for loading data into the DWH and individual operators contain the logic of extracting data in the execute() function. This will inevitably lead to dependency conflicts. To avoid this in the future, I see two options:
refactor all operators as children of the kubernetes pod operator, and make sure that each image either contains only the required dependencies or that they are installed after the pod is spun up, or
refactor all operators as children of the python virtualenv operator, and make that they contain their dependencies to be installed within the virtualenv at execution
Initially I preferred the kubernetes pod operator options, but this would limit the usecases of EWAH by excluding all users who are unwilling or unable to use kubernetes; the alternatively would have been some ugly hybrid which I didn't exactly fancy either.
Though requiring more refactoring, the second option now seems preferrable to me and comes with a few unexpected upsides. What follows is a high-level description of how I imagine EWAH operators to work in the future.
Recap of general EWAH Extract and Load logic
Each data source has one DAG (full refresh) or set of DAGs (incremental load) that loads data into one raw data schema in the DWH
Each full-refresh EL DAG has three components: two schema tasks (kickoff and final) and one type extract and load tasks, which is instantiated once per table that is loaded from the data source
Incremental EL DAGs have an additional sensor task to make sure they are executed in order and backfill properly
on a table level, loading can occur with chunking of requests by either a timestamp or serial (integer) field
incremental models need a timestamp field to incrementally load by
also planned for the future: hybrid model which incrementally loads immutable data based on a serial
How an EWAHBaseOpator based on the python virtualenv operator might be constructed logically
Components of the base operator
init()
save all operator kwargs related to the base operator
test values of all operator kwargs for validity and compatibility
set the callable kwarg to a function of the class itself which is overwritten by child operators
call super()
execute()
contain logic for full-refresh, incremental loading, and various chunking mechanisms
calls super() for each chunk, returning the result of the callable
operator_callable()
function returning None which is overwritten by the child class and in the base operator returns None
With airflow 2.0.0, there is another option: Building individual provider packages.
Steps
Make EWAH itself a (single monolithic) provider package and introduce hooks in EWAH
Refactor operators such that the data loading happens via function calls to the hooks, and limit the logic of operators themselves (they should really only fetch data and upload it)
Split EWAH into many individual provider packages, one per data source
Currently, all operators are based on an
EWAHBaseOperator
that contains all necessary functionality for loading data into the DWH and individual operators contain the logic of extracting data in theexecute()
function. This will inevitably lead to dependency conflicts. To avoid this in the future, I see two options:Initially I preferred the kubernetes pod operator options, but this would limit the usecases of EWAH by excluding all users who are unwilling or unable to use kubernetes; the alternatively would have been some ugly hybrid which I didn't exactly fancy either.
Though requiring more refactoring, the second option now seems preferrable to me and comes with a few unexpected upsides. What follows is a high-level description of how I imagine EWAH operators to work in the future.
Recap of general EWAH Extract and Load logic
kickoff
andfinal
) and one type extract and load tasks, which is instantiated once per table that is loaded from the data sourceHow an
EWAHBaseOpator
based on the python virtualenv operator might be constructed logicallyinit()
super()
execute()
super()
for each chunk, returning the result of the callableoperator_callable()
None
which is overwritten by the child class and in the base operator returnsNone
init()