Gemma-Analytics / ewah

ELT With Airflow Helper - Classes and functions to make apache airflow life easier
MIT License

Refactor Operator Logic to Avoid Package Dependency Conflicts #9

Open soltanianalytics opened 4 years ago

soltanianalytics commented 4 years ago

Currently, all operators are based on an EWAHBaseOperator that contains all necessary functionality for loading data into the DWH, while each individual operator contains the logic for extracting data in its execute() function. Because every operator ships in the same package, the dependencies of all data sources have to be installed side by side, which will inevitably lead to dependency conflicts. To avoid this in the future, I see two options:

Initially I preferred the Kubernetes pod operator option, but this would limit the use cases of EWAH by excluding all users who are unwilling or unable to use Kubernetes; the alternative would have been some ugly hybrid, which I didn't exactly fancy either.

Though it requires more refactoring, the second option now seems preferable to me and comes with a few unexpected upsides. What follows is a high-level description of how I imagine EWAH operators to work in the future.

Recap of general EWAH Extract and Load logic

How an EWAHBaseOperator based on the Python virtualenv operator might be constructed logically
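
As a rough sketch of the idea, and not the project's actual design: the extract step can run in a separate Python interpreter so that its dependencies never share an environment with the loading code. The example below uses only the standard library. The names `run_isolated_extract`, `EXTRACT_SOURCE` and `load_rows`, as well as the JSON-over-stdout exchange, are hypothetical choices made for illustration. In a real setup, `python_bin` would presumably point at the interpreter of a dedicated virtualenv, which is roughly what Airflow's PythonVirtualenvOperator arranges.

```python
import json
import subprocess
import sys

# Hypothetical extract script. In practice it would import the
# source-specific client library installed only in its own virtualenv.
EXTRACT_SOURCE = """
import json, sys
params = json.loads(sys.argv[1])
rows = [{"id": i, "source": params["source"]} for i in range(params["n"])]
json.dump(rows, sys.stdout)
"""


def run_isolated_extract(params, python_bin=sys.executable):
    """Run the extract script in a separate interpreter and return its rows.

    Assumption: python_bin defaults to the current interpreter so the
    sketch runs anywhere; an isolated setup would pass a virtualenv's
    interpreter path instead.
    """
    result = subprocess.run(
        [python_bin, "-c", EXTRACT_SOURCE, json.dumps(params)],
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if the child process fails
    )
    return json.loads(result.stdout)


def load_rows(rows, target):
    """Stand-in for the shared DWH loading logic: append rows to a list."""
    target.extend(rows)
    return len(rows)
```

Only the extract side crosses the process boundary here; the loading code stays in the parent process and never needs the source's packages.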

soltanianalytics commented 3 years ago

With Airflow 2.0.0, there is another option: building individual provider packages.

Steps

  1. Make EWAH itself a (single monolithic) provider package and introduce hooks in EWAH
  2. Refactor operators such that the data loading happens via function calls to the hooks, and limit the logic of operators themselves (they should really only fetch data and upload it)
  3. Split EWAH into many individual provider packages, one per data source
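
Step 2 could look roughly like the sketch below. It is an illustration only: `InMemoryDWHHook`, `ExampleSourceOperator` and `fetch_rows` are hypothetical names, and plain Python classes stand in for what would be Airflow hook and operator subclasses in EWAH itself.

```python
class InMemoryDWHHook:
    """Hypothetical uploader hook; a dict stands in for the DWH."""

    def __init__(self):
        self.tables = {}

    def upload(self, table, rows):
        # Append rows to the target table and report how many were written.
        self.tables.setdefault(table, []).extend(rows)
        return len(rows)


class ExampleSourceOperator:
    """Hypothetical operator that only fetches data and hands it to a hook."""

    def __init__(self, hook, table, fetch_rows):
        self.hook = hook              # all loading logic lives in the hook
        self.table = table            # name of the target table
        self.fetch_rows = fetch_rows  # source-specific extraction callable

    def execute(self, context=None):
        # Fetch from the source, then upload via a function call to the hook.
        rows = list(self.fetch_rows())
        return self.hook.upload(self.table, rows)
```

With the loading logic behind a hook, an operator like this depends only on its own data source's packages, which is what would make the later split into one provider package per data source feasible.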