aws / amazon-mwaa-docker-images

Apache License 2.0

Apply the necessary Airflow configuration on the top process itself (entrypoint.py) instead of just child processes. #99

Open rafidka opened 3 months ago

rafidka commented 3 months ago

Overview

To properly configure Airflow, we use environment variables rather than directly editing the airflow.cfg file, as this makes it easier to track the changes we are intentional about while otherwise leaving Airflow's defaults in place. These environment variables are defined in entrypoint.py and then passed down to every Airflow process at the time we spawn it.
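A minimal sketch of this pattern, assuming hypothetical function names and an illustrative (not actual) set of configuration keys; Airflow maps `AIRFLOW__<SECTION>__<KEY>` variables onto airflow.cfg entries:

```python
import os
import subprocess


def build_airflow_env() -> dict[str, str]:
    """Build the environment for child Airflow processes.

    Only intentional overrides are listed; everything else falls back
    to Airflow's defaults. The keys below are illustrative, not the
    actual set used by the image.
    """
    env = dict(os.environ)  # inherit the parent environment
    env.update({
        "AIRFLOW__METRICS__STATSD_ON": "True",
        "AIRFLOW__METRICS__STATSD_HOST": "localhost",
        "AIRFLOW__METRICS__STATSD_PORT": "8125",
    })
    return env


def run_airflow_subcommand(args: list[str]) -> int:
    """Launch an Airflow subcommand with the configuration applied."""
    process = subprocess.run(["airflow", *args], env=build_airflow_env())
    return process.returncode
```

Because the overrides are applied to the child's environment only, the parent entrypoint process itself is untouched, which is exactly the limitation discussed below.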

This has the advantage of being very clean and explicit, and it also keeps us in full control of what gets passed to every sub-process, as opposed to the previous (internal) images, where we defined and exported environment variables in the entrypoint.sh file.

One downside of this approach, however, is that the entrypoint.py process itself doesn't have the required Airflow configuration, meaning that any Airflow modules we import there will see the wrong configuration. One example where this can cause issues is reporting metrics via StatsD from entrypoint.py (or one of its sub-modules): if we simply import Airflow's Stats object, it will be built with the wrong configuration and won't report anything (see #100).
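The failure mode can be reproduced without Airflow: if a module snapshots configuration at creation time (much as Airflow's Stats object is configured when airflow.stats is first imported), setting the variables afterwards has no effect. A minimal illustration, with a hypothetical `StatsFactory` standing in for the real object:

```python
import os


class StatsFactory:
    """Mimics a module that reads configuration once, at import/creation time."""

    def __init__(self) -> None:
        # Snapshot taken when the object is created, like Airflow's Stats
        # object being configured on first import of airflow.stats.
        self.statsd_enabled = os.environ.get(
            "AIRFLOW__METRICS__STATSD_ON", "False") == "True"

    def incr(self, metric: str) -> None:
        if not self.statsd_enabled:
            return  # silently drops the metric: the bug described above
        print(f"reporting {metric}")


# Creating the object before the environment is configured...
stats = StatsFactory()
os.environ["AIRFLOW__METRICS__STATSD_ON"] = "True"  # ...means this is too late
stats.incr("entrypoint.start")  # nothing is reported
```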

To solve this issue, we need to hot-swap the os.environ object right after we build our environment variables here. The catch, however, is that this must happen before we import any Airflow module, which requires some substantial refactoring. For now I am avoiding it, as making substantial changes to the code base right before the launch is a bit risky. However, we should still work on this soon after launch, with proper testing.
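A sketch of the intended fix, with the import ordering being the essential part (function name and configuration keys are hypothetical):

```python
# entrypoint.py -- NOTE: no Airflow imports may appear above the
# os.environ.update() call below.
import os


def build_airflow_env_overrides() -> dict[str, str]:
    """Return the intentional configuration overrides (illustrative keys)."""
    return {
        "AIRFLOW__METRICS__STATSD_ON": "True",
        "AIRFLOW__METRICS__STATSD_HOST": "localhost",
    }


# Hot-swap the environment of the entrypoint process itself so that any
# Airflow module imported below sees the same configuration that child
# processes receive.
os.environ.update(build_airflow_env_overrides())

# Only now is it safe to import Airflow modules, e.g.:
# from airflow.stats import Stats
# Stats.incr("some.metric")
```

The refactoring risk mentioned above comes from enforcing this ordering across the whole module tree: a single transitive `import airflow...` executed before the update re-introduces the bug.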

Acceptance Criteria

Additional Info

Things to keep in mind: