On first inspection the change makes sense.
What is the cause of the long schedule_run_dag in the second image?
This call happened while the scheduler was making a scheduling decision for a given dag run. If that dag run has finished, the db session is flushed to sync the session with the db, and that flush is what took so long. So during the 1-minute period I polled, a task happened to finish and triggered this code path.
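For context, here is a minimal sketch of the pattern being described, assuming a SQLAlchemy-style session (this is illustrative, not the actual Airflow scheduler code):

```python
from sqlalchemy.orm import Session

# Illustrative only: once a dag run is detected as finished, its state is updated
# and the ORM session is flushed so pending changes are pushed to the database
# before the scheduler moves on. The flush is the expensive step in the flamegraph.
def finalize_finished_dag_run(session: Session, dag_run) -> None:
    dag_run.state = "success"  # or "failed", depending on its tasks' outcomes
    session.flush()            # sync the in-memory session with the db
```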
It does take a suspiciously long time to do this, and may be an area of optimization in the future, but I don't believe it is related to any logging changes here.
Description of changes:
Performance testing of MWAA Airflow images revealed flaws in how logging is plumbed to CloudWatch.
Currently, when Airflow publishes log statements, they are sent to an OS pipe that is polled and read by the entrypoint process and then forwarded to CloudWatch. When this pipe is full, Airflow must halt until the logs are read before continuing execution. This means Airflow execution is upper-bounded by how fast the entrypoint can read logs.
The entrypoint has many other priorities in this critical loop (heartbeat healthcheck, db healthcheck, polling other subprocess logs, etc.), and it has been measured that the Airflow scheduler is significantly throttled by this mechanism when log volume is large.
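To make the bottleneck concrete, here is a minimal, hypothetical sketch of the mechanism described above (the function names are illustrative, not the entrypoint's real API): the writer blocks as soon as the kernel pipe buffer fills, so Airflow can only make progress as fast as the entrypoint polls.

```python
import os

# Illustrative sketch only, not MWAA's actual entrypoint code. It shows why a full
# OS pipe stalls the writer: os.write() on the write end blocks once the kernel
# pipe buffer fills up, and stays blocked until the reader drains it.
read_fd, write_fd = os.pipe()

def emit_log(line: bytes) -> None:
    # Called on the Airflow side. If the entrypoint has not read from read_fd and
    # the pipe buffer (typically 64 KiB on Linux) is full, this call blocks.
    os.write(write_fd, line)

def entrypoint_poll_once(max_bytes: int = 4096) -> bytes:
    # Called from the entrypoint's critical loop, in between heartbeat and db
    # healthchecks; the bytes read here would be forwarded to CloudWatch.
    return os.read(read_fd, max_bytes)
```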
This PR aims to fix this by separating log polling into its own thread, which allows faster log reading that is decoupled from the execution of the entrypoint process loop. Below are "before" and "after" profiler flamegraph results of an MWAA scheduler over a 1-minute period of execution under heavy scheduling and log load.
Before
After
In the before, waiting for a log message to be read by the entrypoint is all the scheduler was able to do in the entire 1-minute interval. With async logging, however, the scheduler is able to complete multiple loops.
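For illustration, here is a minimal, hypothetical sketch of the threaded approach described above, assuming a queue-based hand-off between the polling thread and the main loop (names like `poll_logs` and `forward_buffered_logs_to_cloudwatch` are made up for the example and are not the PR's actual code):

```python
import os
import queue
import threading

# Hypothetical sketch of the approach described in this PR, not the real entrypoint:
# a dedicated thread drains the log pipe continuously so Airflow's log writes are
# never gated on the entrypoint's main loop.
log_queue: "queue.Queue[bytes]" = queue.Queue()

def poll_logs(read_fd: int) -> None:
    # Runs in its own thread; blocking on the pipe no longer stalls the main loop.
    while True:
        chunk = os.read(read_fd, 4096)
        if not chunk:  # writer closed the pipe
            break
        log_queue.put(chunk)

def forward_buffered_logs_to_cloudwatch() -> None:
    # Drain whatever the polling thread has buffered and ship it to CloudWatch
    # (e.g. via a boto3 put_log_events call in the real entrypoint).
    while True:
        try:
            chunk = log_queue.get_nowait()
        except queue.Empty:
            break
        _ = chunk  # placeholder for the actual CloudWatch publish

def entrypoint_main_loop(read_fd: int) -> None:
    threading.Thread(target=poll_logs, args=(read_fd,), daemon=True).start()
    while True:
        # heartbeat healthcheck, db healthcheck, other subprocess management, etc.
        forward_buffered_logs_to_cloudwatch()
```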