aws / amazon-mwaa-docker-images


Refactor logging to be async #134

Closed · Kytha closed this pull request 2 months ago

Kytha commented 2 months ago

Description of changes:

Performance testing of the MWAA Airflow images revealed flaws in how logging is plumbed to CloudWatch.

Currently, when Airflow publishes log statements, they are sent to an OS pipe that is polled and read by the entrypoint process and then forwarded to CloudWatch. When this pipe is full, Airflow must halt until the logs are read before it can continue executing. This means Airflow execution is upper-bounded by how fast the entrypoint can read logs.

The entrypoint has many other priorities in this critical loop (heartbeat healthcheck, DB healthcheck, polling other subprocess logs, etc.), and it has been measured that the Airflow scheduler is significantly throttled by this mechanism when log volume is large.
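To make the problem concrete, here is a minimal sketch of this kind of single-loop pattern. The helper names, stubbed healthchecks, and overall structure are hypothetical and only illustrative; they are not the actual MWAA entrypoint code.

```python
import os
import select
import time

import boto3


def emit_heartbeat() -> None:
    pass  # placeholder for the heartbeat healthcheck


def check_database() -> None:
    pass  # placeholder for the DB healthcheck


def entrypoint_loop(log_pipe_fd: int, log_group: str, log_stream: str) -> None:
    """One loop that interleaves healthchecks with draining the Airflow log pipe."""
    logs = boto3.client("logs")
    while True:
        emit_heartbeat()
        check_database()

        # Drain whatever is readable right now. If the pipe fills up before
        # the loop gets back here, Airflow's log writes block and execution
        # stalls until this read catches up.
        ready, _, _ = select.select([log_pipe_fd], [], [], 0)
        if ready:
            data = os.read(log_pipe_fd, 65536)
            if data:
                logs.put_log_events(
                    logGroupName=log_group,
                    logStreamName=log_stream,
                    logEvents=[{
                        "timestamp": int(time.time() * 1000),
                        "message": data.decode(errors="replace"),
                    }],
                )

        time.sleep(1)  # the rest of the loop's work happens here
```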

This PR aims to fix this by separating log polling into its own thread, which allows log reading to proceed faster, decoupled from the execution of the entrypoint process loop. Below are "before" and "after" profiler flamegraph results of an MWAA scheduler over a 1-minute period of execution under heavy scheduling and log load.
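A minimal sketch of the threaded approach follows, assuming a hypothetical `start_log_reader` helper; it shows the general shape of decoupling pipe reads from the main loop, not the exact code in this PR.

```python
import os
import queue
import threading


def start_log_reader(log_pipe_fd: int) -> "queue.Queue[bytes]":
    """Spawn a daemon thread that drains the log pipe into a queue."""
    log_queue: "queue.Queue[bytes]" = queue.Queue()

    def reader() -> None:
        # Blocking reads are fine here: only this thread waits on the pipe,
        # so Airflow can keep writing without stalling the entrypoint loop.
        while True:
            data = os.read(log_pipe_fd, 65536)
            if not data:
                break  # writer closed the pipe
            log_queue.put(data)

    threading.Thread(target=reader, daemon=True).start()
    return log_queue
```

The entrypoint loop (or a separate publisher) can then drain the queue with non-blocking `get_nowait()` calls and ship batches to CloudWatch, so log volume no longer dictates the loop's cadence.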

Before

[Screenshot: profiler flamegraph of the scheduler before async logging]

After

[Screenshot: profiler flamegraph of the scheduler after async logging]

In the "before" flamegraph, waiting for log messages to be read by the entrypoint is essentially all the scheduler was able to do during the entire 1-minute interval. With async logging, the scheduler is able to complete multiple scheduler loops in the same period.

dhegberg commented 2 months ago

On first inspection the change makes sense.

What is the cause of the long schedule_run_dag in the second image?

Kytha commented 2 months ago

> On first inspection the change makes sense.
>
> What is the cause of the long schedule_run_dag in the second image?

This call happened while the scheduler was making scheduling decisions for a given DAG run. If the DAG run has finished, the DB session is flushed to sync the session with the database, and that flush is what took so long. So during the 1-minute period I profiled, a task happened to finish and triggered this code path.

It does take a suspiciously long time, and this may be an area for optimization in the future, but I don't believe it is related to the logging changes here.