Health monitoring and alerting for the Airflow scheduler

WordPress / openverse

Openverse is a search engine for openly-licensed media. This monorepo includes all application code.

https://openverse.org

MIT License


Health monitoring and alerting for the Airflow scheduler #2335

Open zackkrida opened 1 year ago

zackkrida commented 1 year ago

Problem

Currently, when Airflow encounters issues, we only become aware of them by happenstance: if someone peeks at the Airflow UI or a DAG reports an error. Recent issues we didn't detect include:

the Airflow scheduler failed
the DB connection to Postgres was lost after a DB reset

Description

Implement a healthcheck for the Airflow scheduler which sends an alert in AWS. Also look into the current state of Airflow logs in CloudWatch.

Ping /health on the webserver box and send a Slack ping to the alerts channel if any component is unhealthy. Add a cron job alongside the dag-sync script to run this check every 30 seconds.
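
As a rough sketch of the "any component is unhealthy" check: Airflow's /health endpoint returns a JSON object keyed by component (for example `metadatabase` and `scheduler`), each with a `status` field. The function name `unhealthy_components` and the sample payload are illustrative assumptions, not code from the repository. Components that report no status at all (such as a triggerer that is not deployed) are skipped here, which is also an assumption.

```python
from typing import Any


def unhealthy_components(health: dict[str, Any]) -> list[str]:
    """Return the names of components that explicitly report 'unhealthy'.

    `health` is the parsed JSON body of Airflow's /health endpoint.
    Components with a missing or null status are skipped (assumption:
    they are simply not deployed, e.g. an unused triggerer).
    """
    return sorted(
        name
        for name, info in health.items()
        if isinstance(info, dict) and info.get("status") == "unhealthy"
    )


# Illustrative payload, shaped like the /health response in the Airflow docs.
sample = {
    "metadatabase": {"status": "healthy"},
    "scheduler": {"status": "unhealthy", "latest_scheduler_heartbeat": None},
    "triggerer": {"status": None, "latest_triggerer_heartbeat": None},
}
print(unhealthy_components(sample))
```

Note that standard cron schedules at one-minute granularity, so a 30-second interval would need a wrapper loop or a second entry offset with `sleep 30`.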

AetherUnbound commented 1 year ago

I know the webserver monitors a heartbeat on the scheduler, and I have to wonder if that's something we could tap into. An alternative would be to have the scheduler exit so that the container restarts automatically. I'm not sure why the DB shutdown would not cause the scheduler to stop (the container was still running recently when it encountered the "database is shutting down" exception).
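
One way to tap into that heartbeat, sketched under assumptions: the /health response includes a `latest_scheduler_heartbeat` timestamp for the scheduler, so a check could treat the scheduler as down when that timestamp is too old. The helper name `heartbeat_is_stale` and the 30-second threshold are illustrative choices, not values taken from our configuration.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def heartbeat_is_stale(
    latest_heartbeat: Optional[str],
    now: datetime,
    max_age: timedelta = timedelta(seconds=30),  # illustrative threshold
) -> bool:
    """Return True if the scheduler heartbeat is missing or older than max_age.

    `latest_heartbeat` is the ISO-8601 string from the `scheduler` section
    of the /health response; `now` must be timezone-aware.
    """
    if not latest_heartbeat:
        return True  # no heartbeat recorded at all
    beat = datetime.fromisoformat(latest_heartbeat)
    if beat.tzinfo is None:
        # Assumption: a naive timestamp is in UTC.
        beat = beat.replace(tzinfo=timezone.utc)
    return now - beat > max_age


now = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
print(heartbeat_is_stale("2024-01-01T11:59:50+00:00", now))
print(heartbeat_is_stale("2024-01-01T11:58:00+00:00", now))
```

A script built on this could exit non-zero when the heartbeat is stale, which is the kind of signal a container healthcheck or restart policy might act on.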

sarayourfriend commented 1 year ago

Here are the airflow docs on health monitoring: https://airflow.apache.org/docs/apache-airflow/stable/administration-and-deployment/logging-monitoring/check-health.html

We can't access HTTP Airflow outside of our VPC, so we cannot rely on, for example, UptimeRobot to make requests to /health.

We could, however, add a cron job to the webserver box that calls the /health endpoint and sends a Slack ping (or something) if any of the status fields are unhealthy. If we worked on https://github.com/WordPress/openverse-infrastructure/issues/482 to enable tunnelling through Cloudflare Access so that we could make HTTP requests to Airflow, we could implement the job outside the box, in a GitHub cron action, for example, and write our own mini-uptime HTTP check. Theoretically this could be more stable/reliable than the box reporting its own health, but I don't think that's necessary, because we do run Airflow in Docker, so it isn't likely for the entire EC2 instance to crash.
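
A minimal sketch of what that cron job could run, using only the standard library. The environment variable names (`AIRFLOW_HEALTH_URL`, `SLACK_WEBHOOK_URL`), the default URL, and the function names are hypothetical; the only external behavior assumed is that /health returns JSON with a `status` per component and that a Slack incoming webhook accepts a JSON body with a `text` field.

```python
import json
import os
import urllib.request
from typing import Optional

# Hypothetical configuration; neither variable exists in the repository.
HEALTH_URL = os.environ.get("AIRFLOW_HEALTH_URL", "http://localhost:8080/health")
SLACK_WEBHOOK = os.environ.get("SLACK_WEBHOOK_URL", "")


def build_alert(health: dict) -> Optional[dict]:
    """Return a Slack message payload, or None if nothing reports unhealthy."""
    bad = sorted(
        name
        for name, info in health.items()
        if isinstance(info, dict) and info.get("status") == "unhealthy"
    )
    if not bad:
        return None
    return {"text": "Airflow health check failed: " + ", ".join(bad)}


def post_json(url: str, payload: dict) -> None:
    """POST a JSON payload, e.g. to a Slack incoming webhook."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        response.read()


def main() -> None:
    try:
        with urllib.request.urlopen(HEALTH_URL, timeout=10) as response:
            health = json.load(response)
    except (OSError, ValueError) as exc:
        # An unreachable webserver or a non-JSON body is itself worth an alert.
        alert = {"text": f"Airflow /health request failed: {exc}"}
    else:
        alert = build_alert(health)
    if alert and SLACK_WEBHOOK:
        post_json(SLACK_WEBHOOK, alert)
```

The cron entry would invoke `main()` (for instance under an `if __name__ == "__main__":` guard). Alerting on a failed request as well as on an unhealthy status covers the case where the webserver container itself is down, though not the case where the whole box is.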

sarayourfriend commented 1 year ago

To leverage AWS monitoring tools for this we'd need to put Airflow boxes into an ASG or target group + LB. Individual EC2 instances do not have "health checks" in the same way as those meta-resources, as far as I can tell.

AetherUnbound commented 1 year ago

I think a simple slack ping makes sense! We do something similar for the dag-sync on that box:

https://github.com/WordPress/openverse-infrastructure/blob/5bb2a1d9046a9734e66ec33cb734abdac1cc0503/modules/services/catalog-airflow/init.tpl#L130

AetherUnbound commented 1 year ago

The scheduler went down again recently because the upstream database restarted during the maintenance window. What confuses & frustrates me is that the scheduler clearly failed, but the container was still running and hadn't exited (which would have restarted it and thus meant that the scheduler came back online). Perhaps we can look into why that's happening as well as part of this effort.

Edit: I'm going to make a separate issue for that actually and investigate it.

AetherUnbound commented 8 months ago

"To leverage AWS monitoring tools for this we'd need to put Airflow boxes into an ASG or target group + LB."

Looking at this again, it appears that we do have Airflow behind a target group + LB: https://github.com/WordPress/openverse-infrastructure/blob/754fc882b93c41c4085f668880f197d7c89bc893/modules/services/catalog-airflow/load-balancer.tf#L47-L74

We also have the unhealthy host count alarm which could be leveraged here too - this says it's for ECS, but the only piece that's ECS-specific appears to be the log link.
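
For illustration, here is roughly what such an alarm amounts to, written as the keyword arguments for CloudWatch's `put_metric_alarm` call. This assumes the load balancer is an Application Load Balancer (hence the `AWS/ApplicationELB` namespace); the function name, the alarm name format, and the period and evaluation values are placeholders, and in practice this would more likely be expressed in Terraform alongside the existing alarm.

```python
def unhealthy_host_alarm_params(
    name: str,
    target_group_dimension: str,
    load_balancer_dimension: str,
    sns_topic_arn: str,
) -> dict:
    """Build put_metric_alarm arguments for an unhealthy host count alarm.

    The dimension values are the target group and load balancer
    identifiers as CloudWatch reports them. All arguments are supplied
    by the caller; nothing here is taken from the real infrastructure.
    """
    return {
        "AlarmName": f"{name}-unhealthy-host-count",  # placeholder naming scheme
        "Namespace": "AWS/ApplicationELB",  # assumes an ALB
        "MetricName": "UnHealthyHostCount",
        "Dimensions": [
            {"Name": "TargetGroup", "Value": target_group_dimension},
            {"Name": "LoadBalancer", "Value": load_balancer_dimension},
        ],
        "Statistic": "Maximum",
        "Period": 60,  # seconds; illustrative
        "EvaluationPeriods": 3,  # illustrative
        "Threshold": 0,
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [sns_topic_arn],
    }
```

With boto3 installed, these arguments could be passed as `boto3.client("cloudwatch").put_metric_alarm(**params)`. Nothing in the parameters is specific to ECS, which matches the observation that only the log link is.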

It might be possible to hook this up now, but it seems advisable to wait until https://github.com/WordPress/openverse/issues/2037 is complete. That project may alter the way Airflow is defined in Terraform, so it might be ideal to let the dust from that settle rather than add an alarm for Airflow before it can be moved to next/.

sarayourfriend commented 7 months ago

Let's use the unhealthy host count alarm and expand it to include EC2-only services as you suggested :+1: