apache / airflow

Apache Airflow - A platform to programmatically author, schedule, and monitor workflows
https://airflow.apache.org/
Apache License 2.0
37.05k stars 14.29k forks source link

šŸ§Ÿ Slow Zombie query causes scheduler heartbeat failures without errors #32986

Open getaaron opened 1 year ago

getaaron commented 1 year ago

Apache Airflow version

2.6.3

What happened

Due to an error with our postgres database (the stats on the tables were stale) this query:

https://github.com/apache/airflow/blob/1e20ef215ab8e688dc4331513fc5df34db443e84/airflow/jobs/scheduler_job_runner.py#L1686-L1698

took a very long time to return. During this time, heartbeats were not written, which caused health check failures (including k8s start / liveness check failures).

It took several days of debugging to track down the cause because airflow does not log any errors in this case. We resolved it by running ANALYZE.

What you think should happen instead

Airflow should log warnings/errors if queries that are expected to return quickly take a long time to return.

How to reproduce

  1. Make your postgres database slow (either don't analyze statistics, or just change the query to something like SELECT pg_sleep(2400); for testing)
  2. Try to run the airflow scheduler
  3. Notice that heartbeats are not written frequently and no warnings or errors are logged

Operating System

Debian GNU/Linux 11 (bullseye)

Versions of Apache Airflow Providers

n/a

Deployment

Docker-Compose

Deployment details

No response

Anything else

No response

Are you willing to submit PR?

Code of Conduct

RNHTTR commented 1 year ago

It seems like a tall order to ask Airflow to detect long-running queries. Given that this was "an error with [y]our postgres database", it feels like this isn't really an Airflow issue?

getaaron commented 1 year ago

I hope it's not a tall order, although I'm not familiar with Airflow's database code. Most database client libraries and ORMs have the ability to set timeouts on their SQL queries. A simple approach could be to add a default 60 second timeout in the session:

SET statement_timeout = ā€˜60sā€™;

This could be overridden if there are queries that are expected to run long, and could be configurable via an Airflow environment variable.

I think it's an Airflow problem because:

  1. The Airflow scheduler should be designed not to hang when a database (or other dependency) is slow to respond
  2. Airflow supports health check functionality, which should be designed to be useful. The health check functionality is much less helpful if Airflow gives no indication about why the health check has failed, when it reasonably could.

To be clear, I'm not asking Airflow to diagnose the problem with the database, simply to emit a log which points to the database as a culprit.