getaaron opened this issue 1 year ago (status: Open)
It seems like a tall order to ask Airflow to detect long-running queries. Given that this was "an error with [y]our postgres database", it feels like this isn't really an Airflow issue?
I hope it's not a tall order, although I'm not familiar with Airflow's database code. Most database client libraries and ORMs can set timeouts on their SQL queries. A simple approach could be to add a default 60-second timeout in the session:
SET statement_timeout = '60s';
This could be overridden if there are queries that are expected to run long, and could be configurable via an Airflow environment variable.
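For illustration only, here is a minimal sketch of how such a default could be applied at the engine level with SQLAlchemy and psycopg2 (not Airflow's actual configuration surface; the connection URL and the 60-second value are placeholders):

```python
# Sketch: a default statement_timeout applied to every session created by
# the engine, using SQLAlchemy + psycopg2. DSN and values are placeholders.
from sqlalchemy import create_engine, text

engine = create_engine(
    "postgresql+psycopg2://airflow:airflow@localhost/airflow",  # placeholder DSN
    connect_args={"options": "-c statement_timeout=60000"},  # 60 s, in milliseconds
)

# A query that is expected to run long can opt out for its own transaction:
with engine.begin() as conn:
    conn.execute(text("SET LOCAL statement_timeout = 0"))  # no limit in this transaction only
    conn.execute(text("SELECT pg_sleep(120)"))  # would otherwise be cancelled at 60 s
```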
I think it's an Airflow problem because:
To be clear, I'm not asking Airflow to diagnose the problem with the database, simply to emit a log which points to the database as a culprit.
Apache Airflow version
2.6.3
What happened
Due to an error with our postgres database (the stats on the tables were stale) this query:
https://github.com/apache/airflow/blob/1e20ef215ab8e688dc4331513fc5df34db443e84/airflow/jobs/scheduler_job_runner.py#L1686-L1698
took a very long time to return. During this time, heartbeats were not written, which caused health check failures (including k8s start / liveness check failures).
It took several days of debugging to track down the cause because Airflow does not log any errors in this case. We resolved it by running ANALYZE.
What you think should happen instead
Airflow should log warnings/errors if queries that are expected to return quickly take a long time to return.
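As a hedged sketch of what such a warning could look like, the query time could be measured with SQLAlchemy engine events and logged when it exceeds a threshold; the threshold, logger name, and DSN below are illustrative, not existing Airflow settings:

```python
# Sketch of the proposed warning: time every query via SQLAlchemy engine
# events and log when it exceeds a threshold.
import logging
import time

from sqlalchemy import create_engine, event

log = logging.getLogger("airflow.slow_query_monitor")  # hypothetical logger name
SLOW_QUERY_THRESHOLD_S = 60.0  # hypothetical default, would be configurable

engine = create_engine("postgresql+psycopg2://airflow:airflow@localhost/airflow")  # placeholder DSN


@event.listens_for(engine, "before_cursor_execute")
def _start_timer(conn, cursor, statement, parameters, context, executemany):
    # Record the start time on the connection so the matching "after" hook can read it.
    conn.info["query_start_time"] = time.monotonic()


@event.listens_for(engine, "after_cursor_execute")
def _warn_if_slow(conn, cursor, statement, parameters, context, executemany):
    start = conn.info.pop("query_start_time", None)
    if start is None:
        return
    elapsed = time.monotonic() - start
    if elapsed > SLOW_QUERY_THRESHOLD_S:
        log.warning(
            "Query took %.1fs (threshold %.0fs): %s",
            elapsed, SLOW_QUERY_THRESHOLD_S, statement[:200],
        )
```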
How to reproduce
Make the query above take a very long time to return (you can use SELECT pg_sleep(2400); for testing).
Operating System
Debian GNU/Linux 11 (bullseye)
Versions of Apache Airflow Providers
n/a
Deployment
Docker-Compose
Deployment details
No response
Anything else
No response
Are you willing to submit PR?
Code of Conduct