The User-Community Airflow Helm Chart is the standard way to deploy Apache Airflow on Kubernetes with Helm. Originally created in 2017, it has since helped thousands of companies create production-ready deployments of Airflow on Kubernetes.
The new query doesn't correctly filter by job_type (notice the filter clause is job.job_type = job.job_type):
SELECT job.id, job.dag_id, job.state, job.job_type, job.start_date, job.end_date, job.latest_heartbeat, job.executor_class, job.hostname, job.unixname
FROM job
WHERE job.job_type = job.job_type AND job.hostname = %(hostname_1)s ORDER BY job.latest_heartbeat DESC
LIMIT %(param_1)s
Old query used in helm chart v8.7.0 and earlier used to filter byjob_type:
SELECT job.id, job.dag_id, job.state, job.job_type, job.start_date, job.end_date, job.latest_heartbeat, job.executor_class, job.hostname, job.unixname
FROM job
WHERE job.hostname = %(hostname_1)s AND job.job_type IN (__[POSTCOMPILE_job_type_1]) ORDER BY job.latest_heartbeat DESC
LIMIT %(param_1)s
While the new query does work, its inefficiency is can cause database IO to be throttled, resulting in liveness probe failures due to query timeouts.
Relevant Logs
Warning Unhealthy 7m29s (x5 over 10m) kubelet Liveness probe failed: The SchedulerJob (id=97704647) for hostname 'airflow-scheduler-8586fc7d9f-br7vb' is not alive
Checks
User-Community Airflow Helm Chart
.Chart Version
8.7.1
Kubernetes Version
Helm Version
Description
This change to the liveness probes in support of airflow 2.6.0 is causing high usage database on IO when using airflow 2.5.3.
From what I can tell, the part of the query that is causing poor performance is this change:
I believe the issue is that the SchedulerJob class in 2.5.3 doesn't set a value for
job_type
.The new query doesn't correctly filter by
job_type
(notice the filter clause isjob.job_type = job.job_type
):Old query used in helm chart v8.7.0 and earlier used to filter by
job_type
:While the new query does work, its inefficiency is can cause database IO to be throttled, resulting in liveness probe failures due to query timeouts.
Relevant Logs
Custom Helm Values
No response