airflow-helm / charts

The User-Community Airflow Helm Chart is the standard way to deploy Apache Airflow on Kubernetes with Helm. Originally created in 2017, it has since helped thousands of companies create production-ready deployments of Airflow on Kubernetes.
https://github.com/airflow-helm/charts/tree/main/charts/airflow
Apache License 2.0
665 stars 475 forks source link

liveness probe queries causing high database IO usage on airflow 2.5.3 #829

Closed bitsofdave closed 6 months ago

bitsofdave commented 9 months ago

Checks

Chart Version

8.7.1

Kubernetes Version

Client Version: version.Info{Major:"1", Minor:"22", GitVersion:"v1.22.1", GitCommit:"632ed300f2c34f6d6d15ca4cef3d3c7073412212", GitTreeState:"clean", BuildDate:"2021-08-19T15:38:26Z", GoVersion:"go1.16.6", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"24+", GitVersion:"v1.24.17-eks-5e0fdde", GitCommit:"221f136737816ce068aa975f082cd0bf81c7ad5f", GitTreeState:"clean", BuildDate:"2024-01-02T20:35:57Z", GoVersion:"go1.20.10", Compiler:"gc", Platform:"linux/amd64"}

Helm Version

version.BuildInfo{Version:"v3.8.0", GitCommit:"d14138609b01886f544b2025f5000351c9eb092e", GitTreeState:"clean", GoVersion:"go1.17.6"}

Description

This change to the liveness probes in support of airflow 2.6.0 is causing high usage database on IO when using airflow 2.5.3.

From what I can tell, the part of the query that is causing poor performance is this change:

-                          .query(SchedulerJob) \
+                          .query(Job) \
+                          .filter_by(job_type=SchedulerJobRunner.job_type) 

I believe the issue is that the SchedulerJob class in 2.5.3 doesn't set a value for job_type.

The new query doesn't correctly filter by job_type (notice the filter clause is job.job_type = job.job_type):

SELECT job.id, job.dag_id, job.state, job.job_type, job.start_date, job.end_date, job.latest_heartbeat, job.executor_class, job.hostname, job.unixname 
FROM job 
WHERE job.job_type = job.job_type AND job.hostname = %(hostname_1)s ORDER BY job.latest_heartbeat DESC 
 LIMIT %(param_1)s

Old query used in helm chart v8.7.0 and earlier used to filter byjob_type:

SELECT job.id, job.dag_id, job.state, job.job_type, job.start_date, job.end_date, job.latest_heartbeat, job.executor_class, job.hostname, job.unixname 
FROM job 
WHERE job.hostname = %(hostname_1)s AND job.job_type IN (__[POSTCOMPILE_job_type_1]) ORDER BY job.latest_heartbeat DESC 
 LIMIT %(param_1)s

While the new query does work, its inefficiency is can cause database IO to be throttled, resulting in liveness probe failures due to query timeouts.

Relevant Logs

Warning  Unhealthy         7m29s (x5 over 10m)    kubelet             Liveness probe failed: The SchedulerJob (id=97704647) for hostname 'airflow-scheduler-8586fc7d9f-br7vb' is not alive

Custom Helm Values

No response

thesuperzapper commented 7 months ago

@bitsofdave this was a great find!

It will be resolved by https://github.com/airflow-helm/charts/pull/853, which will be part of the 8.9.0 version of the chart.