airflow-helm / charts

The User-Community Airflow Helm Chart is the standard way to deploy Apache Airflow on Kubernetes with Helm. Originally created in 2017, it has since helped thousands of companies create production-ready deployments of Airflow on Kubernetes.
https://github.com/airflow-helm/charts/tree/main/charts/airflow
Apache License 2.0

Triggerer's async thread was blocked #830

Open · WytzeBruinsma opened this issue 4 months ago

WytzeBruinsma commented 4 months ago

Checks

Chart Version

8.8.0

Kubernetes Version

Client Version: version.Info{Major:"1", Minor:"27", GitVersion:"v1.27.4", GitCommit:"fa3d7990104d7c1f16943a67f11b154b71f6a132", GitTreeState:"clean", BuildDate:"2023-07-19T12:20:54Z", GoVersion:"go1.20.6", Compiler:"gc", Platform:"darwin/arm64"}
Kustomize Version: v5.0.1
Server Version: version.Info{Major:"1", Minor:"27", GitVersion:"v1.27.7", GitCommit:"55a7e688f9220adca1c99b7903953911dd38b771", GitTreeState:"clean", BuildDate:"2023-11-03T12:18:23Z", GoVersion:"go1.20.10", Compiler:"gc", Platform:"linux/amd64"}

Helm Version

version.BuildInfo{Version:"v3.12.3", GitCommit:"3a31588ad33fe3b89af5a2a54ee1d25bfe6eaa5e", GitTreeState:"clean", GoVersion:"go1.20.7"}

Description

The Airflow triggerer pod is raising errors and slowing down Airflow processes. The error is: "Triggerer's async thread was blocked for 0.23 seconds, likely by a badly-written trigger. Set PYTHONASYNCIODEBUG=1 to get more information on overrunning coroutines." I tried resolving this by increasing the resources, but even after removing all the limits and giving it 10 GB RAM and lots of CPU headroom it still raises this error. I also checked the response times of the Postgres database and couldn't find anything that could slow down the async process and cause this error. Please let me know what other steps I can take to resolve this error.

Relevant Logs

2024-02-23 01:48:09.268 
[2024-02-23T00:48:09.267+0000] {triggerer_job_runner.py:573} INFO - Triggerer's async thread was blocked for 0.23 seconds, likely by a badly-written trigger. Set PYTHONASYNCIODEBUG=1 to get more information on overrunning coroutines.
2024-02-23 09:00:44.327 
[2024-02-23T08:00:44.325+0000] {triggerer_job_runner.py:573} INFO - Triggerer's async thread was blocked for 0.38 seconds, likely by a badly-written trigger. Set PYTHONASYNCIODEBUG=1 to get more information on overrunning coroutines.
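
For context on what the log's suggestion does: PYTHONASYNCIODEBUG=1 turns on asyncio's debug mode, which names the exact coroutine or callback that held the event loop and for how long. A minimal standalone sketch of the same detection outside Airflow (the 0.3-second time.sleep() is just an invented stand-in for whatever blocking call a trigger might make):

```python
import asyncio
import time


async def misbehaving():
    # time.sleep() holds the event loop's thread; a well-behaved coroutine
    # would "await asyncio.sleep(0.3)" instead.
    time.sleep(0.3)


async def main():
    loop = asyncio.get_running_loop()
    loop.slow_callback_duration = 0.2  # report anything that holds the loop longer
    await misbehaving()


# debug=True is equivalent to running with PYTHONASYNCIODEBUG=1; asyncio then
# logs a warning naming the task that overran and how long it took.
asyncio.run(main(), debug=True)
```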

Custom Helm Values

No response

justplanenutz commented 3 weeks ago

We are seeing this as well, although the time values are a bit longer. Is there a tolerance variable we can set to make the async process a bit less time-sensitive?

thesuperzapper commented 3 weeks ago

@justplanenutz @WytzeBruinsma you should check the code of the trigger you are using; the message is probably correct that something is wrong with it. Did you write it yourself, or are you using one from the official providers?

Also, please note that this is an INFO-level log, so it's probably cosmetic. Are you seeing any issues related to it?

For your reference, here is the code (in airflow itself) that detects this condition and writes the log:

https://github.com/apache/airflow/blob/2.9.2/airflow/jobs/triggerer_job_runner.py#L557-L582
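
To make "badly-written" concrete: everything in a trigger's run() shares one event loop inside the triggerer, so any synchronous call in it stalls every other trigger at the same time. A rough, hypothetical sketch of the anti-pattern (the class name, module path, and URL parameter are invented for illustration, not taken from anyone's actual code):

```python
import time

import requests  # synchronous HTTP client -- the source of the blocking here

from airflow.triggers.base import BaseTrigger, TriggerEvent


class BadHttpTrigger(BaseTrigger):
    """Hypothetical trigger that blocks the triggerer's shared event loop."""

    def __init__(self, url: str):
        super().__init__()
        self.url = url

    def serialize(self):
        # (classpath, kwargs) so the triggerer can re-create the trigger
        return ("mymodule.triggers.BadHttpTrigger", {"url": self.url})

    async def run(self):
        while True:
            # Both calls below are synchronous, so every other trigger in this
            # triggerer is frozen while they run -- exactly what the
            # "async thread was blocked" log complains about.
            response = requests.get(self.url, timeout=10)
            if response.ok:
                yield TriggerEvent({"status": "done"})
                return
            time.sleep(30)  # should be: await asyncio.sleep(30)
```

An async-friendly version would use an async HTTP client (or push the blocking call into a worker thread with asyncio.to_thread) and replace time.sleep(30) with await asyncio.sleep(30).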

justplanenutz commented 3 weeks ago

@thesuperzapper The trigger has been running fine in the past, so we're confident the code is sound. We normally set our logs at INFO as well; filtering them is not the issue.
We are currently looking for any deltas in the code base that may have aggravated an edge condition.

justplanenutz commented 3 weeks ago

Our trigger process is running in Kubernetes, and we have collected metrics for CPU and memory usage. We noticed a significant increase in CPU and memory consumption just before the problems started. When we restart the pod it's all good... so maybe a resource leak of some kind?

thesuperzapper commented 3 weeks ago

@justplanenutz In any case, it's very unlikely to be related to this chart.
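
That "restart fixes it" pattern is consistent with something in the trigger (or a provider it calls) accumulating state, for example a client session opened on every poll and never closed. A hypothetical sketch of that shape, not a claim about your code (aiohttp is just an example async client here):

```python
import asyncio

import aiohttp


async def leaky_poll(url: str):
    while True:
        # A new session (connection pool, sockets, buffers) is created on every
        # iteration and never closed, so memory creeps up until the pod restarts.
        session = aiohttp.ClientSession()
        await session.get(url)
        await asyncio.sleep(30)


async def tidy_poll(url: str):
    # One session for the lifetime of the poll loop, responses released promptly.
    async with aiohttp.ClientSession() as session:
        while True:
            async with session.get(url) as resp:
                await resp.read()
            await asyncio.sleep(30)
```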

You should probably raise an issue upstream if you figure out what was causing it; feel free to link it here if you do.