karakanb closed this issue 2 weeks ago
@karakanb can you please take a look at your history/monitoring to see if any of the components of Airflow (including pgbouncer) restarted around the time when it happened? If so, can you please detail the restart events that you saw? I am particularly interested in whether there was any scheduler restart. Did you attempt to restart the scheduler manually to fix the problem?
From the logs you can see that there are multiple dag file processor "fatal" errors, but no scheduler restart caused by the outage.
sqlalchemy.exc.OperationalError: (psycopg2.OperationalError) connection to server at "my-db-instance.b.db.ondigitalocean.com" (10.110.0.17), port 25061 failed: FATAL: pgbouncer cannot connect to server
I think that one needs looking at @dstandish @ephraimbuddy @ashb @uranusjr - I saw other people reporting similar issues when there is a temporary problem with the database, and my gut feeling tells me that this is the classic "zombie db application" problem - where the application kind of works and keeps its connections, but some of the transactions got "completed" status and the application "thinks" the transaction was successful, while the database failure prevented it from actually flushing the changes to disk.
Of course we cannot do much about it on the DB side and in running Airflow, but I'd say we should crash the scheduler hard whenever any of its processes or subprocesses gets a "FATAL" error like that. Airflow has a built-in mechanism to reconcile its state whenever it gets restarted, and if the database has problems, the scheduler will fail to start (and will be restarted until the DB is back). So if my guess is right, just restarting the scheduler should eventually have fixed the problem.
If my guess is right - we can of course tell users to restart the scheduler in such cases, but this kind of error might go unnoticed by the user, so it would be much better if we detected such fatal errors and simply crashed the scheduler when they happen. That would make Airflow self-healing after such catastrophic events.
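Purely as a sketch of the idea (the wrapper function and the `scheduler_iteration` callable below are invented for illustration and do not reflect how the scheduler loop is actually structured):

```python
import logging
import os

from sqlalchemy.exc import OperationalError

log = logging.getLogger(__name__)


def run_with_hard_crash_on_fatal(scheduler_iteration):
    """Illustrative wrapper (not actual Airflow code): run one scheduler
    iteration and crash the whole process on a fatal DB error, so the
    process supervisor restarts it and Airflow can reconcile its state
    on startup."""
    try:
        scheduler_iteration()
    except OperationalError as err:
        # psycopg2 surfaces pgbouncer/postgres "FATAL" errors through the
        # wrapped DBAPI exception; crash hard instead of limping on.
        if "FATAL" in str(err.orig or err):
            log.critical("Fatal database error, exiting so the scheduler restarts cleanly: %s", err)
            os._exit(1)  # skip cleanup on purpose: in-memory state may be inconsistent
        raise
```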
Let me know what you think.
I've seen tasks getting stuck silently inside the `airflow db check` command, which is part of the entrypoint of the Airflow Docker container. The entrypoint itself has a retry loop, CONNECTION_CHECK_MAX_COUNT, set to 20, which gets multiplied by your connect timeout, which can be very long by default, maybe even infinite? I've seen examples where it stays stuck here for hours even after the DB has recovered.
If you use KubernetesExecutor, this is the first thing that happens whenever a task is started. It doesn't log anything before starting and immediately goes into probing the database for a very, very long time.
See https://github.com/apache/airflow/blob/main/Dockerfile#L952
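One way to bound that wait, as a sketch only: cap psycopg2's connect_timeout so each of the CONNECTION_CHECK_MAX_COUNT probes fails fast instead of hanging. The module name below is hypothetical, and it assumes an Airflow version where the sql_alchemy_connect_args option accepts an import path to a dict of DBAPI connect arguments.

```python
# my_airflow_settings.py -- hypothetical module; put it on PYTHONPATH and point
# AIRFLOW__DATABASE__SQL_ALCHEMY_CONNECT_ARGS at "my_airflow_settings.CONNECT_ARGS"
# (assumption: your Airflow version supports an import path here).
CONNECT_ARGS = {
    # Fail a single connection attempt after 10 seconds instead of waiting on
    # OS-level TCP timeouts, so the entrypoint's CONNECTION_CHECK_MAX_COUNT
    # (default 20) retries stay bounded.
    "connect_timeout": 10,
    # TCP keepalives help notice connections silently dropped by pgbouncer.
    "keepalives": 1,
    "keepalives_idle": 30,
    "keepalives_interval": 5,
    "keepalives_count": 3,
}
```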
Another problem with the scheduler is that if one of the threads inside it crashes, the process still keeps running. You need to monitor the scheduler heartbeat externally and restart the scheduler whenever it becomes unhealthy. This became a lot easier in 2.4, which now has a dedicated health probe for the scheduler. If this is the problem, it should be visible as a banner at the top of the web page.
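For the external monitoring part, a minimal sketch of such a check against the webserver's /health endpoint (it assumes the JSON shape Airflow 2.x documents for /health; the base URL and the 60-second staleness threshold are made-up values, and on 2.4+ you can probe the scheduler's own health-check server instead):

```python
# Hypothetical external liveness check: exit non-zero when the scheduler
# heartbeat reported by the webserver's /health endpoint looks stale, so a
# supervisor (systemd, Kubernetes, cron + alerting, ...) can restart it.
import sys
from datetime import datetime, timezone

import requests

WEBSERVER = "http://localhost:8080"  # assumption: adjust to your deployment
MAX_HEARTBEAT_AGE = 60               # seconds; assumption, tune to your setup

resp = requests.get(f"{WEBSERVER}/health", timeout=10)
resp.raise_for_status()
scheduler = resp.json()["scheduler"]

if scheduler["status"] != "healthy":
    sys.exit(1)

heartbeat = datetime.fromisoformat(scheduler["latest_scheduler_heartbeat"])
age = (datetime.now(timezone.utc) - heartbeat).total_seconds()
sys.exit(0 if age < MAX_HEARTBEAT_AGE else 1)
```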
> I've seen tasks getting stuck silently inside the `airflow db check` command, which is part of the entrypoint of the Airflow Docker container. The entrypoint itself has a retry loop, CONNECTION_CHECK_MAX_COUNT, set to 20, which gets multiplied by your connect timeout, which can be very long by default, maybe even infinite? I've seen examples where it stays stuck here for hours even after the DB has recovered.
Ideas on other strategies? What have you seen working for you @hterik? I think we can improve that - the current defaults were taken from the original Astronomer image, but maybe we can do better? WDYT?
> Another problem with the scheduler is that if one of the threads inside it crashes, the process still keeps running. You need to monitor the scheduler heartbeat externally and restart the scheduler whenever it becomes unhealthy. This became a lot easier in 2.4, which now has a dedicated health probe for the scheduler. If this is the problem, it should be visible as a banner at the top of the web page.
This is interesting and should not (generally) happen. Do you have an example of that @hterik? IMHO that is exactly my point about crashing hard whenever any crash occurs. Seeing examples of when it happened would be super helpful (for reproduction and a fix).
I believe we are dealing with a very similar issue; in our setup the pgbouncer pod receives a SIGTERM signal every 6-12 hours from an external source. Is that also the case here? @karakanb
Obviously pgbouncer being unavailable for a minute (while it restarts) also drops the DB connection, like you describe.
Edit:
For future reference, in our case Kubernetes was occasionally restarting the pgbouncer pod to scale our cluster up or down (especially during periods of high load). We managed to mitigate the problem for now by annotating the pod with cluster-autoscaler.kubernetes.io/safe-to-evict: "false".
In my case I am using DigitalOcean's managed pgbouncer, so unfortunately I do not have access to see whether it is being restarted regularly, or when. All I could find was that the DB was not reachable and that this caused the stuck state.
This issue has been automatically marked as stale because it has been open for 365 days without any activity. There have been several Airflow releases since the last activity on this issue. Kindly asking to recheck the report against the latest Airflow version and let us know if the issue is reproducible. The issue will be closed in the next 30 days if no further activity occurs from the issue author.
This issue has been closed because it has not received response from the issue author.
Apache Airflow version
Other Airflow 2 version (please specify below)
What happened
I have been experiencing tasks getting stuck in the running state forever when the DB is unreachable during the scheduling period for some reason.
My setup:
Here's how it happened:
This task instance stayed stuck until we cleared it out 12 days later, today.
I have all the logs from all the Airflow components, but there is no mention of this specific task in them. There are many failure notifications. I'll share all the logs below, and in there you'll see some `ops-tracker` instances, but they are not from the instance that is stuck; they belong to a different pipeline. There are no logs and no mention of the task instance that was stuck, nothing at all.
In the end, the task instance got stuck in a "running" state for 12 days, and it prevented other dagruns from running because I intentionally use AIRFLOW__CORE__MAX_ACTIVE_RUNS_PER_DAG set to 1. The pipeline has the following default args:
The timeouts didn't help at all.
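For illustration only (these are hypothetical values, not the actual default args of the affected pipeline), the timeouts referred to here are the usual per-task kind:

```python
from datetime import timedelta

# Hypothetical example of the kind of timeouts mentioned above; not the real
# default_args of the affected pipeline.
default_args = {
    "retries": 1,
    "retry_delay": timedelta(minutes=5),
    # Upper bound on how long a single task instance may execute once running.
    "execution_timeout": timedelta(hours=1),
}
```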
What you think should happen instead
How to reproduce
I imagine this would be a very tricky setup to reproduce, but here's what I think happens:
Operating System
Debian GNU/Linux 11 (bullseye)
Versions of Apache Airflow Providers
apache-airflow-providers-amazon==5.0.0
apache-airflow-providers-celery==3.0.0
apache-airflow-providers-cncf-kubernetes==4.3.0
apache-airflow-providers-common-sql==1.1.0
apache-airflow-providers-docker==3.1.0
apache-airflow-providers-elasticsearch==4.2.0
apache-airflow-providers-ftp==3.1.0
apache-airflow-providers-google==8.3.0
apache-airflow-providers-grpc==3.0.0
apache-airflow-providers-hashicorp==3.1.0
apache-airflow-providers-http==4.0.0
apache-airflow-providers-imap==3.0.0
apache-airflow-providers-microsoft-azure==4.2.0
apache-airflow-providers-microsoft-mssql==2.0.1
apache-airflow-providers-mysql==3.2.0
apache-airflow-providers-odbc==3.1.1
apache-airflow-providers-postgres==5.2.0
apache-airflow-providers-redis==3.0.0
apache-airflow-providers-sendgrid==3.0.0
apache-airflow-providers-sftp==4.0.0
apache-airflow-providers-slack==5.1.0
apache-airflow-providers-snowflake==2.7.0
apache-airflow-providers-sqlite==3.2.0
apache-airflow-providers-ssh==2.3.0
Deployment
Other 3rd-party Helm chart
Deployment details
Nothing, just Kubernetes + DigitalOcean Postgres with PgBouncer managed by DigitalOcean.
Anything else
This occurred on multiple pipelines on the same day at the same time.
Are you willing to submit PR?
Code of Conduct