Task gets stuck when the DB is unreachable

karakanb commented 2 years ago

Apache Airflow version

Other Airflow 2 version (please specify below)

What happened

I have been experiencing the tasks are getting stuck in running state forever when the DB is unreachable during the scheduling period for some reason.

My setup:

Airflow v2.3.4 on Kubernetes
Postgres as the DB + pgbouncer in front of it, managed by DigitalOcean

Here's a how it happened:

I have some hourly pipelines, all running various reporting tasks.
Right around the beginning of the hour on Oct 5th, around 10AM UTC, the database became unreachable for some reason.
The worker and scheduler instances start noticing the problem, the earliest 10:00:07 UTC.
The issue takes ~2 mins to resolve.
Around that time, a Dagrun is started for one of the hourly pipelines:

This task instance gets stuck until we have cleared it out 12 days later, today.

I have all the logs from all the Airflow components, but there is no mention of this specific tasks in the logs. There are many failure notifications. I'll share all the logs below, and in there you'll see some "ops-tracker` instances, but they are not from the instance that is stuck, they belong to a different pipeline. There is no logs / mention of the task instance that was stuck, nothing at all.

In the end, the task instance got stuck in a "running" state for 12 days, and it prevented running other dagruns because I use AIRFLOW__CORE__MAX_ACTIVE_RUNS_PER_DAG set to 1 intentionally.

The pipeline has the following default args:

default_args = {
        "depends_on_past": False,
        "retries": 3,
        "retry_delay": timedelta(seconds=30),
        "trigger_rule": TriggerRule.NONE_FAILED,
        "execution_timeout": timedelta(hours=24),
    }

The timeouts didn't help at all.

What you think should happen instead

The task should have either not been scheduled or marked as failed after some period.
Scheduling a task should be an atomic operation, and once it is scheduled it should be able to resolve itself, either as a failure or as a success.
Execution timeout should have been able to catch this failure at most after that time.

How to reproduce

I imagine this would be a very tricky setup to reproduce, but here's what I think happens:

Run airflow
Right in the beginning of scheduling, during that second, kill the DB connection
Hopefully, the same problem should occur

Operating System

Debian GNU/Linux 11 (bullseye)

Versions of Apache Airflow Providers

apache-airflow-providers-amazon==5.0.0 apache-airflow-providers-celery==3.0.0 apache-airflow-providers-cncf-kubernetes==4.3.0 apache-airflow-providers-common-sql==1.1.0 apache-airflow-providers-docker==3.1.0 apache-airflow-providers-elasticsearch==4.2.0 apache-airflow-providers-ftp==3.1.0 apache-airflow-providers-google==8.3.0 apache-airflow-providers-grpc==3.0.0 apache-airflow-providers-hashicorp==3.1.0 apache-airflow-providers-http==4.0.0 apache-airflow-providers-imap==3.0.0 apache-airflow-providers-microsoft-azure==4.2.0 apache-airflow-providers-microsoft-mssql==2.0.1 apache-airflow-providers-mysql==3.2.0 apache-airflow-providers-odbc==3.1.1 apache-airflow-providers-postgres==5.2.0 apache-airflow-providers-redis==3.0.0 apache-airflow-providers-sendgrid==3.0.0 apache-airflow-providers-sftp==4.0.0 apache-airflow-providers-slack==5.1.0 apache-airflow-providers-snowflake==2.7.0 apache-airflow-providers-sqlite==3.2.0 apache-airflow-providers-ssh==2.3.0

Deployment

Other 3rd-party Helm chart

Deployment details

Nothing, just Kubernetes + DigitalOcean Postgres with PgBouncer managed by DigitalOcean.

Anything else

This occurred on multiple pipelines on the same day at the same time.

Are you willing to submit PR?

[X] Yes I am willing to submit a PR!

Code of Conduct

[X] I agree to follow this project's Code of Conduct

karakanb commented 2 years ago

Attaching the logs here: logs.txt

potiuk commented 2 years ago

@karakanb can you please take a look at your history/monitoring if any of the components of Airlfow (including pgbouncer) have restarted around the time when it happened? If so, can you please detail the restart events that you saw? I am particularly interested if there was any scheduler restart. Did you attempt to restart scheduler manually to fix the problem?

From the logs you can see - there are multiple dag file processor "fatal" errors but no scheduler restart caused by the outage.

sqlalchemy.exc.OperationalError: (psycopg2.OperationalError) connection to server at "my-db-instance.b.db.ondigitalocean.com" (10.110.0.17), port 25061 failed: FATAL: pgbouncer cannot connect to server

I think that one needs looking at @dstandish @ephraimbuddy @ashb @uranusjr - I saw other people reporting similar issues when there is a temporary problem with the database and my guts feeling tell me that this is the classic "zombie db application" problem - where application kind of works and keeps connections but some of the transactions got "completed" status and the application "thinks" that the transaction was successful, but the database failure prevented it from actual flushing the changes to the disk.

Of course we cannot do much about it on the DB side and in running Airflow, but I'd say we should crash hard scheduler whenever any of the processes or subprocesses gets "FATAL" error like that. Arflow has built in mechanism to reconcile its state whenever it gets restarted, and if the database has problems, it will fail to restart (and wlll be restarted until the DB is back). So if my guess was right, just restarting scheduler should have eventually fix the problem.

If my guess is right - We can of course tell users to restart scheduler in such cases, but this kind of error might get unnoticed by the user so it would have been much better if we detect such fatal errors happening and simply crash scheduler when it happens. That would make self-healing after such catastrophic events.

Let me know what you think.

hterik commented 2 years ago

I've seen tasks getting stuck silently inside the airflow db check command, which is part of the Entrypoint of the airflow docker container. It has a loop both in the entrypoint itself, CONNECTION_CHECK_MAX_COUNT, set to 20, that get multiplied with your connect timeout which can be very long by default, maybe even infinite? I've seen examples where it get stuck hanging here for hours even after the DB is recovered.

If you use KubernetesExecutor, this will be the first thing happening whenever a task is started. It doesn't log anything before starting and immediately goes into probing the database for a very very long time.

See https://github.com/apache/airflow/blob/main/Dockerfile#L952

Another problem with the scheduler is that if one of the threads inside crash, the process still keeps running. You need to monitor the scheduler heartbeat from externally and restart the scheduler whenever it becomes unhealthy. This became a lot easier in 2.4 which now has a dedicated health-probe for scheduler. If this is the problem, it should be visible with a banner on the top of the web page.

potiuk commented 2 years ago

I've seen tasks getting stuck silently inside the airflow db check command, which is part of the Entrypoint of the airflow docker container. It has a loop both in the entrypoint itself, CONNECTION_CHECK_MAX_COUNT, set to 20, that get multiplied with your connect timeout which can be very long by default, maybe even infinite? I've seen examples where it get stuck hanging here for hours even after the DB is recovered.

Ideas on other strategies? What have you see working for you @hterik ? I think we can improve that - current defaults have been takenf from the original Astronomer image, but maybe we can do better? WDYT?

Another problem with the scheduler is that if one of the threads inside crash, the process still keeps running. You need to monitor the scheduler heartbeat from externally and restart the scheduler whenever it becomes unhealthy. This became a lot easier in 2.4 which now has a dedicated health-probe for scheduler. If this is the problem, it should be visible with a banner on the top of the web page.

This is interesting and should not (generally) happen. Do you have an example of that @hterik? IMHO that is exactly what is my point about "crashing hard whenever any crash occured. Seeing examples of when it happened would be super helpful (for reproduction and fix).

QuintenBruynseraede commented 1 year ago

I believe we are dealing with a very similar issue, in our setup the pgbouncer pod is receiving a SIGTERM signal every 6-12 hours from an external source. Is that also the case here? @karakanb

Obviously pgbouncer being unavailable for a minute (while it restarts) also drops the DB connection, like you describe.

Edit For future reference, in our case Kubernetes was occasionally restarting the pgbouncer pod to up/downscale our cluster (especially during periods of high load). We managed to mitigate the problem for now by annotating the pod with cluster-autoscaler.kubernetes.io/safe-to-evict: "false"

karakanb commented 1 year ago

In my case I am using DigitalOcean's managed pgbouncer, therefore I do not have access to see if it is being restarted regularly or the timings of it unfortunately. All I could find was that the DB was not reachable and it caused this stuck state.

github-actions[bot] commented 6 months ago

This issue has been automatically marked as stale because it has been open for 365 days without any activity. There has been several Airflow releases since last activity on this issue. Kindly asking to recheck the report against latest Airflow version and let us know if the issue is reproducible. The issue will be closed in next 30 days if no further activity occurs from the issue author.

github-actions[bot] commented 2 weeks ago

This issue has been closed because it has not received response from the issue author.

apache / airflow