apache / airflow

Apache Airflow - A platform to programmatically author, schedule, and monitor workflows
https://airflow.apache.org/
Apache License 2.0

scheduler gets stuck without a trace #7935

Closed. dimberman closed this issue 3 years ago

dimberman commented 4 years ago

Apache Airflow version:

Kubernetes version (if you are using kubernetes) (use kubectl version):

Environment:

The scheduler gets stuck without a trace or error. When this happens, the CPU usage of the scheduler service is at 100%. No jobs get submitted and everything comes to a halt. It looks like it goes into some kind of infinite loop. The only way I could make it run again is by manually restarting the scheduler service. But again, after running some tasks it gets stuck. I've tried with both the Celery and Local executors, but the same issue occurs. I am using the -n 3 parameter while starting the scheduler.

Scheduler configs:

job_heartbeat_sec = 5
scheduler_heartbeat_sec = 5
executor = LocalExecutor
parallelism = 32

Please help. I would be happy to provide any other information needed

What you expected to happen:

How to reproduce it:

Anything else we need to know:

Moved here from https://issues.apache.org/jira/browse/AIRFLOW-401

abhijit-kottur commented 4 years ago

I'm running Airflow 1.10.4 with the Celery executor in k8s. The scheduler pod is getting stuck while starting up, at the step 'Resetting orphaned tasks'.

[2020-03-31 19:34:36,955] {{__init__.py:51}} INFO - Using executor CeleryExecutor
  ____________       _____________
 ____    |__( )_________  __/__  /________      __
____  /| |_  /__  ___/_  /_ __  /_  __ \_ | /| / /
___  ___ |  / _  /   _  __/ _  / / /_/ /_ |/ |/ /
 _/_/  |_/_/  /_/    /_/    /_/  \____/____/|__/
[2020-03-31 19:34:37,533] {{scheduler_job.py:1288}} INFO - Starting the scheduler
[2020-03-31 19:34:37,534] {{scheduler_job.py:1296}} INFO - Running execute loop for -1 seconds
[2020-03-31 19:34:37,534] {{scheduler_job.py:1297}} INFO - Processing each file at most -1 times
[2020-03-31 19:34:37,535] {{scheduler_job.py:1300}} INFO - Searching for files in /usr/local/airflow/dags
[2020-03-31 19:34:38,124] {{scheduler_job.py:1302}} INFO - There are 39 files in /usr/local/airflow/dags
[2020-03-31 19:34:38,124] {{scheduler_job.py:1349}} INFO - Resetting orphaned tasks for active dag runs

This causes the UI to say

The scheduler does not appear to be running. Last heartbeat was received 5 minutes ago

The same thing happens even after restarting the scheduler pod, regardless of the CPU usage.

Any leads to solve this?

mik-laj commented 4 years ago

What database are you using?

abhijit-kottur commented 4 years ago

@mik-laj PostgreSQL. That's running as a pod too.

NiGhtFurRy commented 4 years ago

We are also facing the scheduler stuck issue, which sometimes gets resolved by restarting the scheduler pod. There is no log trace in the scheduler process. We are using Airflow 1.10.9 with Postgres and Redis.

leerobert commented 4 years ago

We're also seeing this same issue and have no idea how to debug it. Airflow 1.10.9 with Postgres / RabbitMQ.

chrismclennon commented 4 years ago

I see a similar issue on 1.10.9 where the scheduler runs fine on start but typically after 10 to 15 days the CPU utilization actually drops to near 0%. The scheduler health check in the webserver does still pass, but no jobs get scheduled. A restart fixes this.

Seeing as I observe a CPU drop instead of a CPU spike, I'm not sure if these are the same issues, but they share symptoms.
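
In case it's useful to others watching this: the health check I mentioned is just polling the webserver's /health endpoint (a rough sketch, assuming a recent 1.10.x webserver that exposes it; host and port are whatever your deployment uses):

```
# Returns JSON with the metadatabase status and the latest scheduler heartbeat.
# Note that a "healthy" scheduler entry here can coexist with a scheduler that
# is stuck, as described above.
curl -s http://localhost:8080/health
```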

gmcoringa commented 4 years ago

I see a similar issue on 1.10.10... there are no logs to indicate the problem. Airflow with MySQL, Redis, and the Celery executor.

PS: we still run the scheduler with the -n 10 argument

chrismclennon commented 4 years ago

I've anecdotally noticed that since I dropped the -n 25 argument from our scheduler invocation, I haven't seen this issue come up. Before, it would crop up every ~10 days or so, and it's been about a month now without incident.
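
For clarity, the change was just the scheduler invocation (a sketch; service wrappers and paths differ per deployment):

```
# before: restart the scheduler loop after 25 runs
airflow scheduler -n 25

# after: let it run indefinitely
airflow scheduler
```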

mik-laj commented 4 years ago

Could someone try to run py-spy when this incident occurs? That may lead us to a solution: it lets us check what code is currently being executed without restarting the application. https://github.com/benfred/py-spy
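
For example, something along these lines from inside the scheduler container (a sketch; the PID is whatever ps shows for the stuck process):

```
pip install py-spy

# one-off stack dump of the stuck process
py-spy dump --pid <scheduler_pid>

# or a live, top-like sampling view
py-spy top --pid <scheduler_pid>
```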

sylr commented 4 years ago
root@airflow-scheduler-5b76d7466f-dxdn2:/usr/local/airflow# ps auxf
USER        PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root       5229  0.5  0.0  19932  3596 pts/0    Ss   13:25   0:00 bash
root       5234  0.0  0.0  38308  3376 pts/0    R+   13:25   0:00  \_ ps auxf
root          1  2.7  0.6 847400 111092 ?       Ssl  12:48   1:01 /usr/local/bin/python /usr/local/bin/airflow scheduler -n -1
root         19  0.7  0.5 480420 86124 ?        S    12:48   0:16 airflow scheduler -- DagFileProcessorManager
root       5179  0.1  0.0      0     0 ?        Z    13:17   0:00  \_ [airflow schedul] <defunct>
root       5180  0.1  0.0      0     0 ?        Z    13:17   0:00  \_ [airflow schedul] <defunct>
root       5135  0.0  0.5 847416 96960 ?        S    13:17   0:00 /usr/local/bin/python /usr/local/bin/airflow scheduler -n -1
root       5136  0.0  0.0      0     0 ?        Z    13:17   0:00 [/usr/local/bin/] <defunct>
Collecting samples from 'airflow scheduler -- DagFileProcessorManager' (python v3.7.8)
Total Samples 3106
GIL: 0.00%, Active: 1.00%, Threads: 1

  %Own   %Total  OwnTime  TotalTime  Function (filename:line)
  1.00%   1.00%   0.200s    0.200s   _send (multiprocessing/connection.py:368)
  0.00%   1.00%   0.000s    0.200s   start (airflow/utils/dag_processing.py:554)
  0.00%   1.00%   0.000s    0.200s   wrapper (airflow/utils/cli.py:75)
  0.00%   1.00%   0.000s    0.200s   _run_processor_manager (airflow/utils/dag_processing.py:624)
  0.00%   1.00%   0.000s    0.200s   run (airflow/jobs/base_job.py:221)
  0.00%   1.00%   0.000s    0.200s   _Popen (multiprocessing/context.py:277)
  0.00%   1.00%   0.000s    0.200s   <module> (airflow:37)
  0.00%   1.00%   0.000s    0.200s   _send_bytes (multiprocessing/connection.py:404)
  0.00%   1.00%   0.000s    0.200s   _launch (multiprocessing/popen_fork.py:74)
  0.00%   1.00%   0.000s    0.200s   scheduler (airflow/bin/cli.py:1040)
  0.00%   1.00%   0.000s    0.200s   send (multiprocessing/connection.py:206)
  0.00%   1.00%   0.000s    0.200s   start (airflow/utils/dag_processing.py:861)
  0.00%   1.00%   0.000s    0.200s   _Popen (multiprocessing/context.py:223)
  0.00%   1.00%   0.000s    0.200s   _execute_helper (airflow/jobs/scheduler_job.py:1415)
  0.00%   1.00%   0.000s    0.200s   _bootstrap (multiprocessing/process.py:297)
  0.00%   1.00%   0.000s    0.200s   _execute (airflow/jobs/scheduler_job.py:1382)
  0.00%   1.00%   0.000s    0.200s   start (multiprocessing/process.py:112)
  0.00%   1.00%   0.000s    0.200s   run (multiprocessing/process.py:99)
  0.00%   1.00%   0.000s    0.200s   __init__ (multiprocessing/popen_fork.py:20)
sylr commented 4 years ago

Happened again today

root@airflow-scheduler-5b76d7466f-9w89s:/usr/local/airflow# py-spy dump --pid=18 --nonblocking
Process 18: airflow scheduler -- DagFileProcessorManager
Python v3.7.8 (/usr/local/bin/python3.7)

Thread 0x7F1E7B360700 (active): "MainThread"
    _send (multiprocessing/connection.py:368)
    _send_bytes (multiprocessing/connection.py:404)
    send (multiprocessing/connection.py:206)
    start (airflow/utils/dag_processing.py:886)
    _run_processor_manager (airflow/utils/dag_processing.py:624)
    run (multiprocessing/process.py:99)
    _bootstrap (multiprocessing/process.py:297)
    _launch (multiprocessing/popen_fork.py:74)
    __init__ (multiprocessing/popen_fork.py:20)
    _Popen (multiprocessing/context.py:277)
    _Popen (multiprocessing/context.py:223)
    start (multiprocessing/process.py:112)
    start (airflow/utils/dag_processing.py:554)
    _execute_helper (airflow/jobs/scheduler_job.py:1415)
    _execute (airflow/jobs/scheduler_job.py:1382)
    run (airflow/jobs/base_job.py:221)
    scheduler (airflow/bin/cli.py:1040)
    wrapper (airflow/utils/cli.py:75)
    <module> (airflow:37)
root@airflow-scheduler-5b76d7466f-9w89s:/usr/local/airflow# py-spy dump --pid=18 --native
Process 18: airflow scheduler -- DagFileProcessorManager
Python v3.7.8 (/usr/local/bin/python3.7)

Thread 18 (idle): "MainThread"
    __write (libpthread-2.24.so)
    _send (multiprocessing/connection.py:368)
    _send_bytes (multiprocessing/connection.py:404)
    send (multiprocessing/connection.py:206)
    start (airflow/utils/dag_processing.py:886)
    _run_processor_manager (airflow/utils/dag_processing.py:624)
    run (multiprocessing/process.py:99)
    _bootstrap (multiprocessing/process.py:297)
    _launch (multiprocessing/popen_fork.py:74)
    __init__ (multiprocessing/popen_fork.py:20)
    _Popen (multiprocessing/context.py:277)
    _Popen (multiprocessing/context.py:223)
    start (multiprocessing/process.py:112)
    start (airflow/utils/dag_processing.py:554)
    _execute_helper (airflow/jobs/scheduler_job.py:1415)
    _execute (airflow/jobs/scheduler_job.py:1382)
    run (airflow/jobs/base_job.py:221)
    scheduler (airflow/bin/cli.py:1040)
    wrapper (airflow/utils/cli.py:75)
    <module> (airflow:37)

@mik-laj does it help ?

sylr commented 4 years ago

Ok, so I have more info. Here is the situation when the scheduler gets stuck:

root@airflow-scheduler-5b76d7466f-9w89s:/usr/local/airflow# ps auxf
USER        PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root       6040  0.0  0.0  19936  3964 pts/0    Ss   20:18   0:00 bash
root       6056  0.0  0.0  38308  3140 pts/0    R+   20:19   0:00  \_ ps auxf
root          1  2.9  0.7 851904 115828 ?       Ssl  Jul30  54:46 /usr/local/bin/python /usr/local/bin/airflow scheduler -n -1
root         18  0.9  0.5 480420 86616 ?        S    Jul30  18:20 airflow scheduler -- DagFileProcessorManager
root       6020  0.1  0.0      0     0 ?        Z    20:08   0:00  \_ [airflow schedul] <defunct>
root       6021  0.1  0.0      0     0 ?        Z    20:08   0:00  \_ [airflow schedul] <defunct>
root       5977  0.0  0.6 851920 100824 ?       S    20:08   0:00 /usr/local/bin/python /usr/local/bin/airflow scheduler -n -1
root       5978  0.0  0.6 851920 100424 ?       S    20:08   0:00 /usr/local/bin/python /usr/local/bin/airflow scheduler -n -1

I managed to revive the scheduler by killing both 5977 & 5978 pids.

root@airflow-scheduler-5b76d7466f-9w89s:/usr/local/airflow# py-spy dump --pid 5977
Process 5977: /usr/local/bin/python /usr/local/bin/airflow scheduler -n -1
Python v3.7.8 (/usr/local/bin/python3.7)

Thread 5977 (idle): "MainThread"
    _flush_std_streams (multiprocessing/util.py:435)
    _bootstrap (multiprocessing/process.py:317)
    _launch (multiprocessing/popen_fork.py:74)
    __init__ (multiprocessing/popen_fork.py:20)
    _Popen (multiprocessing/context.py:277)
    start (multiprocessing/process.py:112)
    _repopulate_pool (multiprocessing/pool.py:241)
    __init__ (multiprocessing/pool.py:176)
    Pool (multiprocessing/context.py:119)
    sync (airflow/executors/celery_executor.py:247)
    heartbeat (airflow/executors/base_executor.py:134)
    _validate_and_run_task_instances (airflow/jobs/scheduler_job.py:1505)
    _execute_helper (airflow/jobs/scheduler_job.py:1443)
    _execute (airflow/jobs/scheduler_job.py:1382)
    run (airflow/jobs/base_job.py:221)
    scheduler (airflow/bin/cli.py:1040)
    wrapper (airflow/utils/cli.py:75)
    <module> (airflow:37)

root@airflow-scheduler-5b76d7466f-9w89s:/usr/local/airflow# py-spy dump --pid 5978
Process 5978: /usr/local/bin/python /usr/local/bin/airflow scheduler -n -1
Python v3.7.8 (/usr/local/bin/python3.7)

Thread 5978 (idle): "MainThread"
    _flush_std_streams (multiprocessing/util.py:435)
    _bootstrap (multiprocessing/process.py:317)
    _launch (multiprocessing/popen_fork.py:74)
    __init__ (multiprocessing/popen_fork.py:20)
    _Popen (multiprocessing/context.py:277)
    start (multiprocessing/process.py:112)
    _repopulate_pool (multiprocessing/pool.py:241)
    __init__ (multiprocessing/pool.py:176)
    Pool (multiprocessing/context.py:119)
    sync (airflow/executors/celery_executor.py:247)
    heartbeat (airflow/executors/base_executor.py:134)
    _validate_and_run_task_instances (airflow/jobs/scheduler_job.py:1505)
    _execute_helper (airflow/jobs/scheduler_job.py:1443)
    _execute (airflow/jobs/scheduler_job.py:1382)
    run (airflow/jobs/base_job.py:221)
    scheduler (airflow/bin/cli.py:1040)
    wrapper (airflow/utils/cli.py:75)
    <module> (airflow:37)
root@airflow-scheduler-5b76d7466f-9w89s:/usr/local/airflow# kill -9 5978
root@airflow-scheduler-5b76d7466f-9w89s:/usr/local/airflow# kill -9 5977
root@airflow-scheduler-5b76d7466f-9w89s:/usr/local/airflow# ps auxf
USER        PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root       6040  0.0  0.0  19936  3964 pts/0    Ss   20:18   0:00 bash
root       6071  0.0  0.0  38308  3176 pts/0    R+   20:21   0:00  \_ ps auxf
root          1  2.9  0.7 851904 115828 ?       Ssl  Jul30  54:46 /usr/local/bin/python /usr/local/bin/airflow scheduler -n -1
root         18  0.9  0.5 480420 86616 ?        S    Jul30  18:20 airflow scheduler -- DagFileProcessorManager
root       6069  0.0  0.5 485184 87268 ?        R    20:21   0:00  \_ airflow scheduler - DagFileProcessor /usr/local/airflow/dags/datafactory-kafka2adls-link-1.py
root       6070  0.0  0.5 483640 85684 ?        R    20:21   0:00  \_ airflow scheduler - DagFileProcessor /usr/local/airflow/dags/datafactory-kafka2adls-sfdc-history-1.py
norwoodj commented 4 years ago

We also have this issue.

Apache Airflow version: 1.10.10

Kubernetes version (if you are using kubernetes) (use kubectl version): v1.14.10-gke.42

Environment:

Cloud provider or hardware configuration: Google Cloud Kubernetes
OS (e.g. from /etc/os-release): "Debian GNU/Linux 10 (buster)"
Kernel (e.g. uname -a): Linux airflow-scheduler-77fc4ff87c-k2td5 4.14.138+ #1 SMP Tue Sep 3 02:58:08 PDT 2019 x86_64 GNU/Linux
Install tools:
Others:

What happened: After running correctly for one to a few hours, the scheduler simply stops scheduling tasks. No errors appear in any Airflow logs (worker and web included). I see CPU go down when it hits the stopping point. We are using Postgres/Redis.

sglickman commented 4 years ago

This is happening to us also. No errors appear in the logs but the scheduler will not create new pods, pipelines stall with tasks in 'queued' state, and the scheduler pod must be deleted in order to get things running again.

pingdink commented 4 years ago

Any fix for this issue yet? Our scheduler has no heartbeat, CPU spikes then drops, and the scheduler only comes back up after 15 minutes. This is slowing our team down a lot.

ashwinshankar77 commented 4 years ago

Hi, this is happening at Slack too. We are using the Celery executor. The scheduler just gets stuck with no trace in the logs, and we're seeing a lot of defunct processes. A restart fixes it. @turbaszek @kaxil @potiuk any ideas what is going on?

msumit commented 4 years ago

We are also facing the same issue with the Airflow 1.10.4 - MySQL - Celery combination. We found that the scheduler's DagFileProcessorManager gets hung and we have to kill it to get the scheduler back.

ashwinshankar77 commented 4 years ago

@msumit I see the exact same symptom. Please let us know if you find something.

sdzharkov commented 4 years ago

We've experienced this issue twice now, with the CPU spiking to 100% and the scheduler failing to schedule any tasks afterwards. Our config is Airflow 1.10.6 - Celery - Postgres running on AWS ECS. I went back into our CloudWatch logs and noticed the following collection of logs at the time the bug occurred:

  | 2020-07-20T07:21:21.346Z | Process DagFileProcessor4357938-Process:
  | 2020-07-20T07:21:21.346Z | Traceback (most recent call last):
  | 2020-07-20T07:21:21.346Z | File "/usr/local/lib/python3.7/logging/__init__.py", line 1029, in emit
  | 2020-07-20T07:21:21.346Z | self.flush()
  | 2020-07-20T07:21:21.346Z | File "/usr/local/lib/python3.7/logging/__init__.py", line 1009, in flush
  | 2020-07-20T07:21:21.346Z | self.stream.flush()
  | 2020-07-20T07:21:21.346Z | OSError: [Errno 28] No space left on device
  | 2020-07-20T07:21:21.346Z | During handling of the above exception, another exception occurred:

This would point to the scheduler running out of disk space, likely due to log buildup (I added log cleanup tasks retroactively). I'm not sure if this is related to the scheduler getting stuck, though.

dlamblin commented 4 years ago

Is disk space everyone's issue? I recall either v1.10.5 or v1.10.6 had some not-fit-for-production-use issue that was fixed in the next version. 1.10.9 has been working okay for us, and importantly -n > -1 is not recommended anymore.

I'm curious whether you could work around it with AIRFLOW__CORE__BASE_LOG_FOLDER=/dev/null (probably not, because it tries to make sub-directories, right?).

In the meantime we have a systemd timer service (or you can use cron) that runs basically this (GNU) find:

find <base_log_dir> -mindepth 2 -type f -mtime +6 -delete -or -type d -empty -delete

E.G.

$ tree -D dir/
dir/
└── [Sep  6 23:10]  dir
    ├── [Sep  6 23:10]  dir
    │   └── [Jan  1  2020]  file.txt
    ├── [Sep  6 23:09]  diry
    └── [Sep  6 23:10]  dirz
        └── [Sep  6 23:10]  file.txt

4 directories, 2 files
$ find dir -mindepth 2 -type f -mtime +6 -delete -or -type d -empty -delete
$ tree -D dir/
dir/
└── [Sep  6 23:13]  dir
    └── [Sep  6 23:10]  dirz
        └── [Sep  6 23:10]  file.txt

2 directories, 1 file
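
As a cron equivalent (a sketch; the log path is an assumption for a default-ish install, use whatever your base_log_folder actually is):

```
# every night at 03:00, prune task log files older than 6 days, then remove empty dirs
0 3 * * * find /usr/local/airflow/logs -mindepth 2 -type f -mtime +6 -delete -or -type d -empty -delete
```
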
msumit commented 4 years ago

All system vitals like disk, CPU, and memory are absolutely fine whenever the hang happens for us. Whenever the process is stuck, it doesn't respond to any kill signals except 9 & 11.

I did an strace on the stuck process; it shows the following: futex(0x14d9390, FUTEX_WAIT_PRIVATE, 0, NULL

Then I killed the process with kill -11, loaded the core in gdb, and below is the stack trace:

(gdb) bt
#0  0x00007fe49b18b49b in raise () from /lib64/libpthread.so.0
#1
#2  0x00007fe49b189adb in do_futex_wait.constprop.1 () from /lib64/libpthread.so.0
#3  0x00007fe49b189b6f in __new_sem_wait_slow.constprop.0 () from /lib64/libpthread.so.0
#4  0x00007fe49b189c0b in sem_wait@@GLIBC_2.2.5 () from /lib64/libpthread.so.0
#5  0x0000000000430bc5 in PyThread_acquire_lock_timed ()
#6  0x0000000000521a4c in acquire_timed ()
#7  0x0000000000521af6 in rlock_acquire ()
#8  0x00000000004826cd in _PyCFunction_FastCallDict ()
#9  0x00000000004f4143 in call_function ()
#10 0x00000000004f7971 in _PyEval_EvalFrameDefault ()
#11 0x00000000004f33c0 in _PyFunction_FastCall ()
#12 0x00000000004f40d6 in call_function ()
#13 0x00000000004f7971 in _PyEval_EvalFrameDefault ()
#14 0x00000000004f33c0 in _PyFunction_FastCall ()
#15 0x00000000004f40d6 in call_function ()
#16 0x00000000004f7971 in _PyEval_EvalFrameDefault ()
#17 0x00000000004f33c0 in _PyFunction_FastCall ()
#18 0x00000000004f40d6 in call_function ()
#19 0x00000000004f7971 in _PyEval_EvalFrameDefault ()
#20 0x00000000004f33c0 in _PyFunction_FastCall ()
#21 0x00000000004f40d6 in call_function ()
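
For anyone who wants to capture the same kind of trace, this is roughly the sequence used above (a sketch; the PID is illustrative, and core dumps must be enabled for gdb to have anything to load):

```
# attach to the stuck scheduler and watch the blocked futex() call
strace -f -p <scheduler_pid>

# allow a core file to be written, then force one with SIGSEGV
ulimit -c unlimited
kill -11 <scheduler_pid>

# load the core in gdb, then print the backtrace with `bt`
gdb /usr/local/bin/python <core_file>
```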

norwoodj commented 3 years ago

If it helps: the last time this happened, with debug logging on, the scheduler logged this (ending.log) before freezing forever and never heartbeating again.

mik-laj commented 3 years ago

https://github.com/apache/airflow/pull/11306 This change improves scheduling process management a little and may help us. Could you check it?

teastburn commented 3 years ago

We are also experiencing a similar issue at Nextdoor with 1.10.12 / Postgres / Celery / AWS ECS. Ours looks much like @sylr's post https://github.com/apache/airflow/issues/7935#issuecomment-667343505 where we have many extra processes spawned that, judging by their program args, appear identical to the scheduler main process, and everything is stuck. However, ours has CPU go to 0 and RAM spike quite high.

teastburn commented 3 years ago

We have a change that correlates (causation is not yet verified) with fixing the issue @sylr mentioned here, where many scheduler main processes spawn at the same time and then disappear (which caused an OOM error for us).

The change was the following:

AIRFLOW__CORE__SQL_ALCHEMY_POOL_SIZE
- 5
+ 11
AIRFLOW__CORE__SQL_ALCHEMY_MAX_OVERFLOW
- 10
+ 30
AIRFLOW__CORE__SQL_ALCHEMY_POOL_RECYCLE
- 3600
+ 1800

And we run MAX_THREADS=10. Is it possible that reaching pool_size or pool_size + max_overflow caused processes to back up or spawn oddly? Before this change, the scheduler was getting stuck 1-2 times per day; we have not seen the issue since making the change 6 days ago.
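
For reference, the same change expressed as environment variables (a sketch; how they get injected depends on your deployment, in our case the ECS task configuration):

```
export AIRFLOW__CORE__SQL_ALCHEMY_POOL_SIZE=11
export AIRFLOW__CORE__SQL_ALCHEMY_MAX_OVERFLOW=30
export AIRFLOW__CORE__SQL_ALCHEMY_POOL_RECYCLE=1800
```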

We do not see the issue of many processes spawning at once anymore like this:

```
$ while true; do pgrep -f 'airflow scheduler' | wc -l; sleep .5; done
39 4 4 4 39 39 39 39 39 5 5 5 5 5 5 5 3 3 3 38 3 3 2 2 2 2 2 37 2 2 2 2 2 2 2 7 2 8 3 8 2 4 3 3 3 3 2 2 2 2 2 2 2 2 4 3 3 3 9 3 3 3 13 3 3 3 17 2 2 2 2 2 2 2 24 2 2 4
```

Can anyone else verify this change helps or not?

dispensable commented 3 years ago

Same issue here with 1.10.12 + RabbitMQ + Celery + k8s. The scheduler keeps logging [2020-10-23 08:10:21,387] {{scheduler_job.py:237}} WARNING - Killing PID 30918, while on the container side [airflow schedul] <defunct> processes generated by airflow scheduler - DagFileProcessor <example_dag.py> show up over and over again. The scheduler just gets stuck and never schedules any tasks.

duyet commented 3 years ago

Have you tried to cat {AIRFLOW_HOME}/logs/dag_processor_manager/dag_processor_manager.log in the scheduler pod? It may help; maybe the DAG processor is timing out.

norwoodj commented 3 years ago

@teastburn we tried those settings and they did not fix things. @duyet I tailed those logs and didn't see anything out of the ordinary; the logs just... stop. It's happening right now, so I can capture any debugging info you'd like: (screenshot: Screen Shot 2020-10-23 at 14 40 03)

We use Airflow here at Cloudflare to run a couple hundred jobs a day, and this has become a major issue for us. It is very difficult for us to downgrade to a version older than 1.10, and ever since we upgraded this has been a persistent and very annoying issue. Every 3-6 hours, every day for the past 4 months, the scheduler just stops running. The only "solution" we've found is to run a cronjob that kills the scheduler pod every 6 hours. That leaves a ton of dangling tasks around; it is not a permanent or even really a workable solution.
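
The cronjob itself is nothing clever, essentially something like this (a sketch; the namespace and label selector are assumptions, use whatever your chart applies to the scheduler pod):

```
# delete the scheduler pod; the Deployment recreates it
kubectl delete pod -n airflow -l component=scheduler
```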

I'm happy to debug as much as possible, I've tried digging into the code myself as well, but I simply don't have the familiarity to figure out what's going wrong without a significant time investment. Any help y'all can give us would be massively appreciated. At this point we're considering dropping airflow. We simply can't continue working with such a flaky platform.

duyet commented 3 years ago

@norwoodj you can also set the log level to DEBUG. I used to get stuck with the Airflow scheduler because DAG file processing times out at 50s by default; I had the same problem with the scheduler stopping work. The Airflow scheduler also has a run_duration setting to restart the scheduler automatically after a given amount of time.

https://github.com/apache/airflow/blob/1.10.12/airflow/config_templates/default_airflow.cfg#L643
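
The relevant 1.10.x knobs, expressed as environment variables (a sketch; the defaults shown are from memory of the 1.10.12 config template, so double-check them against the link above):

```
# seconds a DagFileProcessor may spend importing a single DAG file
export AIRFLOW__CORE__DAGBAG_IMPORT_TIMEOUT=30

# seconds before the manager kills a DagFileProcessor subprocess
export AIRFLOW__CORE__DAG_FILE_PROCESSOR_TIMEOUT=50

# restart the scheduler after this many seconds (-1 = run forever)
export AIRFLOW__SCHEDULER__RUN_DURATION=-1
```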

mik-laj commented 3 years ago
\_ [airflow schedul] <defunct>

This looks like a Python bug. I have already used a workaround in one place to fix a similar problem. I think it would be worth checking whether we have a similar problem in DagFileProcessor. See: https://github.com/apache/airflow/pull/11734

mik-laj commented 3 years ago
__init__ (multiprocessing/pool.py:176)
Pool (multiprocessing/context.py:119)
sync (airflow/executors/celery_executor.py:247)

This change recently improved the use of multiprocessing.Pool in the Celery executor. I think it's also worth checking out: https://github.com/apache/airflow/pull/11278/files

michaelosthege commented 3 years ago

I had ~4800 tasks from the same DAG stuck after a manual reset, with the scheduler just killing PIDs. Turning other DAGs off and increasing DAGBAG_IMPORT_TIMEOUT did not help. Also restarting webserver/scheduler/redis/mysql had no effect.

After setting the "running" dagruns with the stuck tasks to "failed" and then back to "running" in smaller batches the scheduler managed to queue them.

(Airflow 1.10.10 with Celery)

maijh commented 3 years ago
__init__ (multiprocessing/pool.py:176)
Pool (multiprocessing/context.py:119)
sync (airflow/executors/celery_executor.py:247)

This change recently improved the use of Pool in the celery of an executor. I think it's also worth checking out. https://github.com/apache/airflow/pull/11278/files

Can the change linked above solve the problem? @mik-laj

mik-laj commented 3 years ago

@maijh I do not have the capacity to reproduce this bug, but I am sharing tips on what could be causing the bug as I am watching all community activity.

ashb commented 3 years ago

I had ~4800 tasks from the same DAG stuck after a manual reset, with the scheduler just killing PIDs. Turning other DAGs off and increasing DAGBAG_IMPORT_TIMEOUT did not help. Also restarting webserver/scheduler/redis/mysql had no effect.

After setting the "running" dagruns with the stuck tasks to "failed" and then back to "running" in smaller batches the scheduler managed to queue them.

(Airflow 1.10.10 with Celery)

@michaelosthege This behaviour should be fixed in 2.0.0 (now in beta stages) thanks to https://github.com/apache/airflow/pull/10956

norwoodj commented 3 years ago

@dlamblin why was this closed? My read of this most recent comment was that it described a different issue than the one this issue refers to, and @ashb was pointing out that that bug was fixed, not necessarily the underlying one that this issue references.

If this issue is also fixed by that pull request, then great. I just want to be sure this issue isn't being closed by mistake because this is still a huge issue for us.

ashb commented 3 years ago

TBC this issue should be fixed in 2.0.0 as we have massively reworked the Scheduler, but let's leave it open until we have confirmation

maijh commented 3 years ago

TBC this issue should be fixed in 2.0.0 as we have massively reworked the Scheduler, but let's leave it open until we have confirmation

Has this issue been fixed in version 2.0.0, or will it be fixed in a future 2.x.x version? @ashb

ashb commented 3 years ago

I'm hopeful that #10956, which is already merged, will fix this issue; it will be included in 2.0.0.

norwoodj commented 3 years ago

Any idea when 2.0 will be released?

potiuk commented 3 years ago

https://cwiki.apache.org/confluence/display/AIRFLOW/Airflow+2.0+-+Planning

krisdock commented 3 years ago

any confirmation yet on whether this is fixed in 2.0?

ashb commented 3 years ago

Sadly not, not entirely anyway. It should be much better, but there might still be some cases where the scheduler doesn't behave itself. From the reports we've had, though, it's much more binary (either it works, or it never schedules anything from the start) rather than working for a while and then stopping.

BwL1289 commented 3 years ago

Commenting to track this thread.

ashb commented 3 years ago

@BwL1289 (and others) if you are hitting this problem, please can you let me know what versions of Airflow you see this on?

fjmacagno commented 3 years ago

Seeing this on 1.10.14

maijh commented 3 years ago

Seeing this on 1.10.9

milton0825 commented 3 years ago

Seeing this on 1.10.8 with the Celery executor. We are running the scheduler with a run duration of 900 seconds. It would run fine for a couple of days, then suddenly just freeze.

Thread 1 (idle): "MainThread"
    wait (threading.py:295)
    wait (threading.py:551)
    wait (multiprocessing/pool.py:635)
    get (multiprocessing/pool.py:638)
    map (multiprocessing/pool.py:266)
    trigger_tasks (lyft_etl/airflow/executors/lyft_celery_executor.py:90)
    heartbeat (airflow/executors/base_executor.py:130)
    _validate_and_run_task_instances (airflow/jobs/scheduler_job.py:1536)
    _execute_helper (airflow/jobs/scheduler_job.py:1473)
    _execute (airflow/jobs/scheduler_job.py:1412)
    run (airflow/jobs/base_job.py:221)
    scheduler (airflow/bin/cli.py:1117)
    wrapper (airflow/utils/cli.py:75)
    <module> (airflow/bin/airflow:37)
    <module> (airflow:7)
Thread 97067 (idle): "Thread-5667"
    _handle_workers (multiprocessing/pool.py:406)
    run (threading.py:864)
    _bootstrap_inner (threading.py:916)
    _bootstrap (threading.py:884)
Thread 97068 (idle): "Thread-5668"
    wait (threading.py:295)
dimberman commented 3 years ago

@ashb perhaps there is somewhere in the scheduler loop where a race condition can occur? It would be interesting to see this same thread trace on 2.0.