Closed: dimberman closed this issue 3 years ago.
I'm running Airflow 1.10.4 with the Celery executor in k8s. The scheduler pod gets stuck while starting up at the step 'Resetting orphaned tasks'.
[2020-03-31 19:34:36,955] {{__init__.py:51}} INFO - Using executor CeleryExecutor
[2020-03-31 19:34:37,533] {{scheduler_job.py:1288}} INFO - Starting the scheduler
[2020-03-31 19:34:37,534] {{scheduler_job.py:1296}} INFO - Running execute loop for -1 seconds
[2020-03-31 19:34:37,534] {{scheduler_job.py:1297}} INFO - Processing each file at most -1 times
[2020-03-31 19:34:37,535] {{scheduler_job.py:1300}} INFO - Searching for files in /usr/local/airflow/dags
[2020-03-31 19:34:38,124] {{scheduler_job.py:1302}} INFO - There are 39 files in /usr/local/airflow/dags
[2020-03-31 19:34:38,124] {{scheduler_job.py:1349}} INFO - Resetting orphaned tasks for active dag runs
This causes the UI to say
The scheduler does not appear to be running. Last heartbeat was received 5 minutes ago
The same thing happens even after restarting the scheduler pod, regardless of CPU usage.
Any leads to solve this?
What database are you using?
@mik-laj PostgreSQL. That's running as a pod too.
We are also facing the scheduler-stuck issue, which sometimes gets resolved by restarting the scheduler pod. There is no log trace in the scheduler process. We are using Airflow 1.10.9 with Postgres and Redis.
We're also seeing this same issue... no idea how to debug. Airflow 1.10.9 with Postgres / RabbitMQ.
I see a similar issue on 1.10.9 where the scheduler runs fine on start but typically after 10 to 15 days the CPU utilization actually drops to near 0%. The scheduler health check in the webserver does still pass, but no jobs get scheduled. A restart fixes this.
Seeing as I observe a CPU drop instead of a CPU spike, I'm not sure if these are the same issues, but they share symptoms.
I see a similar issue on 1.10.10... there are no logs to indicate the problem. Airflow with MySQL, Redis, and the Celery executor.
PS: we still run the scheduler with the argument -n 10
I've anecdotally noticed that since dropping the -n 25 argument from our scheduler invocation, I haven't seen this issue come up. Before, it would crop up every ~10 days or so, and it's been about a month now without incident.
Could someone try to run py-spy when this incident occurs? It may bring us closer to a solution: it lets us check what code is currently being executed without restarting the application. https://github.com/benfred/py-spy
root@airflow-scheduler-5b76d7466f-dxdn2:/usr/local/airflow# ps auxf
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 5229 0.5 0.0 19932 3596 pts/0 Ss 13:25 0:00 bash
root 5234 0.0 0.0 38308 3376 pts/0 R+ 13:25 0:00 \_ ps auxf
root 1 2.7 0.6 847400 111092 ? Ssl 12:48 1:01 /usr/local/bin/python /usr/local/bin/airflow scheduler -n -1
root 19 0.7 0.5 480420 86124 ? S 12:48 0:16 airflow scheduler -- DagFileProcessorManager
root 5179 0.1 0.0 0 0 ? Z 13:17 0:00 \_ [airflow schedul] <defunct>
root 5180 0.1 0.0 0 0 ? Z 13:17 0:00 \_ [airflow schedul] <defunct>
root 5135 0.0 0.5 847416 96960 ? S 13:17 0:00 /usr/local/bin/python /usr/local/bin/airflow scheduler -n -1
root 5136 0.0 0.0 0 0 ? Z 13:17 0:00 [/usr/local/bin/] <defunct>
Collecting samples from 'airflow scheduler -- DagFileProcessorManager' (python v3.7.8)
Total Samples 3106
GIL: 0.00%, Active: 1.00%, Threads: 1
%Own %Total OwnTime TotalTime Function (filename:line)
1.00% 1.00% 0.200s 0.200s _send (multiprocessing/connection.py:368)
0.00% 1.00% 0.000s 0.200s start (airflow/utils/dag_processing.py:554)
0.00% 1.00% 0.000s 0.200s wrapper (airflow/utils/cli.py:75)
0.00% 1.00% 0.000s 0.200s _run_processor_manager (airflow/utils/dag_processing.py:624)
0.00% 1.00% 0.000s 0.200s run (airflow/jobs/base_job.py:221)
0.00% 1.00% 0.000s 0.200s _Popen (multiprocessing/context.py:277)
0.00% 1.00% 0.000s 0.200s <module> (airflow:37)
0.00% 1.00% 0.000s 0.200s _send_bytes (multiprocessing/connection.py:404)
0.00% 1.00% 0.000s 0.200s _launch (multiprocessing/popen_fork.py:74)
0.00% 1.00% 0.000s 0.200s scheduler (airflow/bin/cli.py:1040)
0.00% 1.00% 0.000s 0.200s send (multiprocessing/connection.py:206)
0.00% 1.00% 0.000s 0.200s start (airflow/utils/dag_processing.py:861)
0.00% 1.00% 0.000s 0.200s _Popen (multiprocessing/context.py:223)
0.00% 1.00% 0.000s 0.200s _execute_helper (airflow/jobs/scheduler_job.py:1415)
0.00% 1.00% 0.000s 0.200s _bootstrap (multiprocessing/process.py:297)
0.00% 1.00% 0.000s 0.200s _execute (airflow/jobs/scheduler_job.py:1382)
0.00% 1.00% 0.000s 0.200s start (multiprocessing/process.py:112)
0.00% 1.00% 0.000s 0.200s run (multiprocessing/process.py:99)
0.00% 1.00% 0.000s 0.200s __init__ (multiprocessing/popen_fork.py:20)
Happened again today
root@airflow-scheduler-5b76d7466f-9w89s:/usr/local/airflow# py-spy dump --pid=18 --nonblocking
Process 18: airflow scheduler -- DagFileProcessorManager
Python v3.7.8 (/usr/local/bin/python3.7)
Thread 0x7F1E7B360700 (active): "MainThread"
_send (multiprocessing/connection.py:368)
_send_bytes (multiprocessing/connection.py:404)
send (multiprocessing/connection.py:206)
start (airflow/utils/dag_processing.py:886)
_run_processor_manager (airflow/utils/dag_processing.py:624)
run (multiprocessing/process.py:99)
_bootstrap (multiprocessing/process.py:297)
_launch (multiprocessing/popen_fork.py:74)
__init__ (multiprocessing/popen_fork.py:20)
_Popen (multiprocessing/context.py:277)
_Popen (multiprocessing/context.py:223)
start (multiprocessing/process.py:112)
start (airflow/utils/dag_processing.py:554)
_execute_helper (airflow/jobs/scheduler_job.py:1415)
_execute (airflow/jobs/scheduler_job.py:1382)
run (airflow/jobs/base_job.py:221)
scheduler (airflow/bin/cli.py:1040)
wrapper (airflow/utils/cli.py:75)
<module> (airflow:37)
root@airflow-scheduler-5b76d7466f-9w89s:/usr/local/airflow# py-spy dump --pid=18 --native
Process 18: airflow scheduler -- DagFileProcessorManager
Python v3.7.8 (/usr/local/bin/python3.7)
Thread 18 (idle): "MainThread"
__write (libpthread-2.24.so)
_send (multiprocessing/connection.py:368)
_send_bytes (multiprocessing/connection.py:404)
send (multiprocessing/connection.py:206)
start (airflow/utils/dag_processing.py:886)
_run_processor_manager (airflow/utils/dag_processing.py:624)
run (multiprocessing/process.py:99)
_bootstrap (multiprocessing/process.py:297)
_launch (multiprocessing/popen_fork.py:74)
__init__ (multiprocessing/popen_fork.py:20)
_Popen (multiprocessing/context.py:277)
_Popen (multiprocessing/context.py:223)
start (multiprocessing/process.py:112)
start (airflow/utils/dag_processing.py:554)
_execute_helper (airflow/jobs/scheduler_job.py:1415)
_execute (airflow/jobs/scheduler_job.py:1382)
run (airflow/jobs/base_job.py:221)
scheduler (airflow/bin/cli.py:1040)
wrapper (airflow/utils/cli.py:75)
<module> (airflow:37)
@mik-laj does it help?
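The dumps above show the DagFileProcessorManager's MainThread parked in `multiprocessing/connection.py:_send`. As a minimal sketch (illustrative, not Airflow code), a `Pipe` writer blocks in exactly that frame once the OS pipe buffer fills and the peer stops draining it:

```python
import multiprocessing as mp
import threading
import time

# Sketch: a multiprocessing Pipe send() blocks once the kernel buffer
# is full and the other end never reads, leaving the sender stuck in
# _send -- the same call site the py-spy dumps show.
parent_end, child_end = mp.Pipe()
done = threading.Event()

def writer():
    # ~4 MiB payload overflows the kernel buffer (~64 KiB pipes on
    # Linux), so send() blocks until somebody drains the other end.
    child_end.send(b"x" * (1 << 22))
    done.set()

t = threading.Thread(target=writer, daemon=True)
t.start()
time.sleep(0.5)
was_blocked = not done.is_set()   # writer is still stuck inside _send
parent_end.recv()                 # draining the pipe unblocks it
t.join(timeout=5)
```

In the scheduler's case nothing ever drains the other end, so the manager sits in that frame forever instead of for half a second.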
OK, so I have more info. Here is the situation when the scheduler gets stuck:
root@airflow-scheduler-5b76d7466f-9w89s:/usr/local/airflow# ps auxf
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 6040 0.0 0.0 19936 3964 pts/0 Ss 20:18 0:00 bash
root 6056 0.0 0.0 38308 3140 pts/0 R+ 20:19 0:00 \_ ps auxf
root 1 2.9 0.7 851904 115828 ? Ssl Jul30 54:46 /usr/local/bin/python /usr/local/bin/airflow scheduler -n -1
root 18 0.9 0.5 480420 86616 ? S Jul30 18:20 airflow scheduler -- DagFileProcessorManager
root 6020 0.1 0.0 0 0 ? Z 20:08 0:00 \_ [airflow schedul] <defunct>
root 6021 0.1 0.0 0 0 ? Z 20:08 0:00 \_ [airflow schedul] <defunct>
root 5977 0.0 0.6 851920 100824 ? S 20:08 0:00 /usr/local/bin/python /usr/local/bin/airflow scheduler -n -1
root 5978 0.0 0.6 851920 100424 ? S 20:08 0:00 /usr/local/bin/python /usr/local/bin/airflow scheduler -n -1
I managed to revive the scheduler by killing both PIDs, 5977 and 5978.
root@airflow-scheduler-5b76d7466f-9w89s:/usr/local/airflow# py-spy dump --pid 5977
Process 5977: /usr/local/bin/python /usr/local/bin/airflow scheduler -n -1
Python v3.7.8 (/usr/local/bin/python3.7)
Thread 5977 (idle): "MainThread"
_flush_std_streams (multiprocessing/util.py:435)
_bootstrap (multiprocessing/process.py:317)
_launch (multiprocessing/popen_fork.py:74)
__init__ (multiprocessing/popen_fork.py:20)
_Popen (multiprocessing/context.py:277)
start (multiprocessing/process.py:112)
_repopulate_pool (multiprocessing/pool.py:241)
__init__ (multiprocessing/pool.py:176)
Pool (multiprocessing/context.py:119)
sync (airflow/executors/celery_executor.py:247)
heartbeat (airflow/executors/base_executor.py:134)
_validate_and_run_task_instances (airflow/jobs/scheduler_job.py:1505)
_execute_helper (airflow/jobs/scheduler_job.py:1443)
_execute (airflow/jobs/scheduler_job.py:1382)
run (airflow/jobs/base_job.py:221)
scheduler (airflow/bin/cli.py:1040)
wrapper (airflow/utils/cli.py:75)
<module> (airflow:37)
root@airflow-scheduler-5b76d7466f-9w89s:/usr/local/airflow# py-spy dump --pid 5978
Process 5978: /usr/local/bin/python /usr/local/bin/airflow scheduler -n -1
Python v3.7.8 (/usr/local/bin/python3.7)
Thread 5978 (idle): "MainThread"
_flush_std_streams (multiprocessing/util.py:435)
_bootstrap (multiprocessing/process.py:317)
_launch (multiprocessing/popen_fork.py:74)
__init__ (multiprocessing/popen_fork.py:20)
_Popen (multiprocessing/context.py:277)
start (multiprocessing/process.py:112)
_repopulate_pool (multiprocessing/pool.py:241)
__init__ (multiprocessing/pool.py:176)
Pool (multiprocessing/context.py:119)
sync (airflow/executors/celery_executor.py:247)
heartbeat (airflow/executors/base_executor.py:134)
_validate_and_run_task_instances (airflow/jobs/scheduler_job.py:1505)
_execute_helper (airflow/jobs/scheduler_job.py:1443)
_execute (airflow/jobs/scheduler_job.py:1382)
run (airflow/jobs/base_job.py:221)
scheduler (airflow/bin/cli.py:1040)
wrapper (airflow/utils/cli.py:75)
<module> (airflow:37)
root@airflow-scheduler-5b76d7466f-9w89s:/usr/local/airflow# kill -9 5978
root@airflow-scheduler-5b76d7466f-9w89s:/usr/local/airflow# kill -9 5977
root@airflow-scheduler-5b76d7466f-9w89s:/usr/local/airflow# ps auxf
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 6040 0.0 0.0 19936 3964 pts/0 Ss 20:18 0:00 bash
root 6071 0.0 0.0 38308 3176 pts/0 R+ 20:21 0:00 \_ ps auxf
root 1 2.9 0.7 851904 115828 ? Ssl Jul30 54:46 /usr/local/bin/python /usr/local/bin/airflow scheduler -n -1
root 18 0.9 0.5 480420 86616 ? S Jul30 18:20 airflow scheduler -- DagFileProcessorManager
root 6069 0.0 0.5 485184 87268 ? R 20:21 0:00 \_ airflow scheduler - DagFileProcessor /usr/local/airflow/dags/datafactory-kafka2adls-link-1.py
root 6070 0.0 0.5 483640 85684 ? R 20:21 0:00 \_ airflow scheduler - DagFileProcessor /usr/local/airflow/dags/datafactory-kafka2adls-sfdc-history-1.py
We also have this issue. Apache Airflow version: 1.10.10
Kubernetes version (if you are using kubernetes) (use kubectl version): v1.14.10-gke.42
Environment:
Cloud provider or hardware configuration: Google Cloud Kubernetes
OS (e.g. from /etc/os-release): "Debian GNU/Linux 10 (buster)"
Kernel (e.g. uname -a): Linux airflow-scheduler-77fc4ff87c-k2td5 4.14.138+ #1 SMP Tue Sep 3 02:58:08 PDT 2019 x86_64 GNU/Linux
Install tools:
Others:
What happened: After running correctly for one to a few hours, the scheduler simply stops scheduling tasks. No errors appear in any Airflow logs (worker and web included). I see CPU go down when it hits the stopping point. We are using Postgres/Redis.
This is happening to us also. No errors appear in the logs but the scheduler will not create new pods, pipelines stall with tasks in 'queued' state, and the scheduler pod must be deleted in order to get things running again.
Any fix for this issue yet? Our scheduler has no heartbeat, CPU spikes then drops, and scheduler is back up after 15 minutes. This is slowing our team down a lot.
Hi, this is happening at Slack too. We are using celery executor. The scheduler just gets stuck, no trace in the logs. Seeing a lot of defunct processes. Restart fixes it. @turbaszek @kaxil @potiuk any ideas what is going on?
We are also facing the same issue with the Airflow 1.10.4 / MySQL / Celery combination. We found that the scheduler's DagFileProcessorManager gets hung and we have to kill it to get the scheduler back.
@msumit I see the exact same symptom. Please let us know if you find something.
We've experienced this issue twice now, with the CPU spiking to 100% and failing to schedule any tasks after. Our config is Airflow 1.10.6 - Celery - Postgres
running on AWS ECS. I went back into our Cloudwatch logs and noticed the following collection of logs at the time the bug occurred:
| 2020-07-20T07:21:21.346Z | Process DagFileProcessor4357938-Process:
| 2020-07-20T07:21:21.346Z | Traceback (most recent call last):
| 2020-07-20T07:21:21.346Z | File "/usr/local/lib/python3.7/logging/__init__.py", line 1029, in emit
| 2020-07-20T07:21:21.346Z | self.flush()
| 2020-07-20T07:21:21.346Z | File "/usr/local/lib/python3.7/logging/__init__.py", line 1009, in flush
| 2020-07-20T07:21:21.346Z | self.stream.flush()
| 2020-07-20T07:21:21.346Z | OSError: [Errno 28] No space left on device
| 2020-07-20T07:21:21.346Z | During handling of the above exception, another exception occurred:
Which would point to the scheduler running out of disk space, likely due to log buildup (I added log-cleanup tasks retroactively). I'm not sure if this is related to the scheduler getting stuck, though.
Is disk space everyone's issue? I recall either v1.10.5 or v1.10.6 had some not-fit-for-production issue that was fixed in the next version. 1.10.9 has been working okay for us, and importantly, -n > -1 is not recommended anymore.
I'm curious if you could work around it with AIRFLOW__CORE__BASE_LOG_FOLDER=/dev/null (probably not, because it tries to make sub-directories, right?).
In the meantime we have a systemd timer service (or you use cron) that runs basically (gnu) find:
find <base_log_dir> -mindepth 2 -type f -mtime +6 -delete -or -type d -empty -delete
E.G.
$ tree -D dir/
dir/
└── [Sep 6 23:10] dir
├── [Sep 6 23:10] dir
│ └── [Jan 1 2020] file.txt
├── [Sep 6 23:09] diry
└── [Sep 6 23:10] dirz
└── [Sep 6 23:10] file.txt
4 directories, 2 files
$ find dir -mindepth 2 -type f -mtime +6 -delete -or -type d -empty -delete
$ tree -D dir/
dir/
└── [Sep 6 23:13] dir
└── [Sep 6 23:10] dirz
└── [Sep 6 23:10] file.txt
2 directories, 1 file
All system vitals (disk, CPU, and memory) are absolutely fine whenever the hang happens for us. When the process is stuck, it doesn't respond to any kill signals except 9 and 11.
I did a strace on the stuck process; it shows the following:
futex(0x14d9390, FUTEX_WAIT_PRIVATE, 0, NULL
Then I killed the process with kill -11, loaded the core in gdb, and below is the stack trace:
(gdb) bt
#0  0x00007fe49b18b49b in raise () from /lib64/libpthread.so.0
#1
#2  0x00007fe49b189adb in do_futex_wait.constprop.1 () from /lib64/libpthread.so.0
#3  0x00007fe49b189b6f in __new_sem_wait_slow.constprop.0 () from /lib64/libpthread.so.0
#4  0x00007fe49b189c0b in sem_wait@@GLIBC_2.2.5 () from /lib64/libpthread.so.0
#5  0x0000000000430bc5 in PyThread_acquire_lock_timed ()
#6  0x0000000000521a4c in acquire_timed ()
#7  0x0000000000521af6 in rlock_acquire ()
#8  0x00000000004826cd in _PyCFunction_FastCallDict ()
#9  0x00000000004f4143 in call_function ()
#10 0x00000000004f7971 in _PyEval_EvalFrameDefault ()
#11 0x00000000004f33c0 in _PyFunction_FastCall ()
#12 0x00000000004f40d6 in call_function ()
#13 0x00000000004f7971 in _PyEval_EvalFrameDefault ()
#14 0x00000000004f33c0 in _PyFunction_FastCall ()
#15 0x00000000004f40d6 in call_function ()
#16 0x00000000004f7971 in _PyEval_EvalFrameDefault ()
#17 0x00000000004f33c0 in _PyFunction_FastCall ()
#18 0x00000000004f40d6 in call_function ()
#19 0x00000000004f7971 in _PyEval_EvalFrameDefault ()
#20 0x00000000004f33c0 in _PyFunction_FastCall ()
#21 0x00000000004f40d6 in call_function ()
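That stack (sem_wait under rlock_acquire) is the classic signature of fork() racing a lock held by another thread: the child inherits the lock in its locked state but not the thread that would release it, so any acquire in the child waits forever. A minimal reproduction of the mechanism (illustrative, not Airflow code; the timeout exists only so the demo terminates):

```python
import os
import threading
import time

# Sketch: fork() while another thread holds a lock. The child's copy
# of the lock stays locked, and the holder thread does not exist in
# the child, so an acquire there can never succeed.
lock = threading.Lock()

def holder():
    with lock:
        time.sleep(2)     # keep the lock held across the fork

threading.Thread(target=holder, daemon=True).start()
time.sleep(0.1)           # make sure the holder owns the lock first

pid = os.fork()
if pid == 0:
    # Child: this acquire would hang forever without a timeout.
    got = lock.acquire(timeout=0.3)
    os._exit(0 if got else 42)

_, status = os.waitpid(pid, 0)   # 42: the child never got the lock
```

The CPython logging module's internal RLock is a common victim of exactly this pattern, which fits the `OSError` traceback from the logging handler reported above.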
If it helps, the last time this happened with debug logging on, the scheduler logged this before freezing forever and never heartbeating again: ending.log
https://github.com/apache/airflow/pull/11306 This change improves scheduling process management a little and may help us. Could you check it?
We are also experiencing a similar issue at Nextdoor with 1.10.12 / Postgres / Celery / AWS ECS. Ours looks much like @sylr's post https://github.com/apache/airflow/issues/7935#issuecomment-667343505 where we have many extra processes spawned that by program args appear identical to the scheduler main process, and everything is stuck. However, ours has CPU go to 0 and RAM spike up quite high.
We have a change that correlates (causation is not yet verified) with fixing the issue @sylr mentioned here, where many scheduler main processes spawn at the same time and then disappear (which caused an OOM error for us).
The change was the following:
AIRFLOW__CORE__SQL_ALCHEMY_POOL_SIZE: 5 -> 11
AIRFLOW__CORE__SQL_ALCHEMY_MAX_OVERFLOW: 10 -> 30
AIRFLOW__CORE__SQL_ALCHEMY_POOL_RECYCLE: 3600 -> 1800
And we run MAX_THREADS=10. Is it possible that reaching pool_size or pool_size+max_overflow caused processes to back up or spawn oddly? Before this change, the scheduler was getting stuck 1-2 times per day, now we have not seen this issue since the change 6 days ago.
```
$ while true; do pgrep -f 'airflow scheduler' | wc -l; sleep .5; done
39 4 4 4 39 39 39 39 39 5 5 5 5 5 5 5 3 3 3 38 3 3 2 2 2 2 2 37 2 2 2 2 2 2 2 7 2 8 3 8 2 4 3 3 3 3 2 2 2 2 2 2 2 2 4 3 3 3 9 3 3 3 13 3 3 3 17 2 2 2 2 2 2 2 24 2 2 4
```
Can anyone else verify this change helps or not?
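To see the pool arithmetic above concretely, here is a small sketch (plain SQLAlchemy with scaled-down illustrative values, not the Airflow settings themselves): `pool_size + max_overflow` is a hard cap on checked-out connections, and the next checkout waits `pool_timeout` seconds and then raises.

```python
from sqlalchemy import create_engine
from sqlalchemy.exc import TimeoutError as PoolTimeoutError
from sqlalchemy.pool import QueuePool

# Sketch: a QueuePool allows pool_size persistent connections plus
# max_overflow temporary ones; a checkout beyond that cap blocks for
# pool_timeout seconds and then raises TimeoutError.
engine = create_engine(
    "sqlite://",
    poolclass=QueuePool,   # sqlite would otherwise pick another pool class
    pool_size=2,
    max_overflow=1,
    pool_timeout=1,
)
held = [engine.connect() for _ in range(3)]   # 2 pooled + 1 overflow
try:
    engine.connect()                          # 4th exceeds the cap
    timed_out = False
except PoolTimeoutError:
    timed_out = True
for conn in held:
    conn.close()
```

With the old defaults (5 + 10) and MAX_THREADS=10, a scheduler holding connections across forked children could plausibly sit at that cap, which would match processes backing up as described.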
Same issue here with 1.10.12 + RabbitMQ + Celery + k8s. The scheduler keeps logging [2020-10-23 08:10:21,387] {{scheduler_job.py:237}} WARNING - Killing PID 30918
while the container side shows [airflow schedul] <defunct>
entries generated by airflow scheduler - DagFileProcessor <example_dag.py>
over and over again. The scheduler just gets stuck and never schedules any tasks.
Have you tried to cat the {AIRFLOW_HOME}/logs/dag_processor_manager/dag_processor_manager.log
in the scheduler pod? It may help; maybe the DAG processor is timing out.
@teastburn we tried these settings and it did not fix things. @duyet I tailed those logs and didn't see anything out of the ordinary; the logs just... stop. It's happening now, so I can capture any debugging info you'd like.
We use airflow here at Cloudflare to run a couple hundred jobs a day, and this has become a major issue for us. It is very difficult for us to downgrade to a version older than 1.10 and ever since we upgraded this has been a persistent and very annoying issue. Every 3-6 hours, every day for the past 4 months, the scheduler just stops running. The only "solution" we've found is to run a cronjob that kills the scheduler pod every 6 hours. And that leaves a ton of dangling tasks around, it is not a permanent or even really a workable solution.
I'm happy to debug as much as possible, I've tried digging into the code myself as well, but I simply don't have the familiarity to figure out what's going wrong without a significant time investment. Any help y'all can give us would be massively appreciated. At this point we're considering dropping airflow. We simply can't continue working with such a flaky platform.
@norwoodj you can also set the log level to DEBUG. I used to get stuck with the Airflow scheduler because processing times out at 50s by default.
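For reference, the 50s default mentioned above is the `dag_file_processor_timeout` setting under `[core]` in 1.10.x; a sketch with illustrative values (the numbers below are examples, not recommendations):

```ini
[core]
# How long a single DagFileProcessor may run before it is killed
# (killed processors are what show up as <defunct> above).
# The 1.10.x default is 50 seconds; the value here is illustrative.
dag_file_processor_timeout = 180
# Separate knob: per-file import timeout when filling the DagBag.
dagbag_import_timeout = 120
```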
I got the same problem with the scheduler stopping work. The Airflow scheduler also has the run_duration
option to restart the scheduler automatically after a given amount of time.
https://github.com/apache/airflow/blob/1.10.12/airflow/config_templates/default_airflow.cfg#L643
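As a sketch, `run_duration` lives under `[scheduler]` in `airflow.cfg` on 1.10.x (the option was removed in Airflow 2.0); the value below is illustrative:

```ini
[scheduler]
# Exit the scheduler after this many seconds so a supervisor
# (systemd, k8s, etc.) restarts it; -1 (the default) runs forever.
run_duration = 3600
```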
\_ [airflow schedul] <defunct>
This looks like a Python bug. I have already used a workaround in one place to fix a similar problem. I think it would be worth checking if we do not have a similar problem in DagFileProcessor. See: https://github.com/apache/airflow/pull/11734
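As a minimal illustration (Linux-specific, not Airflow code) of where those `<defunct>` entries come from: a child that exits before its parent wait()s on it remains in the process table as a zombie until it is reaped.

```python
import os
import time

# Sketch: a child exits immediately, but until the parent calls
# waitpid() it stays in the process table as a zombie -- shown as
# <defunct> by ps, and as state 'Z' in /proc (Linux-specific).
pid = os.fork()
if pid == 0:
    os._exit(0)                      # child exits right away

time.sleep(0.2)                      # parent has not reaped it yet
with open(f"/proc/{pid}/stat") as f:
    state = f.read().split()[2]      # 'Z' marks a zombie
os.waitpid(pid, 0)                   # reaping clears the <defunct> entry
```

Zombies cost nothing by themselves, but an accumulation of them indicates the parent loop that should be reaping children is itself wedged.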
__init__ (multiprocessing/pool.py:176) Pool (multiprocessing/context.py:119) sync (airflow/executors/celery_executor.py:247)
This change recently improved the use of Pool in the Celery executor. I think it's also worth checking out. https://github.com/apache/airflow/pull/11278/files
I had ~4800 tasks from the same DAG stuck after a manual reset, with the scheduler just killing PIDs.
Turning other DAGs off and increasing DAGBAG_IMPORT_TIMEOUT
did not help. Restarting the webserver/scheduler/redis/mysql also had no effect.
After setting the "running" dagruns with the stuck tasks to "failed" and then back to "running" in smaller batches the scheduler managed to queue them.
(Airflow 1.10.10 with Celery)
__init__ (multiprocessing/pool.py:176) Pool (multiprocessing/context.py:119) sync (airflow/executors/celery_executor.py:247)
This change recently improved the use of Pool in the Celery executor. I think it's also worth checking out. https://github.com/apache/airflow/pull/11278/files
Can the above link solve the problem? @mik-laj
@maijh I do not have the capacity to reproduce this bug, but I am sharing tips on what could be causing the bug as I am watching all community activity.
I had ~4800 tasks from the same DAG stuck after a manual reset, with the scheduler just killing PIDs. Turning other DAGs off and increasing
DAGBAG_IMPORT_TIMEOUT
did not help. Also restarting webserver/scheduler/redis/mysql had no effect. After setting the "running" dagruns with the stuck tasks to "failed" and then back to "running" in smaller batches, the scheduler managed to queue them.
(Airflow 1.10.10 with Celery)
@michaelosthege This behaviour should be fixed in 2.0.0 (now in beta stages) thanks to https://github.com/apache/airflow/pull/10956
@dlamblin why was this closed? My read of this most recent comment was that it described a different issue than the one this issue refers to, and @ashb was pointing out that that bug was fixed, not necessarily the underlying one that this issue references.
If this issue is also fixed by that pull request, then great. I just want to be sure this issue isn't being closed by mistake because this is still a huge issue for us.
TBC this issue should be fixed in 2.0.0 as we have massively reworked the Scheduler, but let's leave it open until we have confirmation
TBC this issue should be fixed in 2.0.0 as we have massively reworked the Scheduler, but let's leave it open until we have confirmation
Has this issue been fixed in version 2.0.0, or will it be fixed in a future 2.x.x version? @ashb
I'm hopeful that #10956, which is already merged, will fix this issue; it will be included in 2.0.0.
Any idea when 2.0 will be released?
any confirmation yet on whether this is fixed in 2.0?
Sadly not; not entirely, anyway. It should be much better, but there might still be some cases where the scheduler doesn't behave itself. From the reports we've had, though, it's much more binary (either it works, or it never schedules anything from the start) rather than working for a while and then stopping.
Commenting to track this thread.
@BwL1289 (and others) if you are hitting this problem, please can you let me know what versions of Airflow you see this on?
Seeing this on 1.10.14
Seeing this on 1.10.9
Seeing this on 1.10.9
Seeing this on 1.10.8 with the Celery executor. We run the scheduler with a run duration of 900 seconds. It would run fine for a couple of days, then suddenly just freeze.
Thread 1 (idle): "MainThread"
wait (threading.py:295)
wait (threading.py:551)
wait (multiprocessing/pool.py:635)
get (multiprocessing/pool.py:638)
map (multiprocessing/pool.py:266)
trigger_tasks (lyft_etl/airflow/executors/lyft_celery_executor.py:90)
heartbeat (airflow/executors/base_executor.py:130)
_validate_and_run_task_instances (airflow/jobs/scheduler_job.py:1536)
_execute_helper (airflow/jobs/scheduler_job.py:1473)
_execute (airflow/jobs/scheduler_job.py:1412)
run (airflow/jobs/base_job.py:221)
scheduler (airflow/bin/cli.py:1117)
wrapper (airflow/utils/cli.py:75)
<module> (airflow/bin/airflow:37)
<module> (airflow:7)
Thread 97067 (idle): "Thread-5667"
_handle_workers (multiprocessing/pool.py:406)
run (threading.py:864)
_bootstrap_inner (threading.py:916)
_bootstrap (threading.py:884)
Thread 97068 (idle): "Thread-5668"
wait (threading.py:295)
@ashb perhaps there is a race condition somewhere in the scheduler loop? It would be interesting to see this same thread trace on 2.0.
Apache Airflow version:
Kubernetes version (if you are using kubernetes) (use kubectl version):
Environment (uname -a):
What happened: The scheduler gets stuck without a trace or error. When this happens, the CPU usage of the scheduler service is at 100%. No jobs get submitted and everything comes to a halt. It looks like it goes into some kind of infinite loop. The only way I could make it run again is by manually restarting the scheduler service. But again, after running some tasks it gets stuck. I've tried with both Celery and Local executors but the same issue occurs. I am using the -n 3 parameter while starting the scheduler.
Scheduler configs:
job_heartbeat_sec = 5
scheduler_heartbeat_sec = 5
executor = LocalExecutor
parallelism = 32
Please help. I would be happy to provide any other information needed.
What you expected to happen:
How to reproduce it:
Anything else we need to know:
Moved here from https://issues.apache.org/jira/browse/AIRFLOW-401