Thanks for opening your first issue here! Be sure to follow the issue template!
Hi, I'm facing the same issue as @fernhtls with the scheduler and triggerer pods. Running Airflow 2.3.2 on AWS, official helm chart 1.6.0.
Below is the result of the livenessProbe command:
airflow@airflow2-test-us-west-2-scheduler-0:/opt/airflow$ sh -c "CONNECTION_CHECK_MAX_COUNT=0 AIRFLOW__LOGGING__LOGGING_LEVEL=ERROR exec /entrypoint \
airflow jobs check --job-type SchedulerJob --hostname $(hostname)"
No alive jobs found.
If I run it without the hostname, as @fernhtls suggested:
airflow@airflow2-test-us-west-2-scheduler-0:/opt/airflow$ sh -c "CONNECTION_CHECK_MAX_COUNT=0 AIRFLOW__LOGGING__LOGGING_LEVEL=ERROR exec /entrypoint \
airflow jobs check --job-type SchedulerJob"
Found one alive job.
However, if I run it with the IP address of the scheduler pod instead of the hostname, I get:
airflow@airflow2-test-us-west-2-scheduler-0:/opt/airflow$ sh -c "CONNECTION_CHECK_MAX_COUNT=0 AIRFLOW__LOGGING__LOGGING_LEVEL=ERROR exec /entrypoint \
airflow jobs check --job-type SchedulerJob --hostname 172.XXX.XXX.XXX"
Found one alive job.
So I assume this might be related to setting HOSTNAME_CALLABLE in my config:
- name: AIRFLOW__CORE__HOSTNAME_CALLABLE
  value: 'airflow.utils.net.get_host_ip_address'
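For context, a reading of the behavior above (my interpretation, not a quote from the Airflow docs): airflow jobs check --hostname matches against the hostname the job recorded when it registered itself, and that value comes from hostname_callable. With get_host_ip_address, the job row therefore holds the pod IP, which lines up with the three results above:

# hostname_callable = airflow.utils.net.get_host_ip_address -> job registered under the pod IP
airflow jobs check --job-type SchedulerJob --hostname $(hostname)     # pod name: "No alive jobs found."
airflow jobs check --job-type SchedulerJob --hostname $(hostname -i)  # pod IP:   "Found one alive job."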
Anyway, would be great to hear from you.
hi @ilyadinaburg , I didn't have time to check the whole issue properly; in my case I'm pushing a custom livenessProbe to the helm chart without the --hostname argument, as below:
scheduler:
  livenessProbe:
    initialDelaySeconds: 10
    timeoutSeconds: 15
    failureThreshold: 10
    periodSeconds: 60
    command:
      - sh
      - -c
      - |
        CONNECTION_CHECK_MAX_COUNT=0 AIRFLOW__LOGGING__LOGGING_LEVEL=ERROR exec /entrypoint \
        airflow jobs check --job-type SchedulerJob
I didn't check the env var AIRFLOW__CORE__HOSTNAME_CALLABLE, as I had to push a quick fix because the scheduler was restarting and breaking the execution of DAGs / tasks.
Good that the issue seems to be reproducible, as you're using the same latest helm chart version as I am (1.6.0).
hey @fernhtls, I think the issue might not be directly related to AIRFLOW__CORE__HOSTNAME_CALLABLE; this may be obvious, but as per your findings it is related to the pod hostname not being resolved.
So as a workaround, $(hostname -i) may be used instead of $(hostname) in the liveness probe command.
hi @ilyadinaburg , thanks for the info, I'll check on your proposal later on. On an HA deployment for the scheduler (statefulset with 2 pods or more), the liveness probe needs to check whether the process is up and running only on its own host, so what I did is indeed not the correct / best approach. Let's wait for more info from someone from apache/airflow itself.
Hi @fernhtls, what @ilyadinaburg suggested works for me: I changed $(hostname) to $(hostname -i), and it is working.
So the code in the values.yaml should be like this:
scheduler:
  livenessProbe:
    initialDelaySeconds: 10
    timeoutSeconds: 120
    failureThreshold: 20
    periodSeconds: 60
    command:
      - sh
      - -c
      - |
        CONNECTION_CHECK_MAX_COUNT=0 AIRFLOW__LOGGING__LOGGING_LEVEL=ERROR exec /entrypoint \
        airflow jobs check --job-type SchedulerJob --hostname $(hostname -i)
$(hostname -f) worked for us.
I have the same issue with airflow 2.4.1 when checking liveness on the dag-processor's pod. The default is not working for me:
dag-processor-ccb9f9949-7zdtv:/opt/airflow$ sh -c "CONNECTION_CHECK_MAX_COUNT=0 AIRFLOW__LOGGING__LOGGING_LEVEL=ERROR exec /entrypoint \
airflow jobs check --hostname $(hostname)"
No alive jobs found.
as well as with $(hostname -i):
dag-processor-ccb9f9949-7zdtv:/opt/airflow$ sh -c "CONNECTION_CHECK_MAX_COUNT=0 AIRFLOW__LOGGING__LOGGING_LEVEL=ERROR exec /entrypoint \
airflow jobs check --hostname $(hostname -i)"
No alive jobs found.
Only after removing the --hostname flag does it work:
dag-processor-ccb9f9949-7zdtv:/opt/airflow$ sh -c "CONNECTION_CHECK_MAX_COUNT=0 AIRFLOW__LOGGING__LOGGING_LEVEL=ERROR exec /entrypoint \
airflow jobs check"
Found one alive job.
@VladimirYushkevich I believe the Standalone DAG Processor issue is totally different. I have documented my theory in the following issue: https://github.com/apache/airflow/issues/27140
maybe we can close this issue, fixed by #24999?
Can any of the people here verify it? I am closing it provisionally, unless someone reports it's NOT fixed.
I am still having this issue with Airflow version 2.4.3, Chart version 1.7.0.
Problematic config:
"AIRFLOW__CORE__HOSTNAME_CALLABLE" = "airflow.utils.net.get_host_ip_address"
kubectl event of the scheduler:
LAST SEEN TYPE REASON OBJECT MESSAGE
2m50s Warning Unhealthy pod/airflow-main-scheduler-589ff44fd-v5c97 Liveness probe failed: No alive jobs found.
I believe this is causing Kubernetes to send SIGTERM to the scheduler pod, causing this error:
[2022-12-05T03:31:42.219+0000] {scheduler_job.py:172} INFO - Exiting gracefully upon receiving signal 15
[2022-12-05T03:31:43.222+0000] {process_utils.py:129} INFO - Sending Signals.SIGTERM to group 35. PIDs of all processes in the group: [35]
[2022-12-05T03:31:43.223+0000] {process_utils.py:84} INFO - Sending the signal Signals.SIGTERM to group 35
[2022-12-05T03:31:43.436+0000] {process_utils.py:79} INFO - Process psutil.Process(pid=35, status='terminated', exitcode=0, started='03:26:53') (35) terminated with exit code 0
[2022-12-05T03:31:43.440+0000] {kubernetes_executor.py:823} INFO - Shutting down Kubernetes executor
[2022-12-05T03:31:43.440+0000] {scheduler_job.py:768} ERROR - Exception when executing Executor.end
Traceback (most recent call last):
File "/home/airflow/.local/lib/python3.9/site-packages/airflow/jobs/scheduler_job.py", line 745, in _execute
self._run_scheduler_loop()
File "/home/airflow/.local/lib/python3.9/site-packages/airflow/jobs/scheduler_job.py", line 865, in _run_scheduler_loop
num_queued_tis = self._do_scheduling(session)
File "/home/airflow/.local/lib/python3.9/site-packages/airflow/jobs/scheduler_job.py", line 945, in _do_scheduling
callback_tuples = self._schedule_all_dag_runs(guard, dag_runs, session)
File "/home/airflow/.local/lib/python3.9/site-packages/airflow/utils/retries.py", line 78, in wrapped_function
for attempt in run_with_db_retries(max_retries=retries, logger=logger, **retry_kwargs):
File "/home/airflow/.local/lib/python3.9/site-packages/tenacity/__init__.py", line 384, in __iter__
do = self.iter(retry_state=retry_state)
File "/home/airflow/.local/lib/python3.9/site-packages/tenacity/__init__.py", line 351, in iter
return fut.result()
File "/usr/local/lib/python3.9/concurrent/futures/_base.py", line 439, in result
return self.__get_result()
File "/usr/local/lib/python3.9/concurrent/futures/_base.py", line 391, in __get_result
raise self._exception
File "/home/airflow/.local/lib/python3.9/site-packages/airflow/utils/retries.py", line 87, in wrapped_function
return func(*args, **kwargs)
File "/home/airflow/.local/lib/python3.9/site-packages/airflow/jobs/scheduler_job.py", line 1234, in _schedule_all_dag_runs
callback_to_run = self._schedule_dag_run(dag_run, session)
File "/home/airflow/.local/lib/python3.9/site-packages/airflow/jobs/scheduler_job.py", line 1300, in _schedule_dag_run
schedulable_tis, callback_to_run = dag_run.update_state(session=session, execute_callbacks=False)
File "/home/airflow/.local/lib/python3.9/site-packages/airflow/utils/session.py", line 72, in wrapper
return func(*args, **kwargs)
File "/home/airflow/.local/lib/python3.9/site-packages/airflow/models/dagrun.py", line 555, in update_state
info = self.task_instance_scheduling_decisions(session)
File "/home/airflow/.local/lib/python3.9/site-packages/airflow/utils/session.py", line 72, in wrapper
return func(*args, **kwargs)
File "/home/airflow/.local/lib/python3.9/site-packages/airflow/models/dagrun.py", line 673, in task_instance_scheduling_decisions
tis = self.get_task_instances(session=session, state=State.task_states)
File "/home/airflow/.local/lib/python3.9/site-packages/airflow/utils/session.py", line 72, in wrapper
return func(*args, **kwargs)
File "/home/airflow/.local/lib/python3.9/site-packages/airflow/models/dagrun.py", line 455, in get_task_instances
return tis.all()
File "/home/airflow/.local/lib/python3.9/site-packages/sqlalchemy/orm/query.py", line 2759, in all
return self._iter().all()
File "/home/airflow/.local/lib/python3.9/site-packages/sqlalchemy/engine/result.py", line 1361, in all
return self._allrows()
File "/home/airflow/.local/lib/python3.9/site-packages/sqlalchemy/engine/result.py", line 400, in _allrows
rows = self._fetchall_impl()
File "/home/airflow/.local/lib/python3.9/site-packages/sqlalchemy/engine/result.py", line 1274, in _fetchall_impl
return self._real_result._fetchall_impl()
File "/home/airflow/.local/lib/python3.9/site-packages/sqlalchemy/engine/result.py", line 1686, in _fetchall_impl
return list(self.iterator)
File "/home/airflow/.local/lib/python3.9/site-packages/sqlalchemy/orm/loading.py", line 147, in chunks
fetch = cursor._raw_all_rows()
File "/home/airflow/.local/lib/python3.9/site-packages/sqlalchemy/engine/result.py", line 392, in _raw_all_rows
return [make_row(row) for row in rows]
File "/home/airflow/.local/lib/python3.9/site-packages/sqlalchemy/engine/result.py", line 392, in <listcomp>
return [make_row(row) for row in rows]
File "/home/airflow/.local/lib/python3.9/site-packages/airflow/utils/sqlalchemy.py", line 193, in process
value['pod_override'] = BaseSerialization.deserialize(pod_override)
File "/home/airflow/.local/lib/python3.9/site-packages/airflow/serialization/serialized_objects.py", line 476, in deserialize
pod = PodGenerator.deserialize_model_dict(var)
File "/home/airflow/.local/lib/python3.9/site-packages/airflow/kubernetes/pod_generator.py", line 437, in deserialize_model_dict
return api_client._ApiClient__deserialize_model(pod_dict, k8s.V1Pod)
File "/home/airflow/.local/lib/python3.9/site-packages/kubernetes/client/api_client.py", line 641, in __deserialize_model
instance = klass(**kwargs)
File "/home/airflow/.local/lib/python3.9/site-packages/kubernetes/client/models/v1_pod.py", line 60, in __init__
self._spec = None
File "/home/airflow/.local/lib/python3.9/site-packages/airflow/jobs/scheduler_job.py", line 175, in _exit_gracefully
sys.exit(os.EX_OK)
SystemExit: 0
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/airflow/.local/lib/python3.9/site-packages/airflow/jobs/scheduler_job.py", line 766, in _execute
self.executor.end()
File "/home/airflow/.local/lib/python3.9/site-packages/airflow/executors/celery_kubernetes_executor.py", line 182, in end
self.kubernetes_executor.end()
File "/home/airflow/.local/lib/python3.9/site-packages/airflow/executors/kubernetes_executor.py", line 825, in end
self._flush_task_queue()
File "/home/airflow/.local/lib/python3.9/site-packages/airflow/executors/kubernetes_executor.py", line 778, in _flush_task_queue
self.log.debug('Executor shutting down, task_queue approximate size=%d', self.task_queue.qsize())
File "<string>", line 2, in qsize
File "/usr/local/lib/python3.9/multiprocessing/managers.py", line 809, in _callmethod
conn.send((self._id, methodname, args, kwds))
File "/usr/local/lib/python3.9/multiprocessing/connection.py", line 211, in send
self._send_bytes(_ForkingPickler.dumps(obj))
File "/usr/local/lib/python3.9/multiprocessing/connection.py", line 416, in _send_bytes
self._send(header + buf)
File "/usr/local/lib/python3.9/multiprocessing/connection.py", line 373, in _send
n = write(self._handle, buf)
BrokenPipeError: [Errno 32] Broken pipe
[2022-12-05T03:31:43.445+0000] {process_utils.py:129} INFO - Sending Signals.SIGTERM to group 35. PIDs of all processes in the group: []
[2022-12-05T03:31:43.445+0000] {process_utils.py:84} INFO - Sending the signal Signals.SIGTERM to group 35
[2022-12-05T03:31:43.445+0000] {process_utils.py:98} INFO - Sending the signal Signals.SIGTERM to process 35 as process group is missing.
[2022-12-05T03:31:43.446+0000] {scheduler_job.py:774} INFO - Exited execute loop
@eduardchai use airflow 2.5.0; the liveness cmd arg --local is supported starting in 2.5.0.
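For anyone landing here later, a minimal sketch of the probe command on 2.5.0+ with --local (my phrasing of the flag's effect, not quoted from the docs):

# Airflow >= 2.5.0: --local limits the check to jobs registered from this host,
# so the probe no longer has to compute a --hostname value that matches hostname_callable.
sh -c "CONNECTION_CHECK_MAX_COUNT=0 AIRFLOW__LOGGING__LOGGING_LEVEL=ERROR exec /entrypoint \
airflow jobs check --job-type SchedulerJob --local"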
@BobDu I see... Thank you for the update! I will try 2.5.0.
@BobDu I am also facing the same issue. Even upgrading to 2.5.0 didn't help in my case.
@Subhashini2610 have you upgraded the chart to 1.7.0? The issue is solved for me using chart 1.7.0 and airflow 2.5.0.
@eduardchai I have used 1.8.0-dev version. I can see the --local change in the liveness probe command as well.
I have the same issue with chart 1.9.0 and airflow 2.5.3. What did I miss?
You missed that this issue is closed. If you find no answer here and you still see a similar issue, please open a new one with all logs and information about the circumstances in which you see the problem. Or even better: if you are not sure whether it is an airflow issue, create a discussion and describe it there.
Even if you see something similar, the best you can do if you seek help is to describe your issue in detail so that others can help you.
Official Helm Chart version
1.6.0 (latest released)
Apache Airflow version
v2.1.2
Kubernetes Version
v1.22.10 (GKE version v1.22.10-gke.600)
Helm Chart configuration
Only livenessProbe config before and during the issue:
Docker Image customisations
Here's the image we use based on the apache airflow image:
What happened
After the upgrade to helm chart 1.6.0, the scheduler POD was restarting as the livenessProbe was failing.
The command for the new livenessProbe from helm chart 1.6.0 was tested directly on our scheduler POD; removing the --hostname argument works and an alive job is found (both commands are reproduced below):
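For reference, these are the two invocations described above; they are the same commands quoted at the top of this thread, not new output:

# Chart 1.6.0 default probe command -- fails on this setup:
sh -c "CONNECTION_CHECK_MAX_COUNT=0 AIRFLOW__LOGGING__LOGGING_LEVEL=ERROR exec /entrypoint \
airflow jobs check --job-type SchedulerJob --hostname $(hostname)"

# Without --hostname -- an alive job is found:
sh -c "CONNECTION_CHECK_MAX_COUNT=0 AIRFLOW__LOGGING__LOGGING_LEVEL=ERROR exec /entrypoint \
airflow jobs check --job-type SchedulerJob"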
What you think should happen instead
The livenessProbe should not error.
How to reproduce
No response
Anything else
No response
Are you willing to submit PR?
Code of Conduct