apache / airflow

Apache Airflow - A platform to programmatically author, schedule, and monitor workflows
https://airflow.apache.org/
Apache License 2.0

Scheduler livenessProbe errors on new helm chart #25667

Closed · fernhtls closed 1 year ago

fernhtls commented 2 years ago

Official Helm Chart version

1.6.0 (latest released)

Apache Airflow version

v2.1.2

Kubernetes Version

v1.22.10 (GKE version v1.22.10-gke.600)

Helm Chart configuration

Only the livenessProbe settings were customized, both before and during the issue:

# Airflow scheduler settings
scheduler:
  livenessProbe:
    initialDelaySeconds: 10
    timeoutSeconds: 15
    failureThreshold: 10
    periodSeconds: 60
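
For context, chart 1.6.0 merges these timing values with its built-in probe command (the exact command is tested verbatim below under "What happened"); the rendered probe in the scheduler pod spec should look roughly like this sketch:

livenessProbe:
  initialDelaySeconds: 10
  timeoutSeconds: 15
  failureThreshold: 10
  periodSeconds: 60
  exec:
    command:
      - sh
      - -c
      - |
        CONNECTION_CHECK_MAX_COUNT=0 AIRFLOW__LOGGING__LOGGING_LEVEL=ERROR exec /entrypoint \
        airflow jobs check --job-type SchedulerJob --hostname $(hostname)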

Docker Image customisations

Here's the image we use, based on the official apache/airflow image:

### Main official airflow image
FROM apache/airflow:2.1.2-python3.8

USER root

RUN apt update

RUN echo 'debconf debconf/frontend select Noninteractive' | debconf-set-selections

### Add OS packages here
## GCC compiler in case it's needed for installing python packages
RUN apt install -y -q build-essential

USER airflow

### Changing the default SSL / TLS mode for mysql client to work properly
## https://askubuntu.com/questions/1233186/ubuntu-20-04-how-to-set-lower-ssl-security-level
## https://bugs.launchpad.net/ubuntu/+source/mysql-8.0/+bug/1872541
## https://stackoverflow.com/questions/61649764/mysql-error-2026-ssl-connection-error-ubuntu-20-04
RUN echo $'openssl_conf = default_conf\n\
[default_conf]\n\
ssl_conf = ssl_sect\n\
[ssl_sect]\n\
system_default = ssl_default_sect\n\
[ssl_default_sect]\n\
MinProtocol = TLSv1\n\
CipherString = DEFAULT:@SECLEVEL=1' >> /home/airflow/.openssl.cnf
## OS env var to point to the new openssl.cnf file
ENV OPENSSL_CONF=/home/airflow/.openssl.cnf

### Add airflow providers
RUN pip install apache-airflow-providers-apache-beam
### End airflow providers

### Add extra python packages
RUN pip install python-slugify==3.0.3
### End extra python packages

What happened

After the upgrade to helm chart 1.6.0, the scheduler pod was restarting because the livenessProbe was failing.

Command for the new livenessProbe from helm chart 1.6.0, tested directly on our scheduler pod:

airflow@yc-data-airflow-scheduler-0:/opt/airflow$ CONNECTION_CHECK_MAX_COUNT=0 AIRFLOW__LOGGING__LOGGING_LEVEL=ERROR exec /entrypoint \
> airflow jobs check --job-type SchedulerJob --hostname $(hostname)
No alive jobs found.

Removing the --hostname argument works and a live job is found:

airflow@yc-data-airflow-scheduler-0:/opt/airflow$ CONNECTION_CHECK_MAX_COUNT=0 AIRFLOW__LOGGING__LOGGING_LEVEL=ERROR exec /entrypoint \
> airflow jobs check --job-type SchedulerJob
Found one alive job.

What you think should happen instead

livenessProbe should not error.

How to reproduce

No response

Anything else

No response

Are you willing to submit PR?

Code of Conduct

boring-cyborg[bot] commented 2 years ago

Thanks for opening your first issue here! Be sure to follow the issue template!

ilyadinaburg commented 2 years ago

Hi, I'm facing the same issue as @fernhtls with scheduler and triggerer pods. Running Airflow 2.3.2 on AWS, official helm chart 1.6.0.

Below is the result of the livenessProbe command:

airflow@airflow2-test-us-west-2-scheduler-0:/opt/airflow$ sh -c "CONNECTION_CHECK_MAX_COUNT=0 AIRFLOW__LOGGING__LOGGING_LEVEL=ERROR exec /entrypoint \
airflow jobs check --job-type SchedulerJob --hostname $(hostname)"
No alive jobs found.

If I run it without the hostname, as @fernhtls suggested:

airflow@airflow2-test-us-west-2-scheduler-0:/opt/airflow$ sh -c "CONNECTION_CHECK_MAX_COUNT=0 AIRFLOW__LOGGING__LOGGING_LEVEL=ERROR exec /entrypoint \
airflow jobs check --job-type SchedulerJob"
Found one alive job.

However, if I run it with the IP address of the scheduler pod instead of the hostname, I get:

airflow@airflow2-test-us-west-2-scheduler-0:/opt/airflow$ sh -c "CONNECTION_CHECK_MAX_COUNT=0 AIRFLOW__LOGGING__LOGGING_LEVEL=ERROR exec /entrypoint \
airflow jobs check --job-type SchedulerJob --hostname 172.XXX.XXX.XXX"
Found one alive job.

So I assume this might be related to setting HOSTNAME_CALLABLE in my config:

    - name: AIRFLOW__CORE__HOSTNAME_CALLABLE
      value: 'airflow.utils.net.get_host_ip_address'
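
To make the mismatch concrete, here is a minimal Python sketch (probe_matches is a hypothetical stand-in for the database filter; get_host_ip_address is shown roughly as airflow.utils.net implements it): the scheduler registers its job under whatever hostname_callable returns (the pod IP here), while the chart 1.6.0 probe passes $(hostname), i.e. the pod name, so the exact-match lookup finds no alive job:

import socket

def get_host_ip_address():
    # Roughly what airflow.utils.net.get_host_ip_address does:
    # resolve this host's FQDN to an IP address.
    return socket.gethostbyname(socket.getfqdn())

def probe_matches(registered_hostname, probe_hostname):
    # Hypothetical helper: `airflow jobs check --hostname X` filters
    # jobs by an exact match against the hostname stored when the
    # job registered itself.
    return registered_hostname == probe_hostname

registered = get_host_ip_address()  # e.g. "172.16.0.12" (pod IP)
probe_arg = socket.gethostname()    # e.g. "scheduler-0" (pod name)
print(probe_matches(registered, probe_arg))  # False -> "No alive jobs found."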

Anyway, it would be great to hear from you.

fernhtls commented 2 years ago

hi @ilyadinaburg, I haven't had time to look into the whole issue properly. In my case I'm pushing a custom livenessProbe to the helm chart without the --hostname argument, as below:

scheduler:
  livenessProbe:
    initialDelaySeconds: 10
    timeoutSeconds: 15
    failureThreshold: 10
    periodSeconds: 60
    command:
      - sh
      - -c
      - |
        CONNECTION_CHECK_MAX_COUNT=0 AIRFLOW__LOGGING__LOGGING_LEVEL=ERROR exec /entrypoint \
        airflow jobs check --job-type SchedulerJob

I didn't check the env var AIRFLOW__CORE__HOSTNAME_CALLABLE, as I had to push a quick fix: the scheduler restarts were breaking the execution of DAGs / tasks.

Good that the issue seems to be reproducible, as you're using the same helm chart version as I am (the latest, 1.6.0).

ilyadinaburg commented 2 years ago

hey @fernhtls, I think the issue might not be directly related to AIRFLOW__CORE__HOSTNAME_CALLABLE. However (this may be obvious), per your findings it is related to the pod hostname not being resolved. So as a workaround, $(hostname -i) instead of $(hostname) may be used in the liveness probe command as well.

fernhtls commented 2 years ago

hi @ilyadinaburg, thanks for the info, I'll check your proposal later on. On an HA deployment of the scheduler (statefulset with 2 or more pods), the liveness probe needs to check that the process is up and running only on its own host, so what I did is indeed not the correct / best approach. Let's wait for more info from someone from apache/airflow itself.

alifouzizadeh commented 2 years ago

Hi @fernhtls, what @ilyadinaburg suggested works for me: I changed $(hostname) to $(hostname -i), and it is working. So the code in values.yaml should look like this:

scheduler:
  livenessProbe:
    initialDelaySeconds: 10
    timeoutSeconds: 120
    failureThreshold: 20
    periodSeconds: 60
    command:
      - sh
      - -c
      - |
        CONNECTION_CHECK_MAX_COUNT=0 AIRFLOW__LOGGING__LOGGING_LEVEL=ERROR exec /entrypoint \
        airflow jobs check --job-type SchedulerJob --hostname $(hostname -i)
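
One caveat worth noting on the hostname variants (plain coreutils behaviour, nothing Airflow-specific): depending on the image's /etc/hosts, these can disagree, so it's worth comparing them inside the pod before picking one:

hostname       # pod name, e.g. airflow-scheduler-0
hostname -i    # address(es) the pod name resolves to, e.g. 172.16.0.12
hostname -f    # fully qualified domain name of the pod

Whichever of these matches what AIRFLOW__CORE__HOSTNAME_CALLABLE returns is the one the probe should pass.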

sk-ilya commented 2 years ago

$(hostname -f) worked for us

VladimirYushkevich commented 2 years ago

I have the same issue with airflow 2.4.1, for the liveness check on the dag-processor pod. The default command is not working for me:

dag-processor-ccb9f9949-7zdtv:/opt/airflow$ sh -c "CONNECTION_CHECK_MAX_COUNT=0 AIRFLOW__LOGGING__LOGGING_LEVEL=ERROR exec /entrypoint \
airflow jobs check --hostname $(hostname)"
No alive jobs found.

as well as with $(hostname -i):

dag-processor-ccb9f9949-7zdtv:/opt/airflow$ sh -c "CONNECTION_CHECK_MAX_COUNT=0 AIRFLOW__LOGGING__LOGGING_LEVEL=ERROR exec /entrypoint \
airflow jobs check --hostname $(hostname -i)"
No alive jobs found.

Only after removing the --hostname flag does it work:

dag-processor-ccb9f9949-7zdtv:/opt/airflow$ sh -c "CONNECTION_CHECK_MAX_COUNT=0 AIRFLOW__LOGGING__LOGGING_LEVEL=ERROR exec /entrypoint \
airflow jobs check"
Found one alive job.

csp33 commented 1 year ago

@VladimirYushkevich I believe the Standalone DAG Processor issue is totally different. I have documented my theory in the following issue: https://github.com/apache/airflow/issues/27140

BobDu commented 1 year ago

Maybe we can close this issue; fixed by #24999?

potiuk commented 1 year ago

Can any of the people here verify it? I am closing it provisionally, unless someone reports it's NOT fixed.

eduardchai commented 1 year ago

I am still having this issue in:

Airflow version: 2.4.3
Chart version: 1.7.0

Problematic config:

"AIRFLOW__CORE__HOSTNAME_CALLABLE" = "airflow.utils.net.get_host_ip_address"

kubectl event of the scheduler:

LAST SEEN   TYPE      REASON      OBJECT                                       MESSAGE
2m50s       Warning   Unhealthy   pod/airflow-main-scheduler-589ff44fd-v5c97   Liveness probe failed: No alive jobs found.

I believe this causes Kubernetes to send SIGTERM to the scheduler pod, producing this error:

[2022-12-05T03:31:42.219+0000] {scheduler_job.py:172} INFO - Exiting gracefully upon receiving signal 15
[2022-12-05T03:31:43.222+0000] {process_utils.py:129} INFO - Sending Signals.SIGTERM to group 35. PIDs of all processes in the group: [35]
[2022-12-05T03:31:43.223+0000] {process_utils.py:84} INFO - Sending the signal Signals.SIGTERM to group 35
[2022-12-05T03:31:43.436+0000] {process_utils.py:79} INFO - Process psutil.Process(pid=35, status='terminated', exitcode=0, started='03:26:53') (35) terminated with exit code 0
[2022-12-05T03:31:43.440+0000] {kubernetes_executor.py:823} INFO - Shutting down Kubernetes executor
[2022-12-05T03:31:43.440+0000] {scheduler_job.py:768} ERROR - Exception when executing Executor.end
Traceback (most recent call last):
  File "/home/airflow/.local/lib/python3.9/site-packages/airflow/jobs/scheduler_job.py", line 745, in _execute
    self._run_scheduler_loop()
  File "/home/airflow/.local/lib/python3.9/site-packages/airflow/jobs/scheduler_job.py", line 865, in _run_scheduler_loop
    num_queued_tis = self._do_scheduling(session)
  File "/home/airflow/.local/lib/python3.9/site-packages/airflow/jobs/scheduler_job.py", line 945, in _do_scheduling
    callback_tuples = self._schedule_all_dag_runs(guard, dag_runs, session)
  File "/home/airflow/.local/lib/python3.9/site-packages/airflow/utils/retries.py", line 78, in wrapped_function
    for attempt in run_with_db_retries(max_retries=retries, logger=logger, **retry_kwargs):
  File "/home/airflow/.local/lib/python3.9/site-packages/tenacity/__init__.py", line 384, in __iter__
    do = self.iter(retry_state=retry_state)
  File "/home/airflow/.local/lib/python3.9/site-packages/tenacity/__init__.py", line 351, in iter
    return fut.result()
  File "/usr/local/lib/python3.9/concurrent/futures/_base.py", line 439, in result
    return self.__get_result()
  File "/usr/local/lib/python3.9/concurrent/futures/_base.py", line 391, in __get_result
    raise self._exception
  File "/home/airflow/.local/lib/python3.9/site-packages/airflow/utils/retries.py", line 87, in wrapped_function
    return func(*args, **kwargs)
  File "/home/airflow/.local/lib/python3.9/site-packages/airflow/jobs/scheduler_job.py", line 1234, in _schedule_all_dag_runs
    callback_to_run = self._schedule_dag_run(dag_run, session)
  File "/home/airflow/.local/lib/python3.9/site-packages/airflow/jobs/scheduler_job.py", line 1300, in _schedule_dag_run
    schedulable_tis, callback_to_run = dag_run.update_state(session=session, execute_callbacks=False)
  File "/home/airflow/.local/lib/python3.9/site-packages/airflow/utils/session.py", line 72, in wrapper
    return func(*args, **kwargs)
  File "/home/airflow/.local/lib/python3.9/site-packages/airflow/models/dagrun.py", line 555, in update_state
    info = self.task_instance_scheduling_decisions(session)
  File "/home/airflow/.local/lib/python3.9/site-packages/airflow/utils/session.py", line 72, in wrapper
    return func(*args, **kwargs)
  File "/home/airflow/.local/lib/python3.9/site-packages/airflow/models/dagrun.py", line 673, in task_instance_scheduling_decisions
    tis = self.get_task_instances(session=session, state=State.task_states)
  File "/home/airflow/.local/lib/python3.9/site-packages/airflow/utils/session.py", line 72, in wrapper
    return func(*args, **kwargs)
  File "/home/airflow/.local/lib/python3.9/site-packages/airflow/models/dagrun.py", line 455, in get_task_instances
    return tis.all()
  File "/home/airflow/.local/lib/python3.9/site-packages/sqlalchemy/orm/query.py", line 2759, in all
    return self._iter().all()
  File "/home/airflow/.local/lib/python3.9/site-packages/sqlalchemy/engine/result.py", line 1361, in all
    return self._allrows()
  File "/home/airflow/.local/lib/python3.9/site-packages/sqlalchemy/engine/result.py", line 400, in _allrows
    rows = self._fetchall_impl()
  File "/home/airflow/.local/lib/python3.9/site-packages/sqlalchemy/engine/result.py", line 1274, in _fetchall_impl
    return self._real_result._fetchall_impl()
  File "/home/airflow/.local/lib/python3.9/site-packages/sqlalchemy/engine/result.py", line 1686, in _fetchall_impl
    return list(self.iterator)
  File "/home/airflow/.local/lib/python3.9/site-packages/sqlalchemy/orm/loading.py", line 147, in chunks
    fetch = cursor._raw_all_rows()
  File "/home/airflow/.local/lib/python3.9/site-packages/sqlalchemy/engine/result.py", line 392, in _raw_all_rows
    return [make_row(row) for row in rows]
  File "/home/airflow/.local/lib/python3.9/site-packages/sqlalchemy/engine/result.py", line 392, in <listcomp>
    return [make_row(row) for row in rows]
  File "/home/airflow/.local/lib/python3.9/site-packages/airflow/utils/sqlalchemy.py", line 193, in process
    value['pod_override'] = BaseSerialization.deserialize(pod_override)
  File "/home/airflow/.local/lib/python3.9/site-packages/airflow/serialization/serialized_objects.py", line 476, in deserialize
    pod = PodGenerator.deserialize_model_dict(var)
  File "/home/airflow/.local/lib/python3.9/site-packages/airflow/kubernetes/pod_generator.py", line 437, in deserialize_model_dict
    return api_client._ApiClient__deserialize_model(pod_dict, k8s.V1Pod)
  File "/home/airflow/.local/lib/python3.9/site-packages/kubernetes/client/api_client.py", line 641, in __deserialize_model
    instance = klass(**kwargs)
  File "/home/airflow/.local/lib/python3.9/site-packages/kubernetes/client/models/v1_pod.py", line 60, in __init__
    self._spec = None
  File "/home/airflow/.local/lib/python3.9/site-packages/airflow/jobs/scheduler_job.py", line 175, in _exit_gracefully
    sys.exit(os.EX_OK)
SystemExit: 0

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/airflow/.local/lib/python3.9/site-packages/airflow/jobs/scheduler_job.py", line 766, in _execute
    self.executor.end()
  File "/home/airflow/.local/lib/python3.9/site-packages/airflow/executors/celery_kubernetes_executor.py", line 182, in end
    self.kubernetes_executor.end()
  File "/home/airflow/.local/lib/python3.9/site-packages/airflow/executors/kubernetes_executor.py", line 825, in end
    self._flush_task_queue()
  File "/home/airflow/.local/lib/python3.9/site-packages/airflow/executors/kubernetes_executor.py", line 778, in _flush_task_queue
    self.log.debug('Executor shutting down, task_queue approximate size=%d', self.task_queue.qsize())
  File "<string>", line 2, in qsize
  File "/usr/local/lib/python3.9/multiprocessing/managers.py", line 809, in _callmethod
    conn.send((self._id, methodname, args, kwds))
  File "/usr/local/lib/python3.9/multiprocessing/connection.py", line 211, in send
    self._send_bytes(_ForkingPickler.dumps(obj))
  File "/usr/local/lib/python3.9/multiprocessing/connection.py", line 416, in _send_bytes
    self._send(header + buf)
  File "/usr/local/lib/python3.9/multiprocessing/connection.py", line 373, in _send
    n = write(self._handle, buf)
BrokenPipeError: [Errno 32] Broken pipe
[2022-12-05T03:31:43.445+0000] {process_utils.py:129} INFO - Sending Signals.SIGTERM to group 35. PIDs of all processes in the group: []
[2022-12-05T03:31:43.445+0000] {process_utils.py:84} INFO - Sending the signal Signals.SIGTERM to group 35
[2022-12-05T03:31:43.445+0000] {process_utils.py:98} INFO - Sending the signal Signals.SIGTERM to process 35 as process group is missing.
[2022-12-05T03:31:43.446+0000] {scheduler_job.py:774} INFO - Exited execute loop

BobDu commented 1 year ago

@eduardchai

Use Airflow 2.5.0: the liveness probe command supports the --local argument starting from 2.5.0.
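
For reference, a sketch of the probe command once --local is available (adapted from the commands earlier in this thread; the exact wiring depends on the chart version):

CONNECTION_CHECK_MAX_COUNT=0 AIRFLOW__LOGGING__LOGGING_LEVEL=ERROR exec /entrypoint \
airflow jobs check --job-type SchedulerJob --local

With --local, the check looks for an alive job on the local machine itself, so no --hostname value has to match what hostname_callable returned at registration time.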

eduardchai commented 1 year ago

@BobDu I see... Thank you for the update! I will try 2.5.0.

Subhashini2610 commented 1 year ago

@BobDu I am also facing the same issue. Even upgrading to 2.5.0 didn't help in my case.

eduardchai commented 1 year ago

@Subhashini2610 have you upgraded the chart to 1.7.0? The issue is solved for me using chart 1.7.0 and airflow 2.5.0.

Subhashini2610 commented 1 year ago

@eduardchai I have used 1.8.0-dev version. I can see the --local change in the liveness probe command as well.

devopspsn commented 1 year ago

I have the same issue with chart 1.9.0 and airflow 2.5.3. What did I miss?

potiuk commented 1 year ago

You missed that this issue is closed. If you find no answer here and you still see a similar issue, please open a new one with all logs and information about the circumstances in which you see the problem. Or even better, if you are not sure whether it is an airflow issue, create a discussion and describe it there.

Even if you see something similar, the best you can do when seeking help is to describe your issue in detail so that others can help you.