apache / airflow

Apache Airflow - A platform to programmatically author, schedule, and monitor workflows
https://airflow.apache.org/
Apache License 2.0

Scheduler fails with BrokenPipeError #16298

Closed cccs-cat001 closed 3 years ago

cccs-cat001 commented 3 years ago

Apache Airflow version: 2.1.0

Kubernetes version (if you are using kubernetes) (use kubectl version):

Client Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.0", GitCommit:"cb303e613a121a29364f75cc67d3d580833a7479", GitTreeState:"clean", BuildDate:"2021-04-08T16:31:21Z", GoVersion:"go1.16.1", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.7", GitCommit:"6b3f9b283463c1d5a2455df301182805e65c7145", GitTreeState:"clean", BuildDate:"2021-05-19T22:28:47Z", GoVersion:"go1.15.12", Compiler:"gc", Platform:"linux/amd64"}

Environment:

What happened: Since I launched Airflow 2.1.0 on our cluster on Friday, the scheduler has failed 716 times with a "BrokenPipeError".

[2021-06-07 12:07:19,362] {scheduler_job.py:1205} INFO - Executor reports execution of demo_git_notebook_parameterized.demo_git_notebook_parameterized execution_date=2021-06-07 12:05:41.835167+00:00 exited with status None for try_number 1
[2021-06-07 12:07:22,798] {scheduler_job.py:748} INFO - Exiting gracefully upon receiving signal 15
[2021-06-07 12:07:23,800] {process_utils.py:100} INFO - Sending Signals.SIGTERM to GPID 55
[2021-06-07 12:07:24,154] {process_utils.py:207} INFO - Waiting up to 5 seconds for processes to exit...
[2021-06-07 12:07:24,211] {process_utils.py:207} INFO - Waiting up to 5 seconds for processes to exit...
[2021-06-07 12:07:24,265] {process_utils.py:66} INFO - Process psutil.Process(pid=55, status='terminated', exitcode=0, started='12:02:39') (55) terminated with exit code 0
[2021-06-07 12:07:24,266] {process_utils.py:66} INFO - Process psutil.Process(pid=7433, status='terminated', started='12:07:23') (7433) terminated with exit code None
[2021-06-07 12:07:24,266] {process_utils.py:66} INFO - Process psutil.Process(pid=7432, status='terminated', started='12:07:22') (7432) terminated with exit code None
[2021-06-07 12:07:24,266] {kubernetes_executor.py:759} INFO - Shutting down Kubernetes executor
[2021-06-07 12:07:24,266] {scheduler_job.py:1308} ERROR - Exception when executing Executor.end
Traceback (most recent call last):
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/jobs/scheduler_job.py", line 1286, in _execute
    self._run_scheduler_loop()
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/jobs/scheduler_job.py", line 1400, in _run_scheduler_loop
    time.sleep(min(self._processor_poll_interval, next_event))
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/jobs/scheduler_job.py", line 751, in _exit_gracefully
    sys.exit(os.EX_OK)
SystemExit: 0

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/jobs/scheduler_job.py", line 1306, in _execute
    self.executor.end()
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/executors/kubernetes_executor.py", line 761, in end
    self._flush_task_queue()
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/executors/kubernetes_executor.py", line 714, in _flush_task_queue
    self.log.debug('Executor shutting down, task_queue approximate size=%d', self.task_queue.qsize())
  File "<string>", line 2, in qsize
  File "/usr/local/lib/python3.8/multiprocessing/managers.py", line 834, in _callmethod
    conn.send((self._id, methodname, args, kwds))
  File "/usr/local/lib/python3.8/multiprocessing/connection.py", line 206, in send
    self._send_bytes(_ForkingPickler.dumps(obj))
  File "/usr/local/lib/python3.8/multiprocessing/connection.py", line 411, in _send_bytes
    self._send(header + buf)
  File "/usr/local/lib/python3.8/multiprocessing/connection.py", line 368, in _send
    n = write(self._handle, buf)
BrokenPipeError: [Errno 32] Broken pipe
[2021-06-07 12:07:24,268] {process_utils.py:100} INFO - Sending Signals.SIGTERM to GPID 55
[2021-06-07 12:07:24,268] {scheduler_job.py:1313} INFO - Exited execute loop

What you expected to happen: For it to not do that.

How to reproduce it: I'm not too sure. Could it be an issue with Airflow 2.1.0 itself, so that it can be reproduced just by launching it in a cluster? I'm using the KubernetesExecutor, no Celery. Could it be an issue with Azure?

Anything else we need to know: By my very rough calculations, it happens about every 6 minutes.

jedcunningham commented 3 years ago

Looks like the scheduler is getting a SIGTERM signal. Any hints in the events for the pod (you'll want a recent pod)?

kubectl get event --field-selector involvedObject.name={scheduler_pod_name}

The BrokenPipeError happens after sys.exit(0), so you really need to track down what is constantly sending SIGTERM to your scheduler.
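
To illustrate why the BrokenPipeError is a side effect rather than the root cause, here is a minimal sketch of the same ordering seen in the traceback: a SIGTERM handler calls sys.exit(0), and cleanup code in a finally block then fails while SystemExit is still propagating. This is not Airflow's code. `failing_cleanup` and `run_once` are hypothetical names, and the BrokenPipeError is raised by hand as a stand-in for the executor talking to a helper process that is already gone.

```python
import os
import signal
import sys
import time


def exit_gracefully(signum, frame):
    """Turn SIGTERM into SystemExit(0), like the handler in the traceback."""
    sys.exit(os.EX_OK)


def failing_cleanup():
    """Hypothetical stand-in for cleanup that talks to a dead helper process."""
    raise BrokenPipeError(32, "Broken pipe")


def run_once():
    """Return the cleanup error and the exception that was in flight when it was raised."""
    previous = signal.signal(signal.SIGTERM, exit_gracefully)
    try:
        try:
            # Stand-in for whatever external actor keeps sending SIGTERM.
            signal.raise_signal(signal.SIGTERM)
            # Stand-in for the loop's sleep; the handler interrupts it at the latest.
            time.sleep(0.1)
        finally:
            # Cleanup runs while SystemExit is propagating and fails on its own.
            failing_cleanup()
    except BrokenPipeError as err:
        # err.__context__ is the exception that was being handled: SystemExit(0).
        return err, err.__context__
    finally:
        # Restore the previous handler so the sketch has no lasting side effects.
        signal.signal(signal.SIGTERM, previous)


error, cause = run_once()
print(type(error).__name__, "raised while handling", type(cause).__name__, cause.code)
```

The exception that matters for debugging is the SystemExit caused by the signal; the cleanup error only shows up because the process was already shutting down.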

github-actions[bot] commented 3 years ago

This issue has been automatically marked as stale because it has been open for 30 days with no response from the author. It will be closed in next 7 days if no further activity occurs from the issue author.

saisujith2 commented 3 years ago

@jedcunningham I'm having the same issue with the scheduler. The scheduler pod has an event with the following error:

`P:>kubectl get event --field-selector involvedObject.name=airflow-ml-dev-scheduler-6699c76bb7-xwrjb -n airflow-ml-dev
LAST SEEN   TYPE      REASON             OBJECT                                          MESSAGE
6m44s       Warning   DNSConfigForming   pod/airflow-ml-dev-scheduler-6699c76bb7-xwrjb   Search Line limits were exceeded, some search paths have been omitted, the applied search line is: airflow-ml-dev.svc.cluster.local svc.cluster.local cluster.local pldc.kp.org crdc.kp.org ivdc.kp.org
103s        Warning   Unhealthy          pod/airflow-ml-dev-scheduler-6699c76bb7-xwrjb   (combined from similar events): Liveness probe failed: Could not find platform independent libraries
Could not find platform dependent libraries
Consider setting $PYTHONHOME to [:]
Python path configuration:
  PYTHONHOME = (not set)
  PYTHONPATH = (not set)
  program name = 'python'
  isolated = 0
  environment = 1
  user site = 1
  import site = 1
  sys._base_executable = '/usr/local/bin/python'
  sys.base_prefix = '/tmp/build/80754af9/python_1599203911753/_h_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placeho'
  sys.base_exec_prefix = '/tmp/build/80754af9/python_1599203911753/_h_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placeho'
  sys.executable = '/usr/local/bin/python'
  sys.prefix = '/tmp/build/80754af9/python_1599203911753/_h_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placeho'
  sys.exec_prefix = '/tmp/build/80754af9/python_1599203911753/_h_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placeho'
  sys.path = [
    '/tmp/build/80754af9/python_1599203911753/_h_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placeho/lib/python38.zip',
    '/tmp/build/80754af9/python_1599203911753/_h_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placeho/lib/python3.8',
    '/tmp/build/80754af9/python_1599203911753/_h_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placeho/lib/lib-dynload',
  ]
Fatal Python error: init_fs_encoding: failed to get the Python codec of the filesystem encoding
Python runtime state: core initialized
ModuleNotFoundError: No module named 'encodings'

Current thread 0x00007ff1b4e84740 (most recent call first):
`

potiuk commented 3 years ago

@saisujithkp the error suggests that your Python installation is broken. Most likely your image has some problems, or the virtualenvs created by your deployment have somehow been messed up.

I suggest you use the official Helm chart (https://airflow.apache.org/docs/helm-chart/stable/index.html) and build the image using the official Dockerfile: https://airflow.apache.org/docs/docker-stack/build.html

More info about the "encodings" error is here: https://stackoverflow.com/questions/38132755/importerror-no-module-named-encodings

github-actions[bot] commented 3 years ago

This issue has been closed because it has not received response from the issue author.

Michalos88 commented 1 year ago

I had an identical issue with the scheduler when using the Kubernetes Executor with the Airflow image (2.2.5-python3.8), deployed using the Community Airflow Chart.

I solved the issue by turning off the taskCreationCheck.