Error in KubernetesPodOperator while fetching logs from kube API #21727

Closed chengzi0103 closed 2 years ago

chengzi0103 commented 2 years ago

Apache Airflow version

2.2.3 (latest released)

What happened

Airflow logs an error when running multiple Kubernetes pods.

[2022-02-22, 07:36:47 UTC] {pod_manager.py:163} WARNING - Pod not yet started: ivol.78b6e382f2204fe3a8da9081cb468acb
[2022-02-22, 07:46:48 UTC] {pod_manager.py:205} WARNING - Failed to read logs for pod ivol.78b6e382f2204fe3a8da9081cb468acb
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/urllib3/response.py", line 697, in _update_chunk_length
    self.chunk_left = int(line, 16)
ValueError: invalid literal for int() with base 16: b''

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/urllib3/response.py", line 438, in _error_catcher
  File "/usr/local/lib/python3.8/site-packages/urllib3/response.py", line 764, in read_chunked
  File "/usr/local/lib/python3.8/site-packages/urllib3/response.py", line 701, in _update_chunk_length
    raise InvalidChunkLength(self, line)
urllib3.exceptions.InvalidChunkLength: InvalidChunkLength(got length b'', 0 bytes read)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/airflow/providers/cncf/kubernetes/utils/pod_manager.py", line 201, in follow_logs
    for line in logs:
  File "/usr/local/lib/python3.8/site-packages/urllib3/response.py", line 808, in __iter__
    for chunk in self.stream(decode_content=True):
  File "/usr/local/lib/python3.8/site-packages/urllib3/response.py", line 572, in stream
    for line in self.read_chunked(amt, decode_content=decode_content):
  File "/usr/local/lib/python3.8/site-packages/urllib3/response.py", line 793, in read_chunked
  File "/usr/local/lib/python3.8/contextlib.py", line 131, in __exit__
    self.gen.throw(type, value, traceback)
  File "/usr/local/lib/python3.8/site-packages/urllib3/response.py", line 455, in _error_catcher
    raise ProtocolError("Connection broken: %r" % e, e)
urllib3.exceptions.ProtocolError: ("Connection broken: InvalidChunkLength(got length b'', 0 bytes read)", InvalidChunkLength(got length b'', 0 bytes read))
[2022-02-22, 07:46:48 UTC] {pod_manager.py:218} WARNING - Pod ivol.78b6e382f2204fe3a8da9081cb468acb log read interrupted but container base still running
[2022-02-22, 07:52:04 UTC] {pod_manager.py:218} WARNING - Pod ivol.78b6e382f2204fe3a8da9081cb468acb log read interrupted but container base still running
[2022-02-22, 07:52:05 UTC] {pod_manager.py:218} WARNING - Pod ivol.78b6e382f2204fe3a8da9081cb468acb log read interrupted but container base still running
[2022-02-22, 07:52:06 UTC] {kubernetes_pod.py:417} INFO - Deleting pod: ivol.78b6e382f2204fe3a8da9081cb468acb
[2022-02-22, 07:52:07 UTC] {taskinstance.py:1700} ERROR - Task failed with exception

What you expected to happen

How can I fix this?

How to reproduce

I have three machines. Machine A runs the Airflow webserver and scheduler; machines B and C run Celery workers. All operators run on the k8s cluster through k8s_config.

As soon as I run multiple tasks, the error above shows up in the task log. I don't know how to solve it.

Operating System

Debian GNU/Linux 11

Versions of Apache Airflow Providers

alembic 1.7.5 amqp 5.0.9 anyio 3.5.0 apache-airflow 2.2.3 apache-airflow-providers-celery 2.1.0 apache-airflow-providers-cncf-kubernetes 3.0.2 apache-airflow-providers-docker 2.4.1 apache-airflow-providers-ftp 2.0.1 apache-airflow-providers-http 2.0.3 apache-airflow-providers-imap 2.2.0 apache-airflow-providers-sqlite 2.1.0 apispec 3.3.2 argcomplete 1.12.3 attrs 20.3.0 Babel 2.9.1 billiard bleach 4.1.0 blinker 1.4 cachetools 5.0.0 cattrs 1.6.0 celery 5.2.2 certifi 2021.10.8 cffi 1.15.0 charset-normalizer 2.0.12 click 8.0.4 click-didyoumean 0.3.0 click-plugins 1.1.1 click-repl 0.2.0 clickclick 20.10.2 colorama 0.4.4 colorlog 5.0.1 commonmark 0.9.1 coverage 6.3.2 croniter 1.0.15 cryptography 36.0.1 datacompy 0.7.3 defusedxml 0.7.1 dill 0.3.4 dnspython 2.2.0 docker 5.0.3 docutils 0.16 email-validator 1.1.3 Flask 1.1.2 Flask-AppBuilder 3.4.4 Flask-Babel 2.0.0 Flask-Caching 1.10.1 Flask-JWT-Extended 3.25.1 Flask-Login 0.4.1 Flask-OpenID 1.3.0 Flask-SQLAlchemy 2.5.1 Flask-WTF 0.14.3 flower 1.0.0 gevent 21.12.0 google-auth 2.6.0 graphviz 0.19.1 greenlet 1.1.2 gunicorn 20.1.0 h11 0.12.0 httpcore 0.13.7 httpx 0.19.0 humanize 4.0.0 idna 3.3 importlib-metadata 4.11.1 importlib-resources 5.4.0 inflection 0.5.1 iniconfig 1.1.1 iso8601 1.0.2 itsdangerous 1.1.0 jeepney 0.7.1 Jinja2 3.0.3 jsonschema 3.2.0 keyring 23.5.0 kombu 5.2.3 kubernetes 22.6.0 lazy-object-proxy 1.7.1 lockfile 0.12.2 Mako 1.1.6 Markdown 3.3.6 MarkupSafe 2.1.0 marshmallow 3.14.1 marshmallow-enum 1.5.1 marshmallow-oneofschema 3.0.1 marshmallow-sqlalchemy 0.26.1 numexpr 2.8.1 numpy 1.22.2 oauthlib 3.2.0 openapi-schema-validator 0.2.3 openapi-spec-validator 0.4.0 ordered-set 4.1.0 packaging 21.3 pandas 1.3.5 pendulum 2.1.2 pip 22.0.3 pkginfo 1.8.2 pluggy 1.0.0 prettytable 3.1.1 prison 0.2.1 prometheus-client 0.13.1 prompt-toolkit 3.0.28 psutil 5.9.0 psycopg2-binary 2.9.3 py 1.11.0 pyasn1 0.4.8 pyasn1-modules 0.2.8 pycparser 2.21 Pygments 2.11.2 PyJWT 1.7.1 pyparsing 3.0.7 pyrsistent 0.18.1 pytest 7.0.1 pytest-cov 
3.0.0 python-daemon 2.3.0 python-dateutil 2.8.2 python-nvd3 0.15.0 python-slugify 4.0.1 python3-openid 3.2.0 pytz 2021.3 pytzdata 2020.1 PyYAML 6.0 readme-renderer 32.0 redis 3.5.3 requests 2.27.1 requests-oauthlib 1.3.1 requests-toolbelt 0.9.1 rfc3986 1.5.0 rich 11.2.0 rsa 4.8 SecretStorage 3.3.1 semantic-version 2.9.0 setproctitle 1.2.2 setuptools 59.0.1 setuptools-rust 1.1.2 six 1.16.0 sniffio 1.2.0 SQLAlchemy 1.4.31 SQLAlchemy-JSONField 1.0.0 SQLAlchemy-Utils 0.38.2 swagger-ui-bundle 0.0.9 tables 3.6.1 tabulate 0.8.9 tenacity 8.0.1 termcolor 1.1.0 text-unidecode 1.3 tomli 2.0.1 tornado 6.1 tqdm 4.62.3 twine 3.8.0 typing_extensions 4.1.1 unicodecsv 0.14.1 urllib3 1.26.8 vine 5.0.0 wcwidth 0.2.5 webencodings 0.5.1 websocket 0.2.1 websocket-client 1.2.3 Werkzeug 1.0.1 wheel 0.37.0 WTForms 2.3.3 zipp 3.7.0 zope.event 4.5.0 zope.interface 5.4.0



Deployment details

I have three machines. Machine A runs the Airflow webserver and scheduler; machines B and C run Celery workers. All operators run on the k8s cluster through k8s_config.

Anything else

uranusjr commented 2 years ago

Looks similar to #12136 and #15990.

chengzi0103 commented 2 years ago

@uranusjr I noticed that @raphaelauv does not have this problem when using an apache-airflow-providers-cncf-kubernetes version greater than 3.0.0, but my version is 3.0.2, so I don't know where the problem is.
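One way to confirm which versions are actually installed on each worker is to query the package metadata from the same interpreter that runs Airflow. This is only a sketch: `installed_versions` is a hypothetical helper name, and the package list is taken from the pip listing in this report.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Return {package name: installed version, or None if not installed}."""
    result = {}
    for name in packages:
        try:
            result[name] = version(name)
        except PackageNotFoundError:
            # The package is missing from this interpreter's environment.
            result[name] = None
    return result


if __name__ == "__main__":
    # Packages relevant to this issue; extend the list as needed.
    relevant = [
        "apache-airflow",
        "apache-airflow-providers-cncf-kubernetes",
        "kubernetes",
        "urllib3",
    ]
    for name, found in installed_versions(relevant).items():
        print(f"{name}: {found or 'not installed'}")
```

Running it on machines B and C separately would show whether the Celery workers have diverged from each other.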

chengzi0103 commented 2 years ago

download = KubernetesPodOperator(
    task_id='X12',
    is_delete_operator_pod=True,
    get_logs=True,
    image=images_name,
    namespace=name_space,
    name=f'download_api',
    cmds=['python cmd'],
    arguments=[f"--account={config.dags['daily_01_download_process']['account']}"],
    in_cluster=False,
    volumes=[get_k8s_pod_mount_volume_of_host(mount_local_path)],
    volume_mounts=[get_k8s_pod_mount_volume_of_worker(remote_path), ],
)

potiuk commented 2 years ago

Please downgrade your kubernetes library to 11.0.0 (you have kubernetes 22.6.0), @chengzi0103. You apparently did not use constraints when you installed Airflow and the providers. We are adding a protection to make it harder to upgrade to an incompatible version of the kubernetes library, and we have just yanked the 3.0.2 cncf.kubernetes provider for people who did so. Using constraints as described in https://airflow.apache.org/docs/apache-airflow/stable/installation/installing-from-pypi.html is the best way to get a stable Airflow installation (it is the only officially supported way of installing Airflow).
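For reference, a constrained install pins every dependency to the set tested for a given Airflow and Python version. The sketch below only builds the constraints URL and the matching pip command; the helper names are hypothetical, the URL pattern is the one given in the installation page linked above, and the extras shown are an assumption based on the providers listed in this report.

```python
def constraints_url(airflow_version: str, python_version: str) -> str:
    """Build the URL of the published constraints file for one Airflow release."""
    return (
        "https://raw.githubusercontent.com/apache/airflow/"
        f"constraints-{airflow_version}/constraints-{python_version}.txt"
    )


def pip_install_command(airflow_version, python_version, extras=()):
    """Return the pip argument list for a constrained Airflow install."""
    extras_part = f"[{','.join(extras)}]" if extras else ""
    return [
        "pip",
        "install",
        f"apache-airflow{extras_part}=={airflow_version}",
        "--constraint",
        constraints_url(airflow_version, python_version),
    ]


if __name__ == "__main__":
    # Versions taken from this report: Airflow 2.2.3 on Python 3.8.
    command = pip_install_command("2.2.3", "3.8", extras=("celery", "cncf.kubernetes"))
    print(" ".join(command))
```

The printed command can be run as-is on each machine so that the webserver, scheduler, and workers end up with the same dependency set.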

dstandish commented 2 years ago

Hi @chengzi0103, it appears that your task likely failed for reasons unrelated to the traceback shown.

Sometimes the log read is interrupted by a connection issue. In that case we catch the error and resume logging, and that is what the traceback is about. Note that it is only a warning, that the logs later resume, and that the task doesn't fail for roughly another 5 minutes (07:46:48 to 07:52:07 in the log above).

Your issue report inspired us to move that traceback to the DEBUG level in https://github.com/apache/airflow/pull/22595, so as not to cause false alarm or confusion.
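The catch-and-resume behavior described above can be sketched in isolation. This is not the provider's actual `pod_manager` code: `ConnectionBroken` is a hypothetical stand-in for a transport error such as `urllib3.exceptions.ProtocolError`, `open_stream` and `is_running` are assumed callables, and resuming by line count is a simplification chosen to keep the example self-contained.

```python
import logging

log = logging.getLogger(__name__)


class ConnectionBroken(Exception):
    """Hypothetical stand-in for a transport error raised while streaming logs."""


def follow_with_resume(open_stream, is_running, max_retries=3):
    """Yield log lines, reopening the stream if it breaks while the container runs.

    open_stream(skip) returns an iterator over the lines after the first `skip`;
    is_running() reports whether the container is still alive.
    """
    seen = 0
    retries = 0
    while True:
        try:
            for line in open_stream(seen):
                seen += 1
                yield line
            return  # the stream ended normally
        except ConnectionBroken:
            if retries >= max_retries or not is_running():
                log.warning("Log read interrupted; giving up after %d lines", seen)
                return
            retries += 1
            # Keep the noisy traceback at DEBUG so it does not look like a failure.
            log.debug("Log read interrupted; resuming", exc_info=True)


def make_flaky_stream(lines, fail_at):
    """Build a test stream that breaks once, just before index `fail_at`."""
    calls = []

    def open_stream(skip):
        calls.append(skip)
        for index in range(skip, len(lines)):
            if len(calls) == 1 and index == fail_at:
                raise ConnectionBroken("connection broken")
            yield lines[index]

    return open_stream, calls


if __name__ == "__main__":
    stream, calls = make_flaky_stream(["step 1", "step 2", "step 3", "done"], fail_at=2)
    for line in follow_with_resume(stream, is_running=lambda: True):
        print(line)
```

The point of the design is that an interrupted read is treated as recoverable while the container is still running, so the traceback alone says nothing about whether the task itself succeeded.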

Yes my setting is

Thank you for your answers. I will try using a larger cluster with more resources later so that this problem occurs less often, and I will follow your suggestions. Thanks again.