apache / airflow

Apache Airflow - A platform to programmatically author, schedule, and monitor workflows
https://airflow.apache.org/
Apache License 2.0

InvalidChunkLength(got length b'', 0 bytes read) issue in Airflow scheduler using the Kubernetes executor #41753

Closed. shubham-vonage closed this issue 3 weeks ago.

shubham-vonage commented 2 months ago

Apache Airflow Provider(s)

cncf-kubernetes

Versions of Apache Airflow Providers

apache-airflow-providers-cncf-kubernetes==8.0.1
kubernetes==29.0.0
kubernetes_asyncio==29.0.0

Kubernetes version:
Client Version: v1.28.0
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.28.9

Apache Airflow version

2.8.3

Operating System

Linux

Deployment

Official Apache Airflow Helm Chart

Deployment details

No response

What happened

I have deployed Airflow in my cluster using the official Helm chart. Everything seems to be working fine, but when the scheduler is idle it produces this error:

ERROR - Unknown error in KubernetesJobWatcher. Failing
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/urllib3/response.py", line 761, in _update_chunk_length
    self.chunk_left = int(line, 16)
ValueError: invalid literal for int() with base 16: b''

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/urllib3/response.py", line 444, in _error_catcher
    yield
  File "/usr/local/lib/python3.9/site-packages/urllib3/response.py", line 828, in read_chunked
    self._update_chunk_length()
  File "/usr/local/lib/python3.9/site-packages/urllib3/response.py", line 765, in _update_chunk_length
    raise InvalidChunkLength(self, line)
urllib3.exceptions.InvalidChunkLength: InvalidChunkLength(got length b'', 0 bytes read)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/airflow/providers/cncf/kubernetes/executors/kubernetes_executor_utils.py", line 112, in run
    self.resource_version = self._run(
  File "/usr/local/lib/python3.9/site-packages/airflow/providers/cncf/kubernetes/executors/kubernetes_executor_utils.py", line 168, in _run
    for event in self._pod_events(kube_client=kube_client, query_kwargs=kwargs):
  File "/usr/local/lib/python3.9/site-packages/kubernetes/watch/watch.py", line 178, in stream
    for line in iter_resp_lines(resp):
  File "/usr/local/lib/python3.9/site-packages/kubernetes/watch/watch.py", line 56, in iter_resp_lines
    for segment in resp.stream(amt=None, decode_content=False):
  File "/usr/local/lib/python3.9/site-packages/urllib3/response.py", line 624, in stream
    for line in self.read_chunked(amt, decode_content=decode_content):
  File "/usr/local/lib/python3.9/site-packages/urllib3/response.py", line 857, in read_chunked
    self._original_response.close()
  File "/usr/local/lib/python3.9/contextlib.py", line 137, in __exit__
    self.gen.throw(typ, value, traceback)
  File "/usr/local/lib/python3.9/site-packages/urllib3/response.py", line 461, in _error_catcher
    raise ProtocolError("Connection broken: %r" % e, e)
urllib3.exceptions.ProtocolError: ("Connection broken: InvalidChunkLength(got length b'', 0 bytes read)", InvalidChunkLength(got length b'', 0 bytes read))
Process KubernetesJobWatcher-7:
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/urllib3/response.py", line 761, in _update_chunk_length
    self.chunk_left = int(line, 16)
ValueError: invalid literal for int() with base 16: b''

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/urllib3/response.py", line 444, in _error_catcher
    yield
  File "/usr/local/lib/python3.9/site-packages/urllib3/response.py", line 828, in read_chunked
    self._update_chunk_length()
  File "/usr/local/lib/python3.9/site-packages/urllib3/response.py", line 765, in _update_chunk_length
    raise InvalidChunkLength(self, line)
urllib3.exceptions.InvalidChunkLength: InvalidChunkLength(got length b'', 0 bytes read)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/usr/local/lib/python3.9/site-packages/airflow/providers/cncf/kubernetes/executors/kubernetes_executor_utils.py", line 112, in run
    self.resource_version = self._run(
  File "/usr/local/lib/python3.9/site-packages/airflow/providers/cncf/kubernetes/executors/kubernetes_executor_utils.py", line 168, in _run
    for event in self._pod_events(kube_client=kube_client, query_kwargs=kwargs):
  File "/usr/local/lib/python3.9/site-packages/kubernetes/watch/watch.py", line 178, in stream
    for line in iter_resp_lines(resp):
  File "/usr/local/lib/python3.9/site-packages/kubernetes/watch/watch.py", line 56, in iter_resp_lines
    for segment in resp.stream(amt=None, decode_content=False):
  File "/usr/local/lib/python3.9/site-packages/urllib3/response.py", line 624, in stream
    for line in self.read_chunked(amt, decode_content=decode_content):
  File "/usr/local/lib/python3.9/site-packages/urllib3/response.py", line 857, in read_chunked
    self._original_response.close()
  File "/usr/local/lib/python3.9/contextlib.py", line 137, in __exit__
    self.gen.throw(typ, value, traceback)
  File "/usr/local/lib/python3.9/site-packages/urllib3/response.py", line 461, in _error_catcher
    raise ProtocolError("Connection broken: %r" % e, e)
urllib3.exceptions.ProtocolError: ("Connection broken: InvalidChunkLength(got length b'', 0 bytes read)", InvalidChunkLength(got length b'', 0 bytes read))

The scheduler works normally, but this error is generated continuously. The same chart does not produce any error in another cluster. What might be the issue?

P.S.: I have read this issue: https://github.com/apache/airflow/issues/33066 but could not draw any conclusion from it.
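
To narrow down whether the cluster side (the API server, or a load balancer/proxy with an idle timeout in front of it) is dropping the chunked watch connection rather than Airflow itself, a minimal sketch like the one below reproduces the same kind of long-lived pod watch that the KubernetesJobWatcher keeps open. This is not the scheduler's own code, and the "airflow" namespace is an assumption; use whichever namespace your worker pods run in.

# Minimal sketch: open a long-lived pod watch, leave it idle, and see whether
# it dies with the same InvalidChunkLength / ProtocolError.
# Assumption: the worker pods live in the "airflow" namespace.
from kubernetes import client, config, watch

try:
    config.load_incluster_config()   # when running inside a pod in the cluster
except config.ConfigException:
    config.load_kube_config()        # fall back to a local kubeconfig

v1 = client.CoreV1Api()
w = watch.Watch()
try:
    for event in w.stream(v1.list_namespaced_pod, namespace="airflow"):
        print(event["type"], event["object"].metadata.name)
except Exception as exc:
    print(f"watch stream broke: {exc!r}")

If this standalone watch also breaks after a period of inactivity, the problem is most likely the API server or an intermediate proxy closing idle connections, which would also explain why the same chart behaves differently in another cluster.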

What you think should happen instead

The scheduler should not produce any errors. The KubernetesJobWatcher appears to be the component at the core of this issue.

How to reproduce

Use the official Helm chart 1.13.1 with these configs:

airflowVersion: 2.8.3

# Ingress configuration
ingress:
  enabled: false
  # Enable web ingress resource
  web:
    enabled: True
    # Annotations for the web Ingress

config:
  core:
    parallelism: 32
    max_active_tasks_per_dag: 16
    max_active_runs_per_dag: 16
    dagbag_import_timeout: 100
    dag_file_processor_timeout: 50
    min_serialized_dag_update_interval: 60
    min_serialized_dag_fetch_interval: 30

pgbouncer:
  # Enable PgBouncer
  enabled: true

scheduler:
  replicas: 1
  resources:
    limits:
      cpu: 1500m
      memory: 1500Mi
    requests:
      cpu: 1000m
      memory: 1200Mi

  livenessProbe:
    initialDelaySeconds: 120
    timeoutSeconds: 30
    failureThreshold: 5
    periodSeconds: 300
    command: ~

webserver:
  replicas: 1
  resources:
    limits:
      cpu: 1000m
      memory: 1500Mi
    requests:
      cpu: 500m
      memory: 1200Mi

  livenessProbe:
    initialDelaySeconds: 15
    timeoutSeconds: 30
    failureThreshold: 20
    periodSeconds: 5

  readinessProbe:
    initialDelaySeconds: 15
    timeoutSeconds: 30
    failureThreshold: 20
    periodSeconds: 5

  webserverConfig: |
    from flask_appbuilder.security.manager import AUTH_DB
    # use embedded DB for auth
    AUTH_TYPE = AUTH_DB

dags:
  persistence:
    enabled: true
    size: 10Gi
    accessMode: ReadWriteMany

logs:
  persistence:
    enabled: true
    size: 10Gi

postgresql:
  enabled: True

Anything else

This occurs continuously whenever the Airflow scheduler is idle (i.e., not running any DAGs).


boring-cyborg[bot] commented 2 months ago

Thanks for opening your first issue here! Be sure to follow the issue template! If you are willing to raise a PR to address this issue, please do so; there is no need to wait for approval.

belek commented 1 month ago

Hello, does anyone know how to handle this error? I have the same problem.

shubham-vonage commented 1 month ago

The issue is with the apache-airflow-providers-cncf-kubernetes provider. Update it to version 8.4 and the issue will be resolved.
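
For completeness, one common way to get the newer provider into a deployment based on the official Helm chart is to bake it into a custom image and point the chart at that image. The sketch below assumes that approach; the registry, image name, and tag are hypothetical placeholders, and the image itself could be built FROM apache/airflow:2.8.3 with a pip install of apache-airflow-providers-cncf-kubernetes>=8.4.0.

# Sketch of Helm values for the official chart, pointing at a custom image
# with the upgraded provider baked in. Repository and tag are hypothetical.
images:
  airflow:
    repository: registry.example.com/airflow-patched
    tag: 2.8.3-cncf-k8s-8.4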