I get the following error a lot on my Airflow scheduler pods:
{kubernetes_executor_utils.py:121} ERROR - Unknown error in KubernetesJobWatcher. Failing
Traceback (most recent call last):
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/providers/cncf/kubernetes/executors/kubernetes_executor_utils.py", line 112, in run
    self.resource_version = self._run(
                            ^^^^^^^^^^
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/providers/cncf/kubernetes/executors/kubernetes_executor_utils.py", line 168, in _run
    for event in self._pod_events(kube_client=kube_client, query_kwargs=kwargs):
  File "/home/airflow/.local/lib/python3.8/site-packages/kubernetes/watch/watch.py", line 182, in stream
    raise client.rest.ApiException(
kubernetes.client.exceptions.ApiException: (410)
Reason: Expired: too old resource version: 725263658 (725300129)
Process KubernetesJobWatcher-8:
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/providers/cncf/kubernetes/executors/kubernetes_executor_utils.py", line 112, in run
    self.resource_version = self._run(
                            ^^^^^^^^^^
  File "/home/airflow/.local/lib/python3.8/site-packages/airflow/providers/cncf/kubernetes/executors/kubernetes_executor_utils.py", line 168, in _run
    for event in self._pod_events(kube_client=kube_client, query_kwargs=kwargs):
  File "/home/airflow/.local/lib/python3.8/site-packages/kubernetes/watch/watch.py", line 182, in stream
    raise client.rest.ApiException(
kubernetes.client.exceptions.ApiException: (410)
Reason: Expired: too old resource version: 725263658 (725300129)
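For diagnosis, the two numbers in the `Reason` line can be pulled out and compared. The sketch below only parses the message text exactly as it appears in the traceback above; `parse_expired_message` is a hypothetical helper, not part of Airflow or the Kubernetes client. It assumes the first number is the version the watcher asked for and the number in parentheses is a newer version reported by the server. Kubernetes documents resource versions as opaque, so the numeric gap is at best a rough hint of how far behind the watcher had fallen.

```python
import re

# Matches the message text shown in the traceback above.
_EXPIRED_RE = re.compile(r"too old resource version: (\d+) \((\d+)\)")


def parse_expired_message(message):
    """Return (requested, server_side) as ints, or None if the text does not match."""
    match = _EXPIRED_RE.search(message)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


reason = "Expired: too old resource version: 725263658 (725300129)"
requested, server_side = parse_expired_message(reason)

# Rough indicator only: resource versions are officially opaque values.
gap = server_side - requested
print(f"requested={requested} server_side={server_side} gap={gap}")
```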
When this error appears many times on my Airflow scheduler pods, all my DAG runs become very slow. The number of "scheduled" slots is very high, while the number of "queued" and "running" slots is very low (about 15 slots combined), even though I have defined 128 slots.
Also, resource utilization in my namespace is very low (20% CPU and memory usage), so the problem is not resources either.
NOTE: I use the package "apache-airflow-providers-cncf-kubernetes" at version 8.0.0, as required for Airflow 2.8.2 according to the constraints.
What you think should happen instead?
I think Airflow should handle this error so that, even when it is thrown, the scheduler continues to work properly and does not "freeze".
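A minimal sketch of the kind of handling meant here, not Airflow's actual implementation: when the watch fails with status 410, drop the stale resource version and restart the watch from the current state instead of failing the watcher. To keep the example runnable without a cluster, `GoneError`, `watch_with_reset`, and `fake_stream` are hypothetical names; with the real client, the exception to catch would be `kubernetes.client.exceptions.ApiException`, which is the one raised in the traceback and which carries a `status` attribute.

```python
from typing import Callable, Iterable, Iterator, Optional, Tuple


class GoneError(Exception):
    """Hypothetical stand-in for kubernetes.client.exceptions.ApiException."""

    def __init__(self, status: int, reason: str = "") -> None:
        super().__init__(f"({status}) {reason}")
        self.status = status
        self.reason = reason


Event = Tuple[str, str]  # simplified event: (resource_version, payload)


def watch_with_reset(
    stream: Callable[[Optional[str]], Iterable[Event]],
    start_version: Optional[str],
    max_restarts: int = 3,
) -> Iterator[Event]:
    """Yield events; on a 410 response, restart the watch without a version."""
    version = start_version
    restarts = 0
    while True:
        try:
            for event in stream(version):
                version = event[0]  # remember the last version seen
                yield event
            return  # the stream ended normally
        except GoneError as err:
            # Anything other than 410, or too many restarts, is re-raised.
            if err.status != 410 or restarts >= max_restarts:
                raise
            restarts += 1
            version = None  # None here means "start from the current state"


def fake_stream(version: Optional[str]) -> Iterable[Event]:
    """Simulated watch: the stale version from the traceback is rejected."""
    if version == "725263658":
        raise GoneError(410, "Expired: too old resource version: 725263658 (725300129)")
    return [("725300130", "pod-a Running"), ("725300131", "pod-a Succeeded")]


events = list(watch_with_reset(fake_stream, "725263658"))
print(events)
```

Restarting without a version can miss events that happened between the failure and the restart, so a real watcher would typically list the pods again after the reset to reconcile their state.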
How to reproduce
I think it would happen on any deployment in this version of Airflow with running DAGs.
Operating System
RHEL 8
Versions of Apache Airflow Providers
apache-airflow-providers-cncf-kubernetes==8.0.0
Deployment
Other
Deployment details
We are in a private cloud with constraints; we used most of the chart but handled the constraints ourselves.
Apache Airflow version
Other Airflow 2 version (please specify below)
If "Other Airflow 2 version" selected, which one?
2.8.2
What happened?
Anything else?
No response