apache / airflow

Apache Airflow - A platform to programmatically author, schedule, and monitor workflows
https://airflow.apache.org/
Apache License 2.0

KubernetesPodOperator : TypeError: 'NoneType' object is not iterable #19369

Closed raphaelauv closed 2 years ago

raphaelauv commented 3 years ago

Apache Airflow Provider(s)

cncf-kubernetes

Versions of Apache Airflow Providers

apache-airflow-providers-cncf-kubernetes==2.0.2

Apache Airflow version

2.1.2

Operating System

GCP Container-Optimized OS

Deployment

Composer

Deployment details

No response

What happened

[2021-11-02 14:16:27,086] {pod_launcher.py:193} ERROR - Error parsing timestamp. Will continue execution but won't update timestamp
[2021-11-02 14:16:27,086] {pod_launcher.py:149} INFO - rpc error: code = NotFound desc = an error occurred when try to find container "8f8c2f3dce295f70ba5d60175ff847854e05ab288f7efa3ce6d0bd976d0378ea": not found
[2021-11-02 14:16:28,152] {taskinstance.py:1503} ERROR - Task failed with exception
Traceback (most recent call last):
  File "/opt/python3.8/lib/python3.8/site-packages/airflow/models/taskinstance.py", line 1158, in _run_raw_task
    self._prepare_and_execute_task_with_callbacks(context, task)
  File "/opt/python3.8/lib/python3.8/site-packages/airflow/models/taskinstance.py", line 1333, in _prepare_and_execute_task_with_callbacks
    result = self._execute_task(context, task_copy)
  File "/opt/python3.8/lib/python3.8/site-packages/airflow/models/taskinstance.py", line 1363, in _execute_task
    result = task_copy.execute(context=context)
  File "/opt/python3.8/lib/python3.8/site-packages/airflow/providers/cncf/kubernetes/operators/kubernetes_pod.py", line 367, in execute
    final_state, remote_pod, result = self.create_new_pod_for_operator(labels, launcher)
  File "/opt/python3.8/lib/python3.8/site-packages/airflow/providers/cncf/kubernetes/operators/kubernetes_pod.py", line 521, in create_new_pod_for_operator
    final_state, remote_pod, result = launcher.monitor_pod(pod=self.pod, get_logs=self.get_logs)
  File "/opt/python3.8/lib/python3.8/site-packages/airflow/providers/cncf/kubernetes/utils/pod_launcher.py", line 154, in monitor_pod
    if not self.base_container_is_running(pod):
  File "/opt/python3.8/lib/python3.8/site-packages/airflow/providers/cncf/kubernetes/utils/pod_launcher.py", line 215, in base_container_is_running
    status = next(iter(filter(lambda s: s.name == 'base', event.status.container_statuses)), None)
TypeError: 'NoneType' object is not iterable

In the GKE logs I see:

airflow-XXX.XXXXX.b14f7e38312c4e83984d1d7a3987655a
"Pod ephemeral local storage usage exceeds the total limit of containers 10Mi. "

So the pod failed because I set too low a limit for ephemeral local storage, but the Airflow operator should fail the task normally instead of raising an unhandled exception.

What you expected to happen

The KubernetesPodOperator should handle this kind of error gracefully.
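
The traceback points at `filter(..., event.status.container_statuses)`, which raises because `container_statuses` is `None` for this pod. The snippet below is only a minimal sketch of a None-safe lookup, not the provider's actual code or the eventual fix: the pod objects are `SimpleNamespace` stand-ins, and the return logic after the lookup is an assumption made for illustration.

```python
from types import SimpleNamespace


def base_container_is_running(event):
    """Return True only if the 'base' container reports a running state.

    `event` is a stand-in for the pod object read from the Kubernetes API.
    """
    # container_statuses may be None (as in this report), so fall back to an empty list.
    statuses = event.status.container_statuses or []
    status = next((s for s in statuses if s.name == "base"), None)
    if status is None:
        # Assumption: no status for 'base' is treated as "not running".
        return False
    return status.state.running is not None


# Stand-in for the failing case: the pod reports no container statuses.
evicted = SimpleNamespace(status=SimpleNamespace(container_statuses=None))

# Stand-in for a healthy pod whose 'base' container is running.
running = SimpleNamespace(
    status=SimpleNamespace(
        container_statuses=[
            SimpleNamespace(name="base", state=SimpleNamespace(running=object()))
        ]
    )
)

print(base_container_is_running(evicted))
print(base_container_is_running(running))
```

With a guard like this, the monitoring loop would see "not running" and could go on to report the pod's final state rather than crashing with a `TypeError`.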

How to reproduce

Launch a KubernetesPodOperator with a very small ephemeral storage limit and run a container that uses more than this limit:

from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import KubernetesPodOperator

KubernetesPodOperator(
    task_id="XXX",
    name="XXXXXXXXX",
    namespace="default",
    resources={
        "limit_memory": "512M",
        "limit_cpu": "250m",
        "limit_ephemeral_storage": "10M",
    },
    is_delete_operator_pod=True,
    image="any_image_using_more_than_limit_ephemeral_storage",
)

Anything else

No response

Are you willing to submit PR?

Code of Conduct

potiuk commented 2 years ago

Sounds like an interesting edge case, and rather simple to debug and track. Would you like to make a PR with a fix for it, @raphaelauv?

ulisesojeda commented 2 years ago

@raphaelauv could you provide an image example?

raphaelauv commented 2 years ago

@ulisesojeda

Minimal reproducible example:

with dag:
    fail_cmd = "fallocate -l 100M toto.img && sleep 60 && echo finish"
    k = KubernetesPodOperator(
        task_id="task-one",
        namespace="default",
        name="airflow-test-pod",
        image="alpine",
        resources={"limit_ephemeral_storage": "10M"},
        cmds=["sh", "-c", fail_cmd],
        is_delete_operator_pod=True,
        get_logs=True,
        in_cluster=False,
    )

ulisesojeda commented 2 years ago

Thanks @raphaelauv. I've opened this PR https://github.com/apache/airflow/pull/19713

@potiuk could you check it please?