Closed tommyhutcheson closed 9 months ago
Expected key current-context in kube-config
Deferrable and non-deferrable operators have used exactly the same method to load the kube config file since 7.0.0, so I'm surprised that you get this exception only in deferrable mode.
Since your config file doesn't have a current-context (default context), I wonder if you added cluster_context to your K8S connection? You can also add it to your task if you want to test.
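To make the relationship between current-context and cluster_context concrete, here is a minimal sketch of the selection logic being discussed: an explicit context wins, otherwise the file's default is used. It is not the provider's or the Kubernetes client's code; the function name `resolve_context_name` and the sample config are hypothetical.

```python
from typing import Any, Mapping, Optional


def resolve_context_name(
    kube_config: Mapping[str, Any], cluster_context: Optional[str] = None
) -> str:
    """Pick a context: the explicit override first, then the file's default."""
    name = cluster_context or kube_config.get("current-context")
    if not name:
        raise ValueError("no cluster_context given and no current-context in kube config")
    # Collect the context names the config actually defines.
    known = [ctx.get("name") for ctx in (kube_config.get("contexts") or [])]
    if name not in known:
        raise ValueError(f"context {name!r} not found; available: {known}")
    return name


# Hypothetical parsed kube config that lacks a default context.
example_config = {
    "contexts": [{"name": "gke_example", "context": {"cluster": "c", "user": "u"}}],
}

print(resolve_context_name(example_config, cluster_context="gke_example"))
```

Without the `cluster_context` argument, the same call raises, because the sample config has no current-context to fall back on.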
Hi @hussein-awala
I have tried quite a few different configurations at this point, but there still seems to be an issue here.
When running the DAG below, the only task that completes is the deferrable-false task. The other two, which have deferrable set, appear to run the code and output hello world, and I see the status change to purple; however, the runs still fail with the error below. I have checked the kube-config file and I can see there is a contexts key. I have re-opened my Google support case, asking their product team to test the DAG themselves and install composer-2.4.3-airflow-2.5.3. If there are any other suggestions, please let me know.
[2023-10-13, 14:38:27 UTC] {standard_task_runner.py:100} ERROR - Failed to execute job 48004 for task deferrable-true-extended-conf (Invalid kube-config file. Expected key contexts in kube-config; 743598)
from airflow.providers.cncf.kubernetes.operators.pod import KubernetesPodOperator
import airflow
from airflow import DAG
from datetime import timedelta

default_args = {
    'start_date': airflow.utils.dates.days_ago(0),
    'retries': 1,
    'retry_delay': timedelta(minutes=5)
}

with DAG(
    'tommy_test_kub_simple_dag',
    default_args=default_args,
    description='liveness monitoring dag',
    schedule_interval='*/10 * * * *',
    max_active_runs=2,
    catchup=False,
    dagrun_timeout=timedelta(minutes=10),
) as dag:
    task1 = KubernetesPodOperator(
        name="deferrable-true",
        image="python:3.11-slim",
        cmds=['python', '-c', "print('hello world')"],
        task_id="deferrable-true",
        config_file="/home/airflow/composer_kube_config",
        deferrable=True,
        in_cluster=False
    )
    task2 = KubernetesPodOperator(
        name="deferrable-false",
        image="python:3.11-slim",
        cmds=['python', '-c', "print('hello world')"],
        task_id="deferrable-false",
        config_file="/home/airflow/composer_kube_config",
        deferrable=False,
        in_cluster=False
    )
    task3 = KubernetesPodOperator(
        name="deferrable-true-extended-conf",
        image="python:3.11-slim",
        cmds=['python', '-c', "print('hello world')"],
        task_id="deferrable-true-extended-conf",
        kubernetes_conn_id="kubernetes_default",
        deferrable=True,
        in_cluster=False,
        cluster_context="gke_my_orchestrater_id",
        config_file="/home/airflow/composer_kube_config",
    )
    task1
    task2
    task3
@tommyhutcheson what do you think about avoiding those files and providing the Kube Config in JSON? I think it should be possible. Having blocking operations (e.g. file handling) in deferrable mode is in most cases bad design, and I think the community should aim to avoid that everywhere. Let me know if it is possible for you to provide this configuration via JSON.
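As a rough illustration of the JSON approach, the sketch below serializes a kube config into the extra of a Kubernetes connection, so that nothing has to be read from a file on the triggerer. The config contents are hypothetical and heavily trimmed, and the extra field names `in_cluster` and `kube_config` are assumptions that should be checked against the documentation of the provider version in use.

```python
import json

# Hypothetical, heavily trimmed kube config; a real one carries certificates and credentials.
kube_config = {
    "apiVersion": "v1",
    "kind": "Config",
    "current-context": "example-context",
    "contexts": [
        {"name": "example-context", "context": {"cluster": "example-cluster", "user": "example-user"}}
    ],
    "clusters": [{"name": "example-cluster", "cluster": {"server": "https://203.0.113.10"}}],
    "users": [{"name": "example-user", "user": {}}],
}

# Assumed extra field names; verify them against the provider version in use.
connection_extra = {
    "in_cluster": False,
    "kube_config": json.dumps(kube_config),  # the whole config embedded as a JSON string
}

print(json.dumps(connection_extra, indent=2))
```

A connection built this way holds cluster credentials, so it would need the same care as any other secret.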
I created a BashOperator which prints the kube config in the Composer env. The log shows it actually has current-context. So it is not really a problem with the kube config itself; somehow the config was not used correctly in async mode.
from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator
dag = DAG('print_kube_config', description='Print Kube Config', schedule_interval='@once', start_date=datetime(2023, 8, 17))
bash_operator = BashOperator(task_id='print_kube_config_task', bash_command='cat /home/airflow/composer_kube_config', dag=dag)
bash_operator
I reproduced the problem with the following DAG. Interestingly, the pod / container actually succeeded: we can see Container logs: hello world deferrable=true in the log, but the error happened after that. In other words, the pod succeeded, but the task failed.
DAG:
from airflow.providers.cncf.kubernetes.operators.pod import KubernetesPodOperator
import airflow
from airflow import DAG
from datetime import timedelta

default_args = {
    'start_date': airflow.utils.dates.days_ago(0),
    'retries': 1,
    'retry_delay': timedelta(minutes=5)
}

with DAG(
    'kpo',
    default_args=default_args,
    description='KPO',
    schedule_interval='@once',
    max_active_runs=2,
    catchup=False,
    dagrun_timeout=timedelta(minutes=10),
) as dag:
    task1 = KubernetesPodOperator(
        name="deferrable-true",
        image="python:3.11-slim",
        cmds=['python', '-c', "print('hello world deferrable=true')"],
        task_id="deferrable-true",
        config_file="/home/airflow/composer_kube_config",
        deferrable=True,
        in_cluster=False
    )
    task1
Logs:
...
Running: ['airflow', 'tasks', 'run', 'kpo', 'deferrable-true', 'scheduled__2023-11-17T00:00:00+00:00', '--job-id', '5896', '--raw', '--subdir', 'DAGS_FOLDER/kpo.py', '--cfg-path', '/tmp/tmp5vwp6_0t']
...
Container logs: hello world deferrable=true
2023-11-17 05:03:48.808 UTC
Container logs:
2023-11-17 05:03:48.826 UTC
Pod deferrable-true-b6ju3o4y has phase Pending
2023-11-17 05:03:50.852 UTC
Deleting pod: deferrable-true-b6ju3o4y
2023-11-17 05:03:51.075 UTC
Task failed with exception Traceback (most recent call last): File "/opt/python3.8/lib/python3.8/site-packages/airflow/providers/cncf/kubernetes/operators/pod.py", line 648, in execute_complete raise AirflowException(event["message"]) airflow.exceptions.AirflowException: Invalid kube-config file. Expected key current-context in kube-config
2023-11-17 05:03:51.083 UTC
Marking task as FAILED. dag_id=kpo, task_id=deferrable-true, execution_date=20231117T000000, start_date=20231117T050342, end_date=20231117T050351
2023-11-17 05:03:51.121 UTC
Failed to execute job 5896 for task deferrable-true (Invalid kube-config file. Expected key current-context in kube-config; 1262793)
Does anybody know the possible reason? Or is there a way to enable debug logs in KubernetesPodTrigger?
@hussein-awala, any idea about the observations above?
I feel like the error messages and logs need to be improved. The current ones are not sufficient to figure out what went wrong in the trigger/hook.
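On the question of debug logs, one generic option is to lower the level of the provider's logger with the standard `logging` module. This is only a sketch: it assumes the trigger and hook log under their module paths (so the parent name `airflow.providers.cncf.kubernetes` covers them), and whether it has any effect in a managed triggerer depends on how Airflow sets up its handlers there. Airflow also has a `logging_level` option in its `[logging]` configuration section, which may be the simpler route.

```python
import logging

# Assumption: provider classes log under their module path, so this parent
# logger covers the trigger and hook modules beneath it.
PROVIDER_LOGGER = "airflow.providers.cncf.kubernetes"


def enable_provider_debug_logging() -> logging.Logger:
    """Set the provider's parent logger to DEBUG and return it."""
    logger = logging.getLogger(PROVIDER_LOGGER)
    logger.setLevel(logging.DEBUG)
    return logger


enable_provider_debug_logging()

# Child loggers without their own level inherit the parent's effective level.
child = logging.getLogger(PROVIDER_LOGGER + ".triggers.pod")
print(child.getEffectiveLevel() == logging.DEBUG)
```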
This issue has been automatically marked as stale because it has been open for 14 days with no response from the author. It will be closed in next 7 days if no further activity occurs from the issue author.
Friendly ping!
@potiuk @hussein-awala can you please chime in?
Issue ongoing keeping ticket open with author comment.
The lack of more details seems to be because the message is not coming from airflow but from the POD. The message is "just" displayed by the airflow's KPO and the error is somewhere on the POD.
There are likely two ways you can address your problem @tommyhutcheson:
If you can upgrade to the latest version of the Kubernetes provider, there is a chance that a) the problem has been fixed there and b) logging will be more comprehensive. You are using a pretty old version of the provider (7.3.0), while 7.11.0 is already available, and numerous improvements and bug fixes have been implemented since; see https://airflow.apache.org/docs/apache-airflow-providers-cncf-kubernetes/stable/changelog.html . One of the changes in 7.7.0, for example, improved multi-line handling of logs by KPO (https://github.com/apache/airflow/pull/34412), but there were many more. If you want to be more certain whether you should upgrade, I recommend reading the changelog in detail. Each change there references a PR number, and in the PR you will find a detailed description of the change, since this is an open-source project and everything is "in the clear".
Another option is to look at the detailed logs of the K8s Pods that are failing. I do not know KPO that well, or how much flexibility the K8S interface in Composer gives you, but I believe you can configure KPO so that it does not clear Pods immediately after they run, which lets you inspect much more information. You seem to recognize the need for more detailed information about what is going on, so if you would like to analyse it in detail before upgrading, this is likely the way to go.
However, I'd urge you to upgrade everything you can first. Many of our users experience problems that have long been solved, and in this case quite a lot of fixes have been implemented since your version. You can either confirm first that you should upgrade (in which case I advise a detailed analysis of the changelog) or just upgrade and see if you still experience the problem. The latter is usually faster and takes less time, both for you and for the volunteers here, who have no time to go through the detailed changelog just to check whether particular problems have been fixed. This is an open-source project, so people here help when they have time (they are not paid for it). In cases like this, it is largely on the user to make the effort to upgrade to the latest version when they experience problems in an older one, especially when there have been many fixes since.
Please let us know after you investigate and (hopefully) upgrade how the things go, so that we can (hopefully) close the ticket - in the meantime I mark it as pending response.
Sorry for the delay. I will try to reproduce it and implement a fix before the next providers' release wave.
Could you provide the Kubernetes conn you are using in your operator? (you can hide the confidential information)
I tested KubernetesPodTrigger and AsyncKubernetesHook with many configuration combinations, and they both worked as expected.
I was able to reproduce the exception in only one case: are you sure the Kubernetes configuration file exists in the Triggerer pod, at the same path as in the worker? In all the reports you provided, you confirmed that the config file is present in the worker (for example, when you tested with BashOperator). Still, no one mentioned the Triggerer in their investigation.
@hussein-awala thanks for your input! I will double check whether the config exists in the triggerer pod. But if that was the cause, do you know why, in my previous repro, the operator container/pod succeeded with the log hello world deferrable=true, given that the trigger should have failed to trigger it?
It could work with a version < 7.0.0, but since https://github.com/apache/airflow/issues/31322 the behavior has changed: we no longer convert the file to a dict and pass it to the Trigger; instead, we pass the config file path and load the file in the Trigger. This PR was a bug fix, and there was also another reason for it (will explain more later).
That would certainly explain the behaviour .. Nice one @hussein-awala :)
It could work with a version < 7.0.0, but since #31322 ...
It doesn't seem to be related, because the original problem occurred with 7.3.0, and I also reproduced the error with 7.9.0 in composer-2.5.2-airflow-2.6.3. I'm still trying to figure out how to verify the file exists in Trigger Pod. I'd appreciate if someone can provide a sample code for that!
It doesn't seem to be related, because the original problem occurred with 7.3.0, and I also reproduced the error with 7.9.0 in composer-2.5.2-airflow-2.6.3.
I said that deferrable mode works fine in a version <7.0.0 without adding the config file to the triggerer; reproducing the problem with 7.3.0 and 7.9.0 does not contradict what I said.
I'm still trying to figure out how to verify the file exists in Trigger Pod. I'd appreciate if someone can provide a sample code for that!
Quickly checking, I would say that providing extra files to the triggerer is impossible. I recommend contacting the support team of GCP to check with them if this is possible or not, and how they can support it if it's not supported.
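For the request above about sample code to verify the file, a minimal check could look like the sketch below. It only reports what the calling process can see, so it would have to run inside the triggerer itself (for example from the `run` method of a small custom trigger, which is an assumption about how one might get code to execute there); running it from a BashOperator or PythonOperator only inspects the worker. The function name `describe_config_file` is hypothetical.

```python
import os
import socket
from typing import Any, Dict


def describe_config_file(path: str) -> Dict[str, Any]:
    """Report whether `path` is visible from the process that runs this function."""
    exists = os.path.isfile(path)
    return {
        "host": socket.gethostname(),  # helps tell worker and triggerer output apart
        "path": path,
        "exists": exists,
        "readable": exists and os.access(path, os.R_OK),
        "size_bytes": os.path.getsize(path) if exists else None,
    }


print(describe_config_file("/home/airflow/composer_kube_config"))
```

Comparing the output from a worker task with the output from the triggerer would show whether both see the same file at the same path.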
Before https://github.com/apache/airflow/pull/29498 the flow was as follows: config_file was read and deserialized to a map. This way 2 things were achieved:
@hussein-awala Why has this process been reverted in https://github.com/apache/airflow/pull/29498 ?
This PR was a bug fix, also there was another reason for it (will explain more later).
For the second reason, here it is: https://www.cve.org/CVERecord?id=CVE-2023-51702
We're working on an improvement for the trigger data stored in the database; once it's released, we will check how we can fix this issue.
This issue has been closed because it has not received response from the issue author.
I am experiencing the same issue today, and there appears to be no clear resolution in the comments above. Was this issue resolved? If so, how?
same problem with
airflow==2.10.2 apache-airflow-providers-cncf-kubernetes==9.0.0
kind 0.24.0
kind get kubeconfig --internal > conf/kube_conf
airflow connection
"kubernetes_default": {
"conn_type": "kubernetes",
"extra": "{\"extra__kubernetes__in_cluster\": false, \"extra__kubernetes__kube_config_path\": \"/opt/airflow/include/.kube/config\", \"extra__kubernetes__namespace\": \"default\", \"extra__kubernetes__cluster_context\": \"kind-kind\", \"extra__kubernetes__disable_verify_ssl\": false, \"extra__kubernetes__disable_tcp_keepalive\": false, \"xcom_sidecar_container_image\": \"alpine:3.16.2\"}"
}
[2024-10-30, 16:03:51 UTC] {base.py:84} INFO - Retrieving connection 'kubernetes_default'
[2024-10-30, 16:03:51 UTC] {pod.py:1138} INFO - Building pod airflow-test-pod-xb472z6z with labels: {'dag_id': 'kubernetes_dag', 'task_id': 'task-one', 'run_id': 'manual__2024-10-30T160350.5676190000-e2f6ac7ec', 'kubernetes_pod_operator': 'True', 'try_number': '1'}
[2024-10-30, 16:03:51 UTC] {taskinstance.py:288} INFO - Pausing task as DEFERRED. dag_id=kubernetes_dag, task_id=task-one, run_id=manual__2024-10-30T16:03:50.567619+00:00, execution_date=20241030T160350, start_date=20241030T160351
[2024-10-30, 16:03:51 UTC] {taskinstance.py:340} ▼ Post task execution logs
[2024-10-30, 16:03:51 UTC] {local_task_job_runner.py:260} INFO - Task exited with return code 100 (task deferral)
[2024-10-30, 16:03:51 UTC] {local_task_job_runner.py:245} ▲▲▲ Log group end
[2024-10-30, 16:03:52 UTC] {pod.py:160} INFO - Checking pod 'airflow-test-pod-xb472z6z' in namespace 'default'.
[2024-10-30, 16:03:52 UTC] {base.py:84} INFO - Retrieving connection 'kubernetes_default'
[2024-10-30, 16:03:52 UTC] {kube_config.py:515} WARNING - Config not found: /home/airflow/.kube/config
[2024-10-30, 16:03:52 UTC] {triggerer_job_runner.py:631} INFO - Trigger kubernetes_dag/manual__2024-10-30T16:03:50.567619+00:00/task-one/-1/1 (ID 10) fired: TriggerEvent<{'name': 'airflow-test-pod-xb472z6z', 'namespace': 'default', 'status': 'error', 'message': 'Invalid kube-config file. Expected key contexts in kube-config', 'stack_trace': 'Traceback (most recent call last):\n File "/home/airflow/.local/lib/python3.12/site-packages/airflow/providers/cncf/kubernetes/triggers/pod.py", line 162, in run\n state = await self._wait_for_pod_start()\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File "/home/airflow/.local/lib/python3.12/site-packages/airflow/providers/cncf/kubernetes/triggers/pod.py", line 223, in _wait_for_pod_start\n pod = await self.hook.get_pod(self.pod_name, self.pod_namespace)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File "/home/airflow/.local/lib/python3.12/site-packages/airflow/providers/cncf/kubernetes/hooks/kubernetes.py", line 754, in get_pod\n async with self.get_conn() as connection:\n ^^^^^^^^^^^^^^^\n File "/usr/local/lib/python3.12/contextlib.py", line 210, in __aenter__\n return await anext(self.gen)\n ^^^^^^^^^^^^^^^^^^^^^\n File "/home/airflow/.local/lib/python3.12/site-packages/airflow/providers/cncf/kubernetes/hooks/kubernetes.py", line 741, in get_conn\n kube_client = await self._load_config() or async_client.ApiClient()\n ^^^^^^^^^^^^^^^^^^^^^^^^^\n File "/home/airflow/.local/lib/python3.12/site-packages/airflow/providers/cncf/kubernetes/hooks/kubernetes.py", line 711, in _load_config\n await async_config.load_kube_config(\n File "/home/airflow/.local/lib/python3.12/site-packages/kubernetes_asyncio/config/kube_config.py", line 603, in load_kube_config\n loader = _get_kube_config_loader_for_yaml_file(\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File "/home/airflow/.local/lib/python3.12/site-packages/kubernetes_asyncio/config/kube_config.py", line 567, in _get_kube_config_loader_for_yaml_file\n return KubeConfigLoader(\n 
^^^^^^^^^^^^^^^^^\n File "/home/airflow/.local/lib/python3.12/site-packages/kubernetes_asyncio/config/kube_config.py", line 150, in __init__\n self.set_active_context(active_context)\n File "/home/airflow/.local/lib/python3.12/site-packages/kubernetes_asyncio/config/kube_config.py", line 162, in set_active_context\n self._current_context = self._config[\'contexts\'].get_with_name(\n ~~~~~~~~~~~~^^^^^^^^^^^^\n File "/home/airflow/.local/lib/python3.12/site-packages/kubernetes_asyncio/config/kube_config.py", line 448, in __getitem__\n raise ConfigException(\nkubernetes_asyncio.config.config_exception.ConfigException: Invalid kube-config file. Expected key contexts in kube-config\n'}>
[2024-10-30, 16:03:54 UTC] {local_task_job_runner.py:123} ▼ Pre task execution logs
from airflow import DAG
from airflow.utils.dates import days_ago
from airflow.providers.cncf.kubernetes.operators.pod import KubernetesPodOperator

dag = DAG(
    dag_id="kubernetes_dag",
    schedule_interval=None,
    start_date=days_ago(1),
)

with dag:
    cmd = "echo toto && sleep 30 && echo finish && exit 1"
    KubernetesPodOperator(
        task_id="task-one",
        namespace="default",
        image_pull_policy="Never",
        kubernetes_conn_id="kubernetes_default",
        name="airflow-test-pod",
        image="alpine:3.16.2",
        cmds=["sh", "-c", cmd],
        deferrable=True,
        poll_interval=100,
        do_xcom_push=True,
    )
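In the log above, the warning Config not found: /home/airflow/.kube/config appears immediately before the Expected key contexts exception, and the path in the warning is not the one set in the connection extra. That ordering is consistent with a loader that treats a missing file as an empty config and only fails later, when a key is looked up. The sketch below imitates that behaviour with stand-in names (`ConfigError`, `load_config_or_empty`, `require_key`); it is a simplified illustration, not the client library's code.

```python
import os
from typing import Any, Dict


class ConfigError(Exception):
    """Stand-in for the client library's configuration exception."""


def load_config_or_empty(path: str) -> Dict[str, Any]:
    """Return an empty mapping when the file is missing, after printing a warning."""
    if not os.path.exists(path):
        print(f"WARNING - Config not found: {path}")
        return {}
    raise NotImplementedError("parsing a real file is out of scope for this sketch")


def require_key(config: Dict[str, Any], key: str) -> Any:
    """Fail with the message seen in the thread when a key is absent."""
    if key not in config:
        raise ConfigError(f"Invalid kube-config file. Expected key {key} in kube-config")
    return config[key]


config = load_config_or_empty("/path/that/does/not/exist/kube_config")
try:
    require_key(config, "contexts")
except ConfigError as exc:
    print(exc)
```

Under this reading, the key named in the message depends on which lookup happens first: the reports in this thread show contexts when cluster_context is set and current-context when it is not.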
@MCMcCallum @raphaelauv -> when you encounter a closed issue (especially one closed months ago) with a similar description, the best course of action is to open a new one and describe your circumstances and case, ideally referring to the old issue as related.
This makes it possible to focus on your issue, which might or might not be related, even if the error message is similar. You also get a chance to restart the discussion, focusing on what are likely much fresher circumstances: your Airflow version, your K8s provider version, etc. When you add "another" set of things to an existing closed issue, it's entirely unclear to anyone looking at it how to reproduce it. Is it the same issue? Or a different one? Should I look at the original report or the new one? etc.
Also, by opening the issue you own it as the author, and when a maintainer asks questions or marks it as "needs more information" it's clear that it's you who should provide it, not the original author. It's also much more likely that you will, because it is "fresh".
So I heartily recommend doing so.
Apache Airflow version
Other Airflow 2 version (please specify below)
What happened
Hello. We are trying to use the deferrable option with the KubernetesPodOperator for the first time, but we can't get past the error Invalid kube-config file. Expected key current-context in kube-config when using deferrable=True. We run Airflow via Cloud Composer in GCP and first upgraded to composer-2.3.2-airflow-2.5.1, but we still had the issue, so we upgraded again to composer-2.4.3-airflow-2.5.3, having seen some posts about a fix in version 7.0.0, but we're still faced with the issue.
I have stripped the DAG operator back to basics:
The operator is being imported with from airflow.providers.cncf.kubernetes.operators.pod import KubernetesPodOperator
Error:
The Cloud Composer 2 guide states that we need to set config_file to /home/airflow/composer_kube_config, but the error we get seems to imply that the file is missing an expected key, current-context. I raised this with Google, who suggested raising this ticket here, stating that their product team confirmed that this issue is due to a problem with Airflow.
What you think should happen instead
The DAG runs with the deferrable=True parameter set, in this case printing hello world to the logs.
How to reproduce
Deploy the sample DAG to Airflow 2.5.3, if possible using Cloud Composer 2.4.3.
Operating System
debian:11-slim
Versions of Apache Airflow Providers
Deployment
Google Cloud Composer
Deployment details
composer-2.4.3-airflow-2.5.3
Anything else
No response