apache / airflow

Apache Airflow - A platform to programmatically author, schedule, and monitor workflows
https://airflow.apache.org/
Apache License 2.0
37.12k stars 14.31k forks

Not easy to know what is in context passed to operators and call_backs #26818

Closed zazaho closed 2 years ago

zazaho commented 2 years ago

What do you see as an issue?

We are building a pipeline using Airflow, with KubernetesPodOperators and the KubernetesExecutor. We anticipate different reasons that a task might fail, and I would like to explore how much information about the reason and the moment of failure I can get from the context passed to the on_failure callback function. However, I find it difficult to understand what will be passed in the context. The only documentation I have been able to find that relates to this is a page describing templated variables, which seems not entirely appropriate: most of the variable descriptions simply say things like {{ ti }} task_instance object, which is not enough for me to know what information I can get by inspecting these objects.
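For illustration, here is a minimal sketch (assuming Airflow 2.x, where callbacks receive the context as a dict-like object; the exact set of keys varies by version) of a callback that simply dumps what it is given:

```python
# Sketch of an on_failure_callback that inspects what it receives
# (assumes Airflow 2.x; the exact set of context keys varies by version).
def log_failure_context(context):
    ti = context.get("task_instance")   # TaskInstance object for the failed task
    exc = context.get("exception")      # exception that caused the failure, if any
    print(f"Task {ti.task_id} failed with: {exc!r}")
    # Listing the keys is the quickest way to see what is actually available
    print("Available context keys:", sorted(context))
```

Such a callback could be wired in via `default_args={"on_failure_callback": log_failure_context}` on a DAG to see the real keys for a given deployment.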

Solving the problem

Perhaps there is better documentation that I missed, otherwise I believe this is a documentation issue that would be relatively high on my priority list.

If documentation does exist, I think it should be made easier to find. If not, I think a documentation chapter on the operator context and how to use it (preferably with some concrete examples of usage) would be a huge gain for the project, IMO.

Anything else

No response

Are you willing to submit PR?


boring-cyborg[bot] commented 2 years ago

Thanks for opening your first issue here! Be sure to follow the issue template!

potiuk commented 2 years ago

I think there are a few good points here. But it is difficult to know whether the documentation is enough. You can always say something should be "better described" or "clearer" or have "better examples", but that is hard for the people who maintain it to judge, so I am marking this as a "good first issue" for someone who would like to pick up the task and is a user.

And in this context, I think you should attempt to provide a PR updating it. Airflow is created by > 2200 contributors, mostly people like you who miss something and then add it to the code, the documentation, or both. And I think you are one of the best people to know how and where people like you would be searching for such information.

I am happy to guide you through the process @zazaho. Create a PR, and I and others will review it and help you make your first contribution. This is also a great opportunity to actually pay back for the free software you use and join the people who, mostly in their free time, contribute to Airflow.

You can go to https://airflow.apache.org/docs/apache-airflow/stable/templates-ref.html and click the "Suggest a change on this page" button; this will let you open a PR where you can update the docs. The same goes for other changes.

It is all rather easy: you just need to look at how the other .rst files in our documentation are written, even if you do not know .rst syntax. You can refer to classes such as TaskInstance (you will find in the other .rst files how to refer to classes), you can add explanations, and you can also add some examples (again, see how it is done in the other .rst files).

And you can also find some more information about the Context class here: https://github.com/apache/airflow/blob/main/airflow/utils/context.pyi, where the context typing is described as Python type stubs; it is used for autocompletion when you use the context elsewhere in your code.
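As a sketch of how those stubs help in practice (assuming Airflow 2.x; the `TYPE_CHECKING` guard keeps the snippet runnable without Airflow installed, and `inspect_context` is a hypothetical helper, not an Airflow API):

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # context.pyi provides the typing information an IDE uses for autocompletion
    from airflow.utils.context import Context


def inspect_context(context: "Context") -> dict:
    """Pull a few typed fields out of the task context."""
    ti = context["ti"]  # TaskInstance; same object as context["task_instance"]
    return {
        "run_id": context["run_id"],
        "try_number": ti.try_number,
    }
```

Annotating the parameter with `Context` is what makes an IDE such as PyCharm suggest the available keys and the attributes of objects like `ti`.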

From your description I think you could add:

Adding such documentation is also a great opportunity to learn. You might find other places where things are documented (for example, ds etc. are likely described in the Scheduling section) and bring the explanations from there into the reference.

Looking forward to such a contribution (and even if not you, then maybe someone else will be able to pick this up and make nice PRs clarifying all of this).

zazaho commented 2 years ago

Thank you for your very prompt reply. I understand your suggestion to contribute rather than just request, and I have contributed to open source projects in the manner you suggest. However, in the current situation I am a bit at a loss. My request for documentation is prompted by the fact that I do not know how to use the context dictionary to accomplish certain tasks. It is not clear to me what examples I could provide given my lack of this knowledge :).

I will explore the autocomplete functionality of PyCharm to see if this helps me out. If things are getting clearer for me I will contribute as you suggest via a PR.

If you have knowledge of any working examples of the following situations, I would be very happy to see them and integrate them into the PR:

- The task failed because the worker pod did not start in time
- The task failed because the k8s cluster failed to auto-scale
- The task failed because an out-of-memory situation occurred (OOMKiller)
- The task failed because a running pod was stopped on a preemptible node

potiuk commented 2 years ago

Then you are asking for something completely different. The context does not contain this information: it is the execution context of the task, not the context of how the runner was killed. As for all those scenarios:

We do not have this information, as it is a purely deployment-specific thing and you will not see it in task callbacks. Airflow task logic should be independent of the underlying executor and runner; it should not matter whether you run KubernetesExecutor, CeleryExecutor, CeleryKubernetesExecutor, or LocalExecutor.

What you are talking about is very Kubernetes-specific behaviour, including the fact that you might have preemptible node failures. We do not expose those to tasks. The only information about why a task failed is placed in the logs, and tasks cannot really react differently in those cases.

If you want to implement such custom behaviour, what you need to do is extend KubernetesExecutor and create your own executor that analyses the status returned by your K8S cluster in either the "run_pod_async" or "monitor_pod" methods and raises different exceptions to signal what needs to be done. For example, your code could raise AirflowException if you want the task to retry on a certain failure type, or AirflowFailException if you don't. And you can write any other custom code you want in your custom variant of KubernetesExecutor.
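To illustrate the shape of that idea, here is a sketch only: the two exception classes below are stand-ins mirroring the hierarchy in `airflow.exceptions` (where AirflowFailException subclasses AirflowException), and `raise_for_pod_status` is a hypothetical helper, not a real executor method.

```python
# Stand-ins for airflow.exceptions; in Airflow, raising AirflowException lets
# the task be retried (up to its retry limit), while AirflowFailException
# fails the task immediately without retries.
class AirflowException(Exception):
    pass

class AirflowFailException(AirflowException):
    pass


def raise_for_pod_status(reason: str) -> None:
    """Hypothetical helper: translate a pod failure reason into an exception."""
    if reason == "OOMKilled":
        # Out-of-memory is unlikely to succeed on retry with the same resources
        raise AirflowFailException("Pod was OOMKilled; failing without retry")
    if reason in ("Preempted", "Evicted"):
        # Transient cluster conditions: let Airflow retry the task
        raise AirflowException(f"Pod {reason.lower()}; task will be retried")
```

A custom executor could call logic like this from the point where it learns the pod's terminal status, so that different Kubernetes failures produce different scheduling outcomes.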

potiuk commented 2 years ago

Converting this into a discussion, as it is really a different thing entirely (the context is not really the place to keep deployment-specific information about failures).