apache / airflow

Apache Airflow - A platform to programmatically author, schedule, and monitor workflows
https://airflow.apache.org/
Apache License 2.0
37k stars 14.27k forks

Make Airflow error messages more specific, clear and actionable #43171

Open omkar-foss opened 2 weeks ago

omkar-foss commented 2 weeks ago

Description

According to user feedback in the Airflow Debugging Survey 2024, around 41.7% of respondents do not consider error messages actionable. Overall feedback also suggests that users find some error messages vague and confusing.

Use case/motivation

Goals for this issue are the following:

Related issues

Parent Issue: https://github.com/apache/airflow/issues/40975

hterik commented 2 weeks ago

I can recommend this guide from Google about writing good error messages: https://developers.google.com/tech-writing/error-messages. The rest of the courses in that book are also really good btw.

an error like Celery command failed on host can be transformed or displayed with something like "Please check your DAG processor timeout variable for this".

Actionable errors are good, but they have to be written very carefully: misleading advice sends users chasing the wrong rabbit hole. For example, this log in standard_task_runner.py is most of the time not caused by running out of memory: "Job %s was killed before it finished (likely due to running out of memory)". I've seen our engineers chase memory issues in vain countless times because of that message. (yes we should have filed a PR 😄)

potiuk commented 1 week ago

but they have to be written very carefully: misleading advice sends users chasing the wrong rabbit hole. For example, this log in standard_task_runner.py is most of the time not caused by running out of memory: "Job %s was killed before it finished (likely due to running out of memory)". I've seen our engineers chase memory issues in vain countless times because of that message.

I am a big fan of "always tell the user what action from their side the error implies". I agree that things can be misleading. Regarding the case you mentioned, I cannot find it now (I think I discussed it in the past), but for such complicated errors with multiple possible root causes I think we should explain what's going on and link to a FAQ page in the Airflow docs that explains the possible reasons. This way, when we find other reasons and more detailed explanations of what could be wrong and how to remediate it, we can always update the docs, and the added information will be useful even for the many past versions of Airflow that people are running.

(yes we should have filed a PR 😄)

Absolutely :)
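
The FAQ-link idea in the comment above could be sketched roughly as follows. This is only an illustration: `FAQ_BASE_URL`, `with_faq_link`, and the `job-killed` anchor are hypothetical names, not part of Airflow, and the URL is a placeholder rather than a real docs page.

```python
# Hypothetical placeholder; a real implementation would point at a docs page
# that can be updated independently of the installed Airflow version.
FAQ_BASE_URL = "https://example.com/airflow-faq"


def with_faq_link(message: str, anchor: str) -> str:
    """Append a pointer to a FAQ entry instead of guessing a single root cause."""
    return f"{message} See {FAQ_BASE_URL}#{anchor} for possible causes and remediation."


# The log line states what happened and defers the "why" to the docs.
print(with_faq_link("Job 10 was killed before it finished.", "job-killed"))
```

Keeping the guess about the cause out of the message itself means a wrong guess can be corrected in the docs without a new release.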

omkar-foss commented 6 days ago

I have a suggestion for multi-possible-root-cause issues: we can print an Airflow error code with the error message, e.g. "AERR055: Job 10 was killed before it finished", and maintain an error code mapping with possible root causes like the following (just examples, not real causes):

| Error Code | Possible Commonly Observed Causes |
| --- | --- |
| AERR055 | 1) Ran out of memory; 2) Job was stuck and killed after timeout; 3) Job being run on Spot Instance Node (K8S on EKS) |

Since error codes are shareable and easily searchable, they would be useful for team collaboration as well (e.g. instead of saying "I'm looking into the error Job 10 was killed before it finished", I can just say "I'm looking into AERR055"), much like how we use JIRA ticket numbers or GitHub issue/PR numbers.
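
A minimal sketch of such a mapping, assuming a plain in-memory catalog. The names `ErrorInfo`, `ERROR_CATALOG`, `format_error`, and `describe_causes` are hypothetical, and the AERR055 entry just mirrors the illustrative example above; none of this is existing Airflow code.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorInfo:
    """One entry in a hypothetical error-code catalog."""

    code: str
    summary: str  # message template, filled in with runtime context
    possible_causes: tuple[str, ...]


# Illustrative entry only; the causes are examples, not real diagnostics.
ERROR_CATALOG = {
    "AERR055": ErrorInfo(
        code="AERR055",
        summary="Job {job_id} was killed before it finished",
        possible_causes=(
            "Ran out of memory",
            "Job was stuck and killed after timeout",
            "Job being run on Spot Instance Node (K8S on EKS)",
        ),
    ),
}


def format_error(code: str, **context) -> str:
    """Build the log line: the error code followed by the filled-in summary."""
    info = ERROR_CATALOG[code]
    return f"{info.code}: {info.summary.format(**context)}"


def describe_causes(code: str) -> str:
    """List the commonly observed causes recorded for an error code."""
    info = ERROR_CATALOG[code]
    return "\n".join(
        f"{i}) {cause}" for i, cause in enumerate(info.possible_causes, start=1)
    )


print(format_error("AERR055", job_id=10))
print(describe_causes("AERR055"))
```

Keeping the causes in the catalog rather than in the message lets the list grow without changing what gets logged.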

potiuk commented 6 days ago

❤️ this. This is what many other tools are doing already. And being able to classify and list all the different types of errors that the software can generate, together with explaining their cause and remediations - even just list those - is a sign of high maturity of the software.

potiuk commented 6 days ago

I really like it.

We could finally find a use for AirflowException. So far it has mainly been a base class for a number of exceptions, but if we add a mandatory "error id" to AirflowException, make AirflowException abstract, and add handling so that the error ID is displayed in the logs, and maybe also emitted as a metric (counting the errors) and as an event in the OTEL trace when errors happen, it might be a really great mechanism to have and a way to "force" classification of all the errors that we have in Airflow.
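
One way this could look, as a rough sketch only. `CodedError`, `JobKilledError`, `report`, and `ERROR_COUNTS` are hypothetical stand-ins and not the real AirflowException. The counter is a placeholder for a metrics backend, and the OTEL trace event is left out. The sketch uses `__init_subclass__` rather than `abc.ABC`, because abstract-method checks are not enforced when instantiating exception subclasses.

```python
import logging
from collections import Counter

log = logging.getLogger("coded_errors")
ERROR_COUNTS: Counter = Counter()  # stand-in for a real metrics backend


class CodedError(Exception):
    """Hypothetical base class that forces every subclass to declare an error id."""

    error_id = ""

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        # Each subclass must set its own non-empty error_id in its class body.
        if not cls.__dict__.get("error_id"):
            raise TypeError(f"{cls.__name__} must define a non-empty error_id")

    def __init__(self, message: str):
        # The base class is treated as abstract: only coded subclasses may be raised.
        if type(self) is CodedError:
            raise TypeError("CodedError is a base class; raise a coded subclass")
        super().__init__(f"{self.error_id}: {message}")


class JobKilledError(CodedError):
    error_id = "AERR055"  # illustrative code taken from the example above


def report(exc: CodedError) -> None:
    """Show the coded message in the logs and count it per error id."""
    ERROR_COUNTS[exc.error_id] += 1
    log.error("%s", exc)


try:
    raise JobKilledError("Job 10 was killed before it finished")
except CodedError as exc:
    report(exc)
```

With this shape, an exception class without an error id fails at class-definition time, which is what would "force" classification as new errors are added.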