Open omkar-foss opened 2 weeks ago
I can recommend this guide from Google about writing good error messages: https://developers.google.com/tech-writing/error-messages. The rest of the courses in that book are also really good btw.
> an error like `Celery command failed on host` can be transformed or displayed with something like "Please check your DAG processor timeout variable for this".
Actionable errors are good, but they have to be written very carefully, because misleading advice will send users down the wrong rabbit hole. For example, this log in `standard_task_runner.py` is most of the time not due to memory running out: `"Job %s was killed before it finished (likely due to running out of memory)"`. I've seen our engineers chase memory issues in vain countless times because of that message. (yes we should have filed a PR :smile:)
> but has to be done very carefully, because if it gives misleading advice it will lead users down chasing the wrong rabbit hole. For example this log in `standard_task_runner.py` is most of the time not due to memory running out: `"Job %s was killed before it finished (likely due to running out of memory)"`. I've seen our engineers chasing memory issues in vain countless of times because of that message.
I am a big fan of "always tell the user what action from their side the error implies". I agree that things can be misleading. Regarding the case you mentioned: I cannot find it now (I think I discussed it in the past), but I think that for such complicated errors with multiple possible root causes we should explain what's going on and link to a FAQ page in the Airflow docs explaining the possible reasons. This way, when we find other reasons and more detailed explanations of what could be wrong and how to remediate it, we can always update the docs and add information that will be useful for the many past versions of Airflow that people still run.
> (yes we should have filed a PR 😄)
Absolutely :)
I have a suggestion for issues with multiple possible root causes: we can print an Airflow error code with the error message, e.g. `AERR055: Job 10 was killed before it finished`, and keep an error code mapping with possible root causes like this (just examples, not real causes):
| Error Code | Possible Commonly Observed Causes |
|---|---|
| AERR055 | 1) Ran out of memory<br>2) Job was stuck and killed after timeout<br>3) Job being run on Spot Instance Node (K8S on EKS) |
Since error codes are shareable and easily searchable, they would be useful for team collaboration as well (e.g. instead of saying "I'm looking into the error `Job 10 was killed before it finished`", I can just say "I'm looking into AERR055"). Much like how we use JIRA ticket numbers or GitHub issue/PR numbers.
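To make the idea concrete, here is a minimal sketch of what such a mapping could look like in Python. Everything in it is hypothetical: `ErrorInfo`, `ERROR_CATALOGUE`, `format_error` and `describe` are names made up for illustration and do not exist in Airflow, and the causes are the example ones from the mapping above, not real diagnoses.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorInfo:
    """One entry in a hypothetical error-code catalogue (not an Airflow API)."""

    code: str
    summary: str
    possible_causes: tuple[str, ...]


# Hypothetical catalogue; the code and causes are the illustrative ones from this thread.
ERROR_CATALOGUE = {
    "AERR055": ErrorInfo(
        code="AERR055",
        summary="Job was killed before it finished",
        possible_causes=(
            "Ran out of memory",
            "Job was stuck and killed after timeout",
            "Job being run on Spot Instance Node (K8S on EKS)",
        ),
    ),
}


def format_error(code: str, message: str) -> str:
    """Prefix a log message with its error code so it is easy to search for."""
    return f"{code}: {message}"


def describe(code: str) -> str:
    """Render the catalogued possible causes for a code as a numbered list."""
    info = ERROR_CATALOGUE[code]
    causes = "\n".join(
        f"  {i}) {cause}" for i, cause in enumerate(info.possible_causes, start=1)
    )
    return f"{info.code}: {info.summary}\nPossible causes:\n{causes}"
```

The log line would then come from something like `format_error("AERR055", "Job 10 was killed before it finished")`, while `describe("AERR055")` could back a docs page or a CLI lookup. Keeping the causes out of the log message itself means they can be updated later without touching released versions.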
:heart: this. This is what many other tools are doing already. And being able to classify and list all the different types of errors that the software can generate, together with explaining their cause and remediations - even just list those - is a sign of high maturity of the software.
I really like it.
We could finally find a use for `AirflowException`. So far it has mainly been a base class for a number of exceptions, but if we add a mandatory "error id" to `AirflowException`, make `AirflowException` abstract, and add handling so that the error ID is displayed in the logs, and maybe also emitted as a metric (counting the errors) and as an event in the OTEL trace when errors happen, it might be a really great mechanism to have, and a way to "force" classification of all the errors that we have in Airflow.
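A rough sketch of how a mandatory error ID on an abstract base exception could be enforced in plain Python. This is an assumption-laden illustration, not Airflow code: `CodedError` stands in for the proposed abstract `AirflowException`, `JobKilledError` and `ERROR_COUNTS` are made-up names, the in-memory counter stands in for a real metrics backend, and the OTEL event is left out entirely.

```python
from __future__ import annotations

from collections import Counter

# Stand-in for a real metrics backend: counts created errors per error ID.
ERROR_COUNTS: Counter[str] = Counter()


class CodedError(Exception):
    """Hypothetical abstract base: each subclass must declare its own error_id."""

    error_id: str | None = None

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        # Reject subclasses that do not set a non-empty error_id in their own body.
        if not cls.__dict__.get("error_id"):
            raise TypeError(f"{cls.__name__} must define a non-empty error_id")

    def __init__(self, message: str):
        # The base class itself is treated as abstract and cannot be instantiated.
        if type(self) is CodedError:
            raise TypeError("CodedError is abstract; raise a subclass instead")
        # Counted when the exception object is created, not when it is raised.
        ERROR_COUNTS[self.error_id] += 1
        # The error ID is baked into the message, so it shows up in logs.
        super().__init__(f"{self.error_id}: {message}")


class JobKilledError(CodedError):
    """Example subclass using the illustrative code from this thread."""

    error_id = "AERR055"
```

With this shape, defining a subclass without an `error_id` fails when the class is created rather than at runtime, which is one way to "force" the classification. Whether the check should also apply to intermediate base classes is a design question this sketch does not try to answer.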
Description
As per users' feedback in the Airflow Debugging Survey 2024, around 41.7% of respondents don't consider error messages actionable. Overall feedback also suggests that users find some error messages vague and confusing.
Use case/motivation
Goals for this issue are the following:
An error like `Celery command failed on host` can be transformed or displayed with something like "Please check your DAG processor timeout variable for this", so the user has a starting point for debugging.

Related issues
Parent Issue: https://github.com/apache/airflow/issues/40975