Closed Bowrna closed 7 months ago
I would like to submit a PR, but I want to confirm with the committers whether this is an actual bug and whether my understanding is correct.
https://gist.github.com/Bowrna/1994894beea39fa8e1c269591b7f0346#file-airflow_local_settings-py-L120
if task.on_failure_callback.__qualname__ != "add_failure_callback.<locals>.new_callback" and task.on_success_callback.__qualname__ != "task_failure_alert":
Should this be checking task.on_failure_callback.__qualname__ != "task_failure_alert"
instead of task.on_success_callback.__qualname__ != "task_failure_alert",
similar to the condition in add_success_callback?
@Bowrna - hard to say - if you are able to reproduce it, then likely it is a bug. Finding the problem and fixing it might actually prove it.
https://gist.github.com/Bowrna/1994894beea39fa8e1c269591b7f0346#file-airflow_local_settings-py-L120
@tirkarthi Yes, you are right. Thanks for pointing out this mistake. But even after fixing it, the issue is not resolved. I added this check to avoid double-loading the same callback function.
@potiuk If someone could throw light on the point below, it would be useful for me to go and fix this issue:
Once it (the callback) is executed as part of the taskinstance callback, while another time it is executed as part of DagFileProcessorProcess. Is it expected for the callback to be executed from both of these places?
I will go back and debug this further and share more details in this thread. I will also check if the success callback is getting executed via DagFileProcessorProcess too.
Generally speaking - yes, the callback can be executed both by the task run (when the task CAN execute the callback) and by the DagFileProcessor (when the callback cannot be executed in the task - for example when the task was forcefully killed).
That's the theory.
However - this is a distributed system - and there are different modes of execution for such distributed operations: at-most-once, at-least-once and exactly-once.
Explained for example here: https://medium.com/@madhur25/meaning-of-at-least-once-at-most-once-and-exactly-once-delivery-10e477fafe16#:~:text=As%20the%20name%20suggests%2C%20At,exception%2C%20the%20message%20is%20lost.
This needs to be looked at a bit closer, as I think it is not explicitly specified - but it would be great to verify it and document at least how the callbacks are delivered. Exactly-once is surprisingly hard to achieve in distributed systems - and I think it's almost impossible when you consider all kinds of failure cases.
I believe what our callback mechanism tries to achieve is "at-least-once". I am not 100% sure this is achieved in all kinds of situations, and I am not sure exactly what sequence of events leads to duplicates, but generally speaking I think we cannot guarantee that a callback will be executed exactly once.
If this is easily reproducible, then it might be a bug, because under "at-least-once" duplicates should only happen when there are unexpected events. If this is happening in "normal" circumstances - this is a bug. If it is triggered by some error scenarios - not so much, and likely we won't be able to do anything about it - because it would be too complex and costly to try to implement "exactly-once".
@potiuk I will check if this is getting executed twice in "normal" circumstances / complex error scenarios as you mentioned.
My understanding from reading the documentation was that the failure callback has to be executed in all cases.
Generally speaking - yes, the callback can be executed both by the task run (when the task CAN execute the callback) and by the DagFileProcessor (when the callback cannot be executed in the task - for example when the task was forcefully killed).
When the execution fails during the run, or the task is forcefully killed, I thought a failure callback for the task would be invoked, as there is no mention of this in the documentation. I can verify this part and document it.
They are executed - but not in the task, because the task is not there in this case. Callbacks for tasks that were 'killed' are normally executed in the DagFileProcessor. But again - distribution happens, so it is very likely that there are cases where the callbacks are executed in both - the task and DFP - because the DFP 'thinks' the callback was not executed in the task. This is one of the cases where callbacks might be executed more than once. And if the DFP is killed after the callback is executed, so that it cannot record this, it can most likely be executed again. This is the 'at-least-once' semantics.
from airflow.exceptions import AirflowFailException
from airflow.operators.python import PythonOperator

def fail_case():
    raise AirflowFailException('Failure case test for cluster policy')

failure_test = PythonOperator(
    task_id='failure_case',
    python_callable=fail_case,
    # on_failure_callback=on_failure_callback,
)
In this case, this is how I have made the task fail. I added an exception to fail it, and the task callback is invoked. The DagFileProcessor is also invoked in this case.
And if the DFP is killed after the callback is executed, so that it cannot record this, it can most likely be executed again. This is the 'at-least-once' semantics.
I don't understand this point of yours. Could you help me understand it? Do you mean it can be executed more than twice too?
I don't understand this point of yours. Could you help me understand it? Do you mean it can be executed more than twice too?
Potentially yes. Not sure if in our case though.
I see. Then the callback function has to be written in such a way that duplicates are handled.
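For example, under at-least-once delivery the callback itself can be made idempotent by deduplicating on a stable task-instance key. A minimal sketch follows; the in-memory set is illustrative only, since the callback may fire in different processes (task runner vs DagFileProcessor), so a real deployment would need a shared store such as a database table keyed on dag_id/task_id/run_id/try_number. The flat context keys used here are an assumption and would need adjusting to the actual Airflow callback context:

```python
# Sketch of an idempotent failure callback. The context keys used here
# (dag_id, task_id, run_id, try_number) are assumed to be available in the
# dict passed to the callback; adjust to your Airflow version's context.
_seen = set()

def idempotent_failure_alert(context):
    # Build a stable key identifying this task-instance attempt.
    key = (
        context.get("dag_id"),
        context.get("task_id"),
        context.get("run_id"),
        context.get("try_number"),
    )
    if key in _seen:
        return False  # duplicate delivery; skip side effects
    _seen.add(key)
    print("failure alert for", key)  # the real side effect goes here
    return True
```

Returning a boolean makes it easy to see in tests (or logs) whether a given delivery actually performed the side effect.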
The problem with this double callback is that the end_date of the failed task is not reliable: we can't determine which is the right callback for the failure case, and so can't determine the end_date of the failed task.
This is why we introduced the listener API - that should be one more reason to go to 2.5.0. It's really counter-productive to implement workarounds for this in the current version - it is much better to spend the effort on migrating to and using the new API :). That's my honest advice.
This issue has been automatically marked as stale because it has been open for 365 days without any activity. There have been several Airflow releases since the last activity on this issue. Please recheck the report against the latest Airflow version and let us know if the issue is reproducible. The issue will be closed in the next 30 days if no further activity occurs from the issue author.
This issue has been closed because it has not received a response from the issue author.
Apache Airflow version
2.4.2
What happened
I have applied task callbacks for the failure and success cases using a cluster policy for the specific DAG: https://gist.github.com/Bowrna/7fdca8b546fc274edd068ffdae5b76f9
I am attaching the cluster policy that I have applied here: https://gist.github.com/Bowrna/1994894beea39fa8e1c269591b7f0346
On executing the DAG, the success callback is correctly invoked once for every successful task, while the failure callback is invoked twice for a failed task.
What you think should happen instead
Like the success callback, the failure callback should also get executed only once.
How to reproduce
I have attached the sample DAG and the airflow_local_settings.py file in which I added the cluster policy used. Running Airflow with these and executing the DAG (either manually or scheduled) will log the details below.
Here you can see that for the failure case, it is logged twice. Also, both of these failure-case log entries have different durations (I verified whether I was double-logging in code by mistake, but that is not the case).
Operating System
macOS Monterey version 12.2.1
Versions of Apache Airflow Providers
apache-airflow-providers-apache-hive==4.0.1
apache-airflow-providers-cncf-kubernetes==4.4.0
apache-airflow-providers-common-sql==1.2.0
apache-airflow-providers-ftp==3.1.0
apache-airflow-providers-http==4.0.0
apache-airflow-providers-imap==3.0.0
apache-airflow-providers-sqlite==3.2.1
Deployment
Virtualenv installation
Deployment details
I am currently running Airflow in development mode and testing callbacks and the cluster policy.
Anything else
Once it is executed as part of the taskinstance callback, while another time it is executed as part of DagFileProcessorProcess.
I pulled the logs from the specific task that is getting logged twice; this is part of the taskinstance.py flow. I found the below log in the task of the specific dag_run.
Also, I found another log in the airflow scheduler. This also logs the failure case, which gets executed from the processor.py flow of the code.
Are you willing to submit PR?
Code of Conduct