Closed hliu47 closed 1 year ago
Thanks for opening your first issue here! Be sure to follow the issue template!
Since you know what caused it maybe you would like to provide a fix for it @hliu47 ? Sounds like nice contribution back for the free project and you cn ebecome one of the > 2100 contributors to it.
Since you know what caused it maybe you would like to provide a fix for it @hliu47 ? Sounds like nice contribution back for the free project and you cn ebecome one of the > 2100 contributors to it.
@potiuk Thank you so much for replying in this issue. I think there is already a PR in the main branch https://github.com/apache/airflow/pull/21877 that fixed the issue. So is there a timeline for getting that PR into a release?
Interesting - that should get to 2.4.0 release, initial goal is to get it out mid-August, but we releas when things are ready so it depends on testing phase.
That's good to know, thanks!
@hliu47 can you confirm if this is fixed in 2.4.0
This issue has been automatically marked as stale because it has been open for 30 days with no response from the author. It will be closed in next 7 days if no further activity occurs from the issue author.
This issue has been closed because it has not received response from the issue author.
@potiuk Hello, do you know if this has been resolved or not, there was some indication that it was destined for 2.4 but then milestone has increased to 2.5.1 and then the issue closed by github-actions. Many thanks
As anyone else here @ed-sparkes I can only see the same sources as you. If it has been closed without being explicitly mentioned, the jury is not out. It may or may not have been solved. But if you would like to verify and reproduce it and see if it has been solved - feel free and then either comment here with evidences it has not been solved (or better open a new issue with reproducible details).
What is the reason you want to know if it's been fixed? Do you base some business decisions on ? Is it causing some issues for you? Or is it just curiosity @ed-sparkes ?
From what I see we believed it's been fixed but the original author did not confirm it, so closing it was pretty good idea. Now - if you would like to help and get more certainty - you could take over where the author left it and perform the verification - that would be great contribution and giving back for the free software you get :)
This bug is still present on Airflow 2.5.1 by my experience. on_failure_callback
works when the task fails "normally", but not when I mark it as failed manually.
This issue should probably be reopened?
I think if If you can provide reproducible case with logs @seub from the last Airflow version, that would be good new issue. I guess the case is different than explained originally in the issue and having a reproducible case with logs would be helpful
@potiuk It's the same issue, and it's easy to reproduce: just interrupt a task manually (in the UI) and observe that on_failure_callback
is not called. Please try and let me know if it's working for you or not.
@seub - please provide a reproducible case and logs if you would like to report it. It might be something specific for your case, and to be brutally honest it's much easier for a reporter of one issue to provide the evidence rather than maintainers who look at 100s of issues and PRs a day to try to reproduce all issues that someone claims is happening without showing their evidences.
Please be empathetic towards maintainers who have a lot of things to handle. If you report "it does not work for me" for an issue that is already closed, you will nearly universaly be asked to report the issue and provide new evidences in a new issue.
@potiuk This issue was automatically closed by a bot without being solved. I am confirming that it is in fact not solved as of 2.5.1, having the exact same behavior. I'm only reporting it out of courtesy, it doesn't matter to me.
It was closed, because we did not have enough evidences and could not reproduce it. And it's really strange to be sure that it is the same bug. It might look and behave similarly and might be related to your particular deployment - Airflow has about 8 executors, and can run in multiple deployments. What you observe in your deployment might have completely different root cause and can be triggered by a number of factors. You are asking others to guess the environment you have and guess how your DAG look like and how to trigger it.
You forget the point that people here (including maitnainers) try to help and investigate user's issues in their free time usually and you got the software absolutely for free. So what we are asking for, is to spend a little time to help those people to not to spend hours and hours of their personal time to investigate and guess something that you are able to reproduce locally and provide necessary information to cut that time by many hours of trying to guess your setting.
We are just asking for your (little time) to fill in the information so that you could help ot save hours of time of those people who spend their own personal time to try to help users like you. But, maybe it is too much of an ask, I don't know. Hard to say, for sure I would be sensitive and empathetic towards those other humans if I were asked to do it politely. But not everyone is.
We can solve this by running the callback in the signal handler:
diff --git a/airflow/models/taskinstance.py b/airflow/models/taskinstance.py
index c74ff8b6b0..0c361d0d41 100644
--- a/airflow/models/taskinstance.py
+++ b/airflow/models/taskinstance.py
@@ -1528,7 +1528,8 @@ class TaskInstance(Base, LoggingMixin):
return
self.log.error("Received SIGTERM. Terminating subprocesses.")
self.task.on_kill()
- raise AirflowException("Task received SIGTERM signal")
+ if self.task.on_failure_callback:
+ self._run_finished_callback(self.task.on_failure_callback, context, "on_failure")
signal.signal(signal.SIGTERM, signal_handler)
The raised exception is somehow lost and I think the loss is similar to the explanations in this blog: https://anonbadger.wordpress.com/2018/12/15/python-signal-handlers-and-exceptions/
Generally running any serious amount of code inside a signal handler should be avoided -- it can cause lots of odd/hard to diagnose behavoiur. (The fact that we already call task.on_kill()
from in there is probably not good either.) so it might fix it, but I'm worried about unpredictable side-effects with bigger failure callbacks.
The way around that in general terms (not sure it applies here) is to have the signal handler set a flag (self.terminate_requested=True
for example) and then have the "loop" (where ever it is) notice that and run the teardown code.
However I'm not sure if that approach applies here. I'm also not sure which process we are in here, is this the raw/actual task executiuon process, or the supervisor?
However I'm not sure if that approach applies here. I'm also not sure which process we are in here, is this the raw/actual task executiuon process, or the supervisor?
It's the raw task execution process.
The way around that in general terms (not sure it applies here) is to have the signal handler set a flag (self.terminate_requested=True for example) and then have the "loop" (where ever it is) notice that and run the teardown code.
I will see if there's something I can do here
Is an option might be to throw a (new/custom) exception in the signal handler, and then catch it in BaseOperator.execute
and run the on_failure callback there?
Is an option might be to throw a (new/custom) exception in the signal handler, and then catch it in
BaseOperator.execute
and run the on_failure callback there?
It currently raises AirflowException which was supposed to be caught by _run_raw_task
method but it doesn't catch it.
I'm getting the same error on Airflow 2.7.1. I saw that the code change to call the on_failure_callback in this cases is no longer present on version 2.7.1.
As you can see from the screenshots bellow the function was never called, the logs are never printed. Can anyone help me with this? Should I open a new issue?
Apache Airflow version
2.2.5
What happened
on_failure_callback
is not called when task is terminated externally. A similar issue was reported in #14422 and fixed in #15172. However, the code that fixed this was changed in a later PR #16301, after whichtask_instance._run_finished_callback
is no longer called when SIGTERM is received (https://github.com/apache/airflow/pull/16301/files#diff-d80fa918cc75c4d6aa582d5e29eeb812ba21371d6977fde45a4749668b79a515L85).What you think should happen instead
on_failure_callback
should be called when task fails regardless of how the task fails.How to reproduce
DAG file:
While the task instance is running, either force quitting the scheduler or manually updating its state to None in the database will cause the task to get SIGTERM and terminate. In either case, a failure callback will not be called which does not match the behavior of previous versions of Airflow.
The stack trace is attached below and
on_failure_callback
is not called.Operating System
CentOS Linux 7
Deployment
Other Docker-based deployment
Anything else
This is an issue in 2.2.5. However, I notice that it appears to be fixed in the main branch by PR #21877 although it was not intended to fix this issue. Is there a timeline for getting that PR into a release? We are happy to test it out to see if it fixes the issue once it's released.
Are you willing to submit PR?
Code of Conduct