Open Raul824 opened 1 week ago
@potiuk assign me ✨
There is a product decision to make here. I am not convinced the current behavior should be considered a bug.
I think we need to resolve https://github.com/apache/airflow/issues/22006 first
I think those are unrelated. This one seems clearly a bug, because paused DAGs that are not running are shown in Cluster Activity as long running. #22006 is a completely new feature ("draining" running DAGs) that has very little to do with Cluster Activity displaying misleading information (or at least this is how I read it).
Is this accurate?
While a DAG is paused, a task can in fact continue to run. Since pausing doesn't invoke on_kill(), remote jobs continue to run.
There are use cases for leveraging pause as a temporary drain. I have used it several times, and it was great that the metrics reflected it.
My point is that Cluster Activity is the symptom, not the real problem. We need to decide what the right overall behavior of pausing is (which is why I linked to #22006); Cluster Activity will then be fixed accordingly.
I think this is not the case at all in this issue. What I understand is that Cluster Activity shows running time for paused DAGs that are not running any more. This is at least what the description of the issue is about.
@Raul824, maybe you can clarify that?
I consider this to be a symptom of a larger issue (what running + paused should mean), but I get your point. For this specific issue we can just exclude paused runs from the Cluster Activity Top Running view. @xionams would you like to raise a PR for this?
Yeah. I think those two are related but different.
Well, I noticed this while looking at Cluster Activity, but aren't paused DAGs that are stuck in running also consuming resources to keep updating the duration?
Also, as far as I have noticed, pausing the DAG doesn't actually kill a process that has already started.
So I think that if the pause functionality were fixed to reset the DAG's state from running to queued or no state, that would solve these issues.
As for killing already running tasks, that is provided by marking a task as failed. The pause functionality is fine in that it stops the DAG from running further. That works partially, since tasks do not start, but the DAG state also stays stuck on running in these cases, which I think is a bug.
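To make the behavior discussed above concrete, here is a toy model in plain Python. It is not Airflow's scheduler code: `ToyTask`, `ToyDag`, `pause` and `schedule_next` are hypothetical names invented for this sketch, and the only assumption it encodes is the one stated in this thread, namely that pausing stops new tasks from starting but does not call `on_kill()` on tasks that are already running.

```python
from dataclasses import dataclass, field


@dataclass
class ToyTask:
    # Hypothetical stand-in for a task instance; not an Airflow class.
    name: str
    state: str = "none"  # "none", "running", or "killed"
    killed_via_on_kill: bool = False

    def on_kill(self):
        # Stand-in for an operator's cleanup hook (e.g. cancelling a remote job).
        self.killed_via_on_kill = True
        self.state = "killed"


@dataclass
class ToyDag:
    # Hypothetical stand-in for a DAG with a pause flag.
    tasks: list = field(default_factory=list)
    is_paused: bool = False

    def pause(self):
        # Only flips the flag; tasks that are already running are not touched.
        self.is_paused = True

    def schedule_next(self):
        # Start the next not-yet-started task, unless the DAG is paused.
        if self.is_paused:
            return None
        for task in self.tasks:
            if task.state == "none":
                task.state = "running"
                return task
        return None


dag = ToyDag(tasks=[ToyTask("extract"), ToyTask("load")])
started = dag.schedule_next()   # "extract" starts running
dag.pause()
blocked = dag.schedule_next()   # nothing new starts while paused
print(started.state, started.killed_via_on_kill, blocked)
```

In this model the first task stays in `running` after the pause and the second never starts, which matches the "tasks are not starting but the state is stuck on running" observation.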
I'm in the middle of something right now; if there is no rush, I will happily raise a PR.
Raising a PR to fix Cluster Activity (top 5 running DAGs) is a step forward.
Apache Airflow version
Other Airflow 2 version (please specify below)
If "Other Airflow 2 version" selected, which one?
v2.10.1
What happened?
If DAGs are paused after they start running, their duration keeps increasing. Because of this, the Cluster Activity view lists DAGs that were paused and are not even active among the top 5 longest-running DAGs.
What you think should happen instead?
Pausing a DAG should reset the status from running to scheduled, queued, or no state. The Cluster Activity list of longest-running DAGs should exclude paused DAGs.
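The second expectation could be sketched roughly as follows. This is a minimal illustration in plain Python, not the actual Cluster Activity implementation: `RunInfo`, `duration` and `top_running` are hypothetical names, and the sketch assumes that a run without an end date is measured against the current time, which is why its duration keeps growing.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class RunInfo:
    # Hypothetical record for illustration; not an Airflow model.
    dag_id: str
    state: str
    start_date: datetime
    end_date: Optional[datetime]
    dag_is_paused: bool


def duration(run: RunInfo, now: datetime) -> timedelta:
    # A run with no end_date is measured against "now", so it keeps growing.
    return (run.end_date or now) - run.start_date


def top_running(runs, now, limit=5, exclude_paused=True):
    # Keep only running runs, optionally dropping those whose DAG is paused,
    # then return the longest ones first.
    candidates = [
        r for r in runs
        if r.state == "running" and not (exclude_paused and r.dag_is_paused)
    ]
    return sorted(candidates, key=lambda r: duration(r, now), reverse=True)[:limit]


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
runs = [
    RunInfo("paused_dag", "running", now - timedelta(hours=10), None, True),
    RunInfo("active_dag", "running", now - timedelta(hours=1), None, False),
]
print([r.dag_id for r in top_running(runs, now)])
print([r.dag_id for r in top_running(runs, now, exclude_paused=False)])
```

With `exclude_paused=False` the paused DAG ranks first because its open-ended run has the largest duration, which is the behavior reported above.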
How to reproduce
Operating System
AKS
Versions of Apache Airflow Providers
No response
Deployment
Official Apache Airflow Helm Chart
Deployment details
AKS Airflow 2.10.1
Anything else?
No response