apache / airflow

Apache Airflow - A platform to programmatically author, schedule, and monitor workflows
https://airflow.apache.org/
Apache License 2.0

Airflow keeps on increasing duration of dag if it is paused after it went to running state. #44443

Open Raul824 opened 1 week ago

Raul824 commented 1 week ago

Apache Airflow version

Other Airflow 2 version (please specify below)

If "Other Airflow 2 version" selected, which one?

v2.10.1

What happened?

If DAGs are paused after they start running, their duration keeps increasing. Because of this, the Cluster Activity view lists DAGs that were paused and are not even active among the top 5 longest-running DAGs.

What you think should happen instead?

Pausing a DAG should reset its status from running to scheduled, queued, or no state. The longest-running DAGs list in Cluster Activity should exclude paused DAGs.

How to reproduce

  1. Trigger a DAG.
  2. Once it starts running, pause the DAG.
  3. Check the duration: it keeps increasing even though the DAG is paused.
  4. After a few days, check Cluster Activity: the paused DAGs eventually rise to the top as the longest-running DAGs.
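
The growing duration in step 3 is consistent with how a duration is typically derived for a run that has no end date yet. The following is a minimal sketch, assuming the displayed duration is computed as `(end_date or now) - start_date`; `RunRecord` and `displayed_duration` are hypothetical names used for illustration only, not Airflow's API.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class RunRecord:
    """Hypothetical stand-in for a DAG run row; not Airflow's DagRun model."""
    start_date: datetime
    end_date: Optional[datetime] = None  # stays None while the run is "running"


def displayed_duration(run: RunRecord, now: datetime) -> timedelta:
    # Assumption: an unfinished run is measured against the current time.
    end = run.end_date if run.end_date is not None else now
    return end - run.start_date


start = datetime(2024, 11, 1, 12, 0, tzinfo=timezone.utc)
stuck_run = RunRecord(start_date=start)  # paused after it started, never finished

# The same run reports a larger duration every time it is looked at.
after_one_hour = displayed_duration(stuck_run, start + timedelta(hours=1))
after_three_days = displayed_duration(stuck_run, start + timedelta(days=3))
```

Under this assumption, a run that stays in the running state with no end date reports a longer duration on every check, whether or not any task is executing.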

Operating System

AKS

Versions of Apache Airflow Providers

No response

Deployment

Official Apache Airflow Helm Chart

Deployment details

AKS Airflow 2.10.1

Anything else?

No response

xionams commented 6 days ago

@potiuk assign me ✨

eladkal commented 6 days ago

There is a product decision to make here. I am not convinced the current behavior should be considered a bug.

I think we need to resolve https://github.com/apache/airflow/issues/22006 first

potiuk commented 6 days ago

I think those are unrelated. This one clearly seems to be a bug, because paused DAGs that are not running are shown in Cluster Activity as long running. #22006 is a completely new feature ("draining" running DAGs) that has very little to do with Cluster Activity displaying misleading information (or at least this is how I read it).

eladkal commented 6 days ago

> I think those are unrelated. This one clearly seems to be a bug, because paused DAGs that are not running are shown in Cluster Activity as long running.

Is this accurate? While a DAG is paused, its tasks can in fact continue to run. Since pausing doesn't invoke on_kill(), remote jobs continue to run. There are use cases for leveraging pause as a temporary drain. I have used it several times, and it was great that the metrics reflected it.

My point is that Cluster Activity is the symptom, not the real problem. We need to decide what the right overall behavior of pausing is (which is why I linked to #22006); Cluster Activity can then be fixed accordingly.

potiuk commented 6 days ago

> My point is that Cluster Activity is the symptom, not the real problem. We need to decide what the right overall behavior of pausing is (which is why I linked to https://github.com/apache/airflow/issues/22006); Cluster Activity can then be fixed accordingly.

I don't think that is the case in this issue at all. My understanding is that Cluster Activity shows a running time for paused runs that are not running any more. At least, that is what the issue description is about.

potiuk commented 6 days ago

But @Raul824, maybe you can clarify that?

eladkal commented 5 days ago

> I don't think that is the case in this issue at all. My understanding is that Cluster Activity shows a running time for paused runs that are not running any more. At least, that is what the issue description is about.

I consider this to be a symptom of a larger issue, namely what running + paused means, but I get your point. For this specific issue we can just exclude paused runs from the Cluster Activity Top Running view. @xionams, would you like to raise a PR for this?
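
To make the proposed fix concrete, here is a minimal sketch of excluding paused DAGs before ranking running runs by elapsed time. It is only an illustration under stated assumptions: `RunningRun` and `top_running` are hypothetical names, the `is_paused` field stands in for the per-DAG paused flag, and the real view would filter in its database query rather than in Python.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import List


@dataclass
class RunningRun:
    """Hypothetical row: a running DAG run joined with its DAG's paused flag."""
    dag_id: str
    start_date: datetime
    is_paused: bool


def top_running(runs: List[RunningRun], now: datetime, limit: int = 5) -> List[str]:
    # Drop runs whose DAG is paused, then rank the rest by elapsed time.
    active = [r for r in runs if not r.is_paused]
    active.sort(key=lambda r: now - r.start_date, reverse=True)
    return [r.dag_id for r in active[:limit]]


now = datetime(2024, 11, 10, tzinfo=timezone.utc)
runs = [
    RunningRun("paused_days_ago", now - timedelta(days=7), is_paused=True),
    RunningRun("long_etl", now - timedelta(hours=6), is_paused=False),
    RunningRun("short_job", now - timedelta(minutes=5), is_paused=False),
]
result = top_running(runs, now)
```

In this sketch the run paused a week ago no longer outranks the runs that are actually active. Whether paused runs should be hidden entirely or shown separately is the product question discussed in this thread.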

potiuk commented 5 days ago

> I consider this to be a symptom of a larger issue, namely what running + paused means, but I get your point. For this specific issue we can just exclude paused runs from the Cluster Activity Top Running view. @xionams, would you like to raise a PR for this?

Yeah. I think those two are related but different.

Raul824 commented 5 days ago

Well, I noticed this while looking at Cluster Activity, but aren't paused DAGs that went into the running state also consuming resources to keep updating the duration?

I have also noticed that pausing the DAG doesn't actually kill a process that has already started.

So I think that if the pause functionality were fixed to reset the DAG's status from running to queued or no state, that would solve these issues.

As for killing already running tasks, that is already possible by marking a task as failed. The pause functionality itself is fine in that it stops the DAG from running further. But it only works partially: tasks do not start, yet the DAG state stays stuck on running in these cases, which I think is a bug.

xionams commented 1 day ago

> > I don't think that is the case in this issue at all. My understanding is that Cluster Activity shows a running time for paused runs that are not running any more. At least, that is what the issue description is about.

> I consider this to be a symptom of a larger issue, namely what running + paused means, but I get your point. For this specific issue we can just exclude paused runs from the Cluster Activity Top Running view. @xionams, would you like to raise a PR for this?

I'm in the middle of something right now; if there is no rush, I will happily raise a PR.

eladkal commented 1 day ago

> I'm in the middle of something right now; if there is no rush, I will happily raise a PR.

Raising a PR to fix Cluster Activity (top 5 running DAGs) is a step forward.