apache / airflow

Apache Airflow - A platform to programmatically author, schedule, and monitor workflows
https://airflow.apache.org/
Apache License 2.0
37.12k stars 14.31k forks source link

post_execute() method does not work correctly when used along with a trigger #41707

Open venkatajagannath opened 2 months ago

venkatajagannath commented 2 months ago

Apache Airflow version

2.10.0

If "Other Airflow 2 version" selected, which one?

No response

What happened?

I have a new provider to send jobs to a Ray cluster. One of the operators has pre_execute and post_execute operations. Here is the complete flow of control

  1. pre_execute() --> Sets up a Ray cluster
  2. execute() --> Triggers a new job on the cluster and defers tracking to a trigger
  3. async trigger --> tracks the job execution and prints logs
  4. execute_complete() --> completes main task execution after the trigger reaches terminal state
  5. post_execute() --> Deletes the Ray cluster

When post_execute() starting executing it seems to be running pre_execute() code again. This is a bug. We should only see post_execute code. Logs attached.

post_execute_bug.txt

What you think should happen instead?

Only code relevant to the post_execute method must execute.

How to reproduce

  1. Install the provider using the instructions mentioned on this github page
  2. Run one of the example DAGs that only uses the SubmitRayJob operator in the DAG

You will need access to a k8 cluster and will need to provide connection details as shown in step 1

Operating System

linux

Versions of Apache Airflow Providers

No response

Deployment

Astronomer

Deployment details

No response

Anything else?

No response

Are you willing to submit PR?

Code of Conduct

boring-cyborg[bot] commented 2 months ago

Thanks for opening your first issue here! Be sure to follow the issue template! If you are willing to raise PR to address this issue please do so, no need to wait for approval.

uranusjr commented 2 months ago

(Copying my message from elsewhere with minor edits for context)

It’s better to use setup and teardown because they handle errors better.

For the reported issue specifically, pre_execute is actually executed every time a worker is started for the task, not only when execute is called, despite the name. (This is also true for on_execution_callback, by the way.) Resuming a deferred task starts a worker and causes pre_execute to be called.

This is kind of by design, but I guess there’s a case to be made it’s not a good design. If we are to change this, I would recommend:

  1. Rename pre_execute and on_execution_callback to remove execute from the name to avoid misunderstandings. Document them well to describe the actual behaviour.
  2. (Maybe?) Add a hook that’s actually only run when the first worker is started for a ti.
  3. (Maybe??) Add hooks that are run when a task is deferred and resumed?
RNHTTR commented 2 months ago

@uranusjr Does it make sense to rename this issue to "Rename pre_execute and on_execution_callback to remove execute from the name to avoid misunderstandings"?

venkatajagannath commented 2 months ago

@uranusjr

pre_execute() and post_execute() naming convention is very intuitive. Would it be possible to change the backend design instead?

I've used setup and teardown. Example here. But, my current usecase is slightly different.

Its a much better UX to use just one decorator to spin up/down a cluster and also execute the main job. See example here. We also hear the same feedback from customers.

One idea:

(Maybe?) we can extend setup/teardown to configure a method within a custom operator as a setup and another as a teardown. These methods will execute similar to setup/teardown in a DAG.

Lee-W commented 2 months ago

@uranusjr Does it make sense to rename this issue to "Rename pre_execute and on_execution_callback to remove execute from the name to avoid misunderstandings"?

I'm not sure whether we already decide to change it this way. Personally, I prefer 2 and 3. 1 might not help much for one who doesn't understand airflow in depth. Providing 2 and 3 will give the user a sense that these things are different

uranusjr commented 2 months ago

I guess the question is, is a hook that runs every time a worker starts useful for people? We need something for that if we change the semantic of pre_execute (and on_execution_hook).

Lee-W commented 2 months ago

I guess the question is, is a hook that runs every time a worker starts useful for people? We need something for that if we change the semantic of pre_execute (and on_execution_hook).

For operators works like S3KeySensor, I think it's possible 🤔