Handling of the DagRun callback log should be visible on the UI somewhere

wolfier commented 1 year ago

Description

When a DagRun handles its callbacks, the logs are only available in the DAGFileProcessorProcess log location. I want to be able to view these logs in the Airflow UI or at least a subset of the logs generated by the callbacks.

In addition, I would like a surefire way to confirm on the Airflow UI that the scheduler sent the DagCallbackRequest and the DAGFileProcessorProcess handled the callback requqest.

Use case/motivation

I had written a DAG with on_success_callback set and I noticed that certain dagruns did not generate the corresponding on_success_callback in the DAGFileProcessorProcess logs.

Related issues

No response

Are you willing to submit a PR?

[ ] Yes I am willing to submit a PR!

Code of Conduct

[X] I agree to follow this project's Code of Conduct

potiuk commented 1 year ago

Sure. Would be interesting feature, for sure. Hopefuly someone will pick it up some day, but this is rather complex'is new feature that will require quite an investment into architectural issues, so honestly speaking I would not have high hopes. But let's see maybe someone would like to implement it.

Adaverse commented 1 year ago

I would like to take this up.

potiuk commented 1 year ago

assigned

Adaverse commented 1 year ago

Did a small PoC on this, and it seems to work.

Created a new handler that saves only callback logs (in a directory structure like logs/callback/dag_run_id/latest_callback_run or we can have logs/callback/dag_run_id/<attempt_number> ) to a separate path from we can serve to the UI. So, the change apart from the handler part is that we direct all the logs from callbacks to the above handler and propagate them to StreamLogWriter.

https://github.com/apache/airflow/blob/09c5114f67f152c44b555ee408d76b26cfe0691f/airflow/models/dag.py#L1382-L1411

So the above changes to -

    def handle_callback(self, dagrun, file_path: str,success=True, reason=None, session=NEW_SESSION):

        ############ initiate logger and set context ############
        callback_logger = logging.getLogger("airflow.callback")
        set_context(callback_logger, file_path)
        ############ initiate logger and set context ############

        callbacks = self.on_success_callback if success else self.on_failure_callback
        if callbacks:
            callbacks = callbacks if isinstance(callbacks, list) else [callbacks]
            tis = dagrun.get_task_instances(session=session)
            ti = tis[-1]  # get first TaskInstance of DagRun
            ti.task = self.get_task(ti.task_id)
            context = ti.get_template_context(session=session)
            context.update({"reason": reason})
            for callback in callbacks:

                ############ Redirect callback logs ############
                with redirect_stdout(StreamLogWriter(callback_logger, logging.INFO)), redirect_stderr(
                    StreamLogWriter(callback_logger, logging.WARN)
                ):
                    try:
                        callback(context)
                    except Exception:
                        self.log.exception("failed to invoke dag state update callback")
                        Stats.incr("dag.callback_exceptions", tags={"dag_id": dagrun.dag_id})

So, since we get the logs of callbacks separately, we can use the same logic to serve it to the UI.

Wanted to get your thoughts on this @potiuk

potiuk commented 1 year ago

Sure - go for it @Adaverse . I think even if for some reason we won't want to merge it quickly, I think it is a good opportunity to do some second-pair-of-eyes-and hands in that part. Recently @dstandish added triggerer logs interleaving with regular logs, and while this is a bit different, it touches all the same parts, and having somene else to try to implement it with Daniel's keen eye overlooking it has a great potential of:

a) improving some parts of it (there are some convoluted conditions that maybe could be improved when they get a set of fresh eyes and hands put on them b) having one more active person to know more details about some intrinsic details of log serving (it ain't easy).

WDYT @dstandish ?

Adaverse commented 1 year ago

Sure @potiuk!

So given that Dag processor can run in standalone mode, these are the below steps I could think of to implement the feature -

we would need a hostname field in DAG as well just like TI.
Serve logs when running in standalone mode. --skip-serve-logs in airflow dag-processor command. Its already being served when dag processor is run by scheduler
Create a handler that would take DagRun as context to extract the relevant information
Modify log_reader which currently is coupled with reading logs of TIs
Expose the above through REST api
Create UI to serve the logs

One side note -

A specific DagRun may run many times, so we need to keep track of how many times it ran and need to show logs of each attempt of the callback. Currently, are we tracking how many times a dag run ran?

Do let me know if I missed anything. It will be highly appreciated.

potiuk commented 1 year ago

I am not sure if I can assess it by looking at the details. but yes is it sounds about right, though it might terribly complicate "serve log" handing (it is already terribly complicated as @dstandish might attest). Each of those steps (for example where to display it in the UI is a separate decision on it's own - like how to make it UX-friendly currently, should we surface it somewhere ?

It seems like awful lot of complexity for very little value, I think we should answer a question - why would someone want to see the callbacllogs in the UI? When you surface task logs, then yeah - it makes sense to see what's going on, but seeing callback logs is marginally useful, I think. But maybe it is worth it when it is paired with another change - related to your question:

A specific DagRun may run many times, so we need to keep track of how many times it ran and need to show logs of each attempt of the callback. Currently, are we tracking how many times a dag run ran?

I believe even today we have a problem tha we do not store hostname of the previous run attempts - only the last one - and it causes problems with retrieving the logs from previious runs if there is no remote logging configured and no log persistence. Maybe that is an opportunity to fix it as part of this issue.

I still would love to hear from @dstandish what he thinks - he has been recently "knee-deep" or rather "waist-deep" in the logging part, and knows a bit of the complexities involved, so maybe the judgement of Daniel on whether it's worth to bring in the callback logs might help with the decisions.

VladimirYushkevich commented 7 months ago

Do we have any news about it?

apache / airflow