apache / airflow

Apache Airflow - A platform to programmatically author, schedule, and monitor workflows
https://airflow.apache.org/
Apache License 2.0

Last automated data interval is always available in custom timetable #27672

Open mrn-aglic opened 1 year ago

mrn-aglic commented 1 year ago

Apache Airflow version

Other Airflow 2 version (please specify below)

What happened

I'm writing an example custom timetable and have implemented next_dagrun_info. According to the docs and examples, the parameter last_automated_data_interval should be None if there are no previous runs.

However, when I start up the example:

  1. I can confirm that the dag_run table is empty.
  2. When starting (unpausing) the DAG for the first time, last_automated_data_interval is a data interval and not None as specified by the documentation.

This raises the question of how to determine the first DAG run (one could probably compare the DataInterval start against the DAG's start_date, if that is possible).

Here is an example from the logs:

airflow-feat-scheduler | [2022-11-14 19:57:58,934] {WorkDayTimetable.py:28} INFO - last_automated_data_interval: DataInterval(start=DateTime(2022, 11, 10, 0, 0, 0, tzinfo=Timezone('UTC')), end=DateTime(2022, 11, 11, 0, 0, 0, tzinfo=Timezone('UTC')))
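For reference, a minimal sketch of the kind of timetable involved (loosely following the AfterWorkdayTimetable example from the Airflow timetables docs; the class name and log line are illustrative, and weekend skipping plus catchup handling are omitted for brevity):

```python
from __future__ import annotations

import logging
from datetime import timedelta

from pendulum import UTC, DateTime, Time

from airflow.timetables.base import DagRunInfo, DataInterval, TimeRestriction, Timetable

log = logging.getLogger(__name__)


class WorkDayTimetable(Timetable):
    """Simplified daily timetable; weekend and catchup handling omitted."""

    def infer_manual_data_interval(self, *, run_after: DateTime) -> DataInterval:
        # A manually triggered run covers the day before the trigger time.
        start = DateTime.combine(run_after.date() - timedelta(days=1), Time.min).replace(tzinfo=UTC)
        return DataInterval(start=start, end=start + timedelta(days=1))

    def next_dagrun_info(
        self,
        *,
        last_automated_data_interval: DataInterval | None,
        restriction: TimeRestriction,
    ) -> DagRunInfo | None:
        # Per the docs, this should be None when no automated run exists yet.
        log.info("last_automated_data_interval: %s", last_automated_data_interval)
        if last_automated_data_interval is None:
            if restriction.earliest is None:
                return None  # no start_date, so never schedule
            next_start = DateTime.combine(restriction.earliest.date(), Time.min).replace(tzinfo=UTC)
        else:
            next_start = last_automated_data_interval.end
        return DagRunInfo.interval(start=next_start, end=next_start + timedelta(days=1))
```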

I'm using Airflow 2.4.2.

What you think should happen instead

The value of the parameter should be None as specified in the docs.

How to reproduce

This should be reproducible by running the example given in the docs and logging the value of the last_automated_data_interval parameter; the value should appear in the logs.

Operating System

macOS Ventura

Versions of Apache Airflow Providers

No response

Deployment

Docker-Compose

Deployment details

No response

Anything else

The problem occurs every time.

Are you willing to submit PR?

Code of Conduct

boring-cyborg[bot] commented 1 year ago

Thanks for opening your first issue here! Be sure to follow the issue template!

uranusjr commented 1 year ago

The first ever run of a DAG is generally calculated when the DAG is detected and pushed into the system by the DAG parser, before a DAG run is ever created. You should be able to see that these fields in the dag table are already set at this point: next_dagrun, next_dagrun_data_interval_start, and next_dagrun_data_interval_end. The log you see is the scheduler retrieving values for the second run (and storing them in the dag table), after the first run is created from the aforementioned fields.
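To verify this without creating any runs, a quick sketch (assuming a Python shell inside the Airflow environment with access to the metadata DB; "my_dag_id" is a placeholder):

```python
from airflow.models import DagModel
from airflow.utils.session import create_session

# The DAG parser fills these fields in before the first DAG run is created.
with create_session() as session:
    dag_model = session.query(DagModel).filter(DagModel.dag_id == "my_dag_id").one()
    print(dag_model.next_dagrun)
    print(dag_model.next_dagrun_data_interval_start)
    print(dag_model.next_dagrun_data_interval_end)
```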

mrn-aglic commented 1 year ago

@uranusjr OK, thanks for the clarification. This could be useful to add to the docs. While we're on the topic, can we see the logs for the first DAG run calculation anywhere?

uranusjr commented 1 year ago

can we see the logs for the first DAG run calculation anywhere?

It kind of depends. I believe the DAG parser is part of the scheduler process, so it should be somewhere in the scheduler logs. But if you add the DAG very early, it might be buried somewhere in Airflow startup and become invisible (due to log config). Honestly, there are many parts of how Airflow logs things that are unclear to me as well.

This could be useful to add to the docs.

An entry under Concepts that goes through how a DAG run is created would likely be a good idea. Would you be interested in helping out with that?

mrn-aglic commented 1 year ago

Yeah, sure. I could probably do it in the next couple of days.

vijayasarathib commented 1 year ago

@mrn-aglic and @eladkal - let me know if you want me to help here and close this.