apache / airflow

Apache Airflow - A platform to programmatically author, schedule, and monitor workflows
https://airflow.apache.org/
Apache License 2.0

Cron schedule and Time Zones leads to incorrect intervals #29037

Closed schwartzpub closed 1 year ago

schwartzpub commented 1 year ago

Apache Airflow version

2.5.0

What happened

When using cron syntax for the DAG schedule, the scheduler does not run DAGs at the correct time for my time zone. For instance, a DAG that should run at 6:00am runs at 12:00am, as though the scheduler believes the system time is set to UTC.

airflow.cfg

    default_timezone = America/Chicago

Ubuntu 20.04 LTS

user@host:~$ sudo timedatectl
               Local time: Thu 2023-01-19 06:44:37 CST
           Universal time: Thu 2023-01-19 12:44:37 UTC
                 RTC time: Thu 2023-01-19 12:44:36
                Time zone: America/Chicago (CST, -0600)
System clock synchronized: yes
              NTP service: active
          RTC in local TZ: no

dag.py

...
    schedule="*/15 6-17 * * 1-5",
    start_date=pendulum.datetime(2023,1,17,tz='America/Chicago'),
    catchup=True,
...

When checking the DAG details in the UI, I see the following, which leads me to believe something is converting my schedule to UTC when the DAG is imported:

next_dagrun_data_interval = DataInterval(start=DateTime(2023, 1, 19, 6, 45, 0, tzinfo=Timezone('UTC')), end=DateTime(2023, 1, 19, 7, 0, 0, tzinfo=Timezone('UTC')))
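To read the reported interval in local time, its UTC boundaries can be converted with the standard library. This is a minimal sketch that uses only `zoneinfo` (Python 3.9+), not Airflow itself; the two timestamps are copied from the `next_dagrun_data_interval` value reported here, and the variable names are illustrative.

```python
from datetime import datetime
from zoneinfo import ZoneInfo

UTC = ZoneInfo("UTC")
CHICAGO = ZoneInfo("America/Chicago")

# Interval boundaries copied from the reported next_dagrun_data_interval.
start_utc = datetime(2023, 1, 19, 6, 45, tzinfo=UTC)
end_utc = datetime(2023, 1, 19, 7, 0, tzinfo=UTC)

# Convert the same instants to America/Chicago wall-clock time.
start_local = start_utc.astimezone(CHICAGO)
end_local = end_utc.astimezone(CHICAGO)

print(start_local.isoformat())  # 2023-01-19T00:45:00-06:00
print(end_local.isoformat())    # 2023-01-19T01:00:00-06:00
```

An interval of 00:45 to 01:00 local time lies outside the intended 6-17 window, which matches the symptom described, although the conversion alone does not show where the offset is introduced.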

What you think should happen instead

The DAG should run every 15 minutes between 6a and 6p CST, respecting the system, Airflow, and DAG time zone configuration.

How to reproduce

No response

Operating System

Ubuntu 20.04 LTS

Versions of Apache Airflow Providers

apache-airflow-providers-celery==3.1.0
apache-airflow-providers-common-sql==1.3.2
apache-airflow-providers-ftp==3.3.0
apache-airflow-providers-http==4.1.0
apache-airflow-providers-imap==3.1.1
apache-airflow-providers-microsoft-mssql==3.3.2
apache-airflow-providers-sqlite==3.3.1

Deployment

Other

Deployment details

Manual install of apache-airflow using pip.

Anything else

No response

schwartzpub commented 1 year ago

Looking through the documentation, I see that it is recommended to keep the Airflow config at UTC. I set it back to UTC and reimported the DAG, whose start_date is TZ-aware: pendulum.datetime(2023,1,19,6,tz="America/Chicago"). The DAG is still confused about when it should run. With the UI set to CST (-06:00), it shows the Next Run as 6 hours ago, and runs are not scheduled according to the cron syntax provided. Tasks still run 6 hours later than expected; for example, the 6:15a run happens at 12:15p CST.

notatallshaw-gts commented 1 year ago

Be aware that the "Last Run" and "Next Run" are for Airflow's logical_date/execution_date which can be a little confusing to interpret at times: https://airflow.apache.org/docs/apache-airflow/stable/faq.html#what-does-execution-date-mean

They are not the expected wall time for the DAG to kick off.
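The difference between the logical date and the wall-clock start can be sketched with plain `datetime` arithmetic. The snippet assumes the default data-interval behaviour for cron schedules, where a run is labelled with the start of its interval and is triggered only once the interval has ended. The helper `first_intervals` is hypothetical and hard-codes the `*/15 6-17` schedule from this issue; it does not parse cron expressions or call Airflow.

```python
from datetime import date, datetime, timedelta
from zoneinfo import ZoneInfo

CHICAGO = ZoneInfo("America/Chicago")
STEP = timedelta(minutes=15)  # spacing of the "*/15" minute field


def first_intervals(day, count=3):
    """Return (logical_date, earliest_trigger) pairs for the first slots of a day.

    Hypothetical helper: the 06:00 start and 15-minute step are hard-coded
    for this one schedule.
    """
    start = datetime(day.year, day.month, day.day, 6, 0, tzinfo=CHICAGO)
    return [(start + i * STEP, start + (i + 1) * STEP) for i in range(count)]


runs = first_intervals(date(2023, 1, 19))
for logical_date, trigger in runs:
    print(f"logical_date={logical_date:%H:%M %Z}  runs no earlier than {trigger:%H:%M %Z}")
```

Under that assumption, the run shown with a logical date of 06:00 would start at about 06:15 local time, so the UI columns trail the actual start by one interval.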

FYI I use a non-UTC timezone for default config, UI, and DAG start_date, and do not have any issues. I believe the line about using UTC in the config was written pre-Airflow 1.10 when the timezone support was very poor.

Other than that, I don't think there are enough details to assist you. You should provide a simple example DAG, the airflow.cfg and UI changes you have made, and a screenshot of what you think is incorrect.

schwartzpub commented 1 year ago

Last run/next run aside -- what other details are needed that aren't provided above? For reference, the documentation for 2.5 is where the recommendation for default_timezone = UTC came from.

The dag definition is as follows:

from datetime import datetime, timedelta
from airflow import DAG
from airflow.models import Variable

from airflow.operators.bash import BashOperator

import pendulum

with DAG(
    "test_dataflow",
    default_args = {
        "depends_on_past": False,
        "email": ["test@test.com"],
        "email_on_failure": False,
    },
    description="Test Dataflow",
    schedule="*/15 6-17 * * 1-5",
    start_date=pendulum.datetime(2023,1,19,6,tz='America/Chicago'),
    catchup=False,
) as dag:
    ssis_p = Variable.get("ssis_password")
    bash_comm = "/opt/ssis/bin/dtexec /f /home/airflow/airflow/ssis/Package.dtsx /de {0} /l 'DTS.LogProviderTextFile;ssis.txt'".format(ssis_p)
    t1 = BashOperator(
        task_id="ssis_dataflow",
        bash_command=bash_comm
    )

    t1

The local server is set to CST

user@host:~$ sudo timedatectl
               Local time: Thu 2023-01-19 06:44:37 CST
           Universal time: Thu 2023-01-19 12:44:37 UTC
                 RTC time: Thu 2023-01-19 12:44:36
                Time zone: America/Chicago (CST, -0600)
System clock synchronized: yes
              NTP service: active
          RTC in local TZ: no

The airflow.cfg is currently set to UTC, but changing it to America/Chicago and restarting the scheduler and webserver services doesn't change the behavior:

    default_timezone = utc

I cannot provide a screenshot since the run times in the UI are not a good indicator (if there is a screenshot that can show this, I can certainly provide one), but given the above configurations I would expect the first run each weekday to happen at 6:15a CST. Instead, the first run of the day happens at 12:15pm CST. When I check the DAG in the morning(s) and through to the afternoon there are no new runs until 12:15pm CST.

If there is any other information I can provide that might be missing here, please let me know so I can provide it.

schwartzpub commented 1 year ago

This is a screenshot of all the DAG runs from today so far.

image

I still don't fully understand the Logical Date, and I still don't understand what would cause the daily DAG runs to start at 12:15p CST instead of 06:00a CST.

Something else interesting: the queued_at, start_date, and end_date values for the DAG runs in the database show 6p UTC and later, which again doesn't make sense if the expected intervals are UTC-6.
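Airflow stores datetimes in its metadata database in UTC, so the stored values can be compared against the expected schedule by converting both to the same zone. The sketch below is only an illustration: `stored_utc` is a made-up value standing in for a "6p UTC and later" row, not data copied from the database, and the expected trigger of 06:15 local time assumes the first run of the day fires at the end of the 06:00 interval.

```python
from datetime import datetime
from zoneinfo import ZoneInfo

UTC = ZoneInfo("UTC")
CHICAGO = ZoneInfo("America/Chicago")

# Expected first trigger of the day in local wall-clock time (assumption).
expected_local = datetime(2023, 1, 19, 6, 15, tzinfo=CHICAGO)
expected_utc = expected_local.astimezone(UTC)

# Illustrative stored value, standing in for a "6p UTC and later" row.
stored_utc = datetime(2023, 1, 19, 18, 15, tzinfo=UTC)
stored_local = stored_utc.astimezone(CHICAGO)

delay = stored_utc - expected_utc

print(expected_utc.isoformat())   # 2023-01-19T12:15:00+00:00
print(stored_local.isoformat())   # 2023-01-19T12:15:00-06:00
print(delay)                      # 6:00:00
```

A correctly timed 06:15 CST run would be stored as 12:15 UTC, so rows at 18:15 UTC correspond to the 12:15p CST starts reported in this thread.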

image

Taragolis commented 1 year ago

This looks more like a discussion than an issue. Converted.