Could you please provide an example of your DAG files? (one that contains the DAG config, like the start date, schedule, catchup, ...)
Here it is
from datetime import datetime, timedelta

from airflow.operators.dummy_operator import DummyOperator
from airflow.providers.docker.operators.docker import DockerOperator
from airflow.timetables.trigger import CronTriggerTimetable
from airflow.utils.state import State
from airflow.utils.trigger_rule import TriggerRule
from airflow import DAG

# Equals to the source/target schema
client_name = "apache_airflow"

default_args = {
    "owner": "insights",
    "depends_on_past": False,
    "email_on_failure": False,
    "email_on_retry": False,
    "retries": 1,
    "retry_delay": timedelta(minutes=5),
}


def notify_on_dag_failure(context):
    dag_run = context.get("dag_run")
    client_name = context["params"]["client_name"]
    dag_id = dag_run.dag_id
    reason = context["reason"]
    task_instances = dag_run.get_task_instances()
    failed_ti = [ti for ti in task_instances if ti.state == State.FAILED]
    failed_ti_ids = ", ".join([ti.task_id for ti in failed_ti])
    # card_json is a project-specific helper; its definition is not included in this snippet.
    message = card_json(
        "fail",
        f"Failed {client_name}",
        f"{reason}: {failed_ti_ids}",
        client_name,
        dag_id,
    )
    print(message)


with DAG(
    f"test_{client_name}",
    schedule=CronTriggerTimetable("0 5 * * *", timezone="UTC"),  # https://crontab.guru/
    default_args=default_args,
    catchup=False,
    tags=["insights", "dbt", "snowflake", client_name],
    on_failure_callback=notify_on_dag_failure,
    max_active_runs=1,
    max_active_tasks=1,
    default_view="grid",
    start_date=datetime(2022, 2, 28),
    params={
        "client_name": client_name,
    },
) as dag:
    begin = DummyOperator(task_id="begin")
    end = DummyOperator(task_id="end", trigger_rule=TriggerRule.NONE_FAILED)

    standard_zone = DockerOperator(
        dag=dag,
        task_id="standard_zone",
        pool="snowflake",
        doc_md="documentation...",
        command=["/hello"],
        environment={},
        image="hello-world",
        api_version="auto",
        auto_remove=True,
        docker_conn_id="docker_default",
        docker_url="unix://var/run/docker.sock",
        network_mode="host",
        tty=True,
        mount_tmp_dir=False,
        do_xcom_push=False,
        force_pull=True,
        retries=3,
        max_retry_delay=timedelta(hours=4),
        retry_delay=timedelta(minutes=5),
        retry_exponential_backoff=True,
    )

    begin >> standard_zone >> end
There might be a bug in the CronTriggerTimetable being used. Could you please try using schedule="0 5 * * *" instead? Also, you can try catchup=True with CronTriggerTimetable.
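For reference, a minimal sketch of the two schedule variants being compared (the DAG ids and start date below are placeholders, not taken from the original DAG):

from datetime import datetime

from airflow import DAG
from airflow.timetables.trigger import CronTriggerTimetable

# Plain cron string: Airflow wraps it in a data-interval timetable (the "old" syntax).
dag_cron_string = DAG(
    "example_cron_string",  # placeholder dag_id
    schedule="0 5 * * *",
    start_date=datetime(2022, 2, 28),
    catchup=False,
)

# Explicit CronTriggerTimetable: zero-length data intervals, the variant under suspicion here.
dag_cron_trigger = DAG(
    "example_cron_trigger",  # placeholder dag_id
    schedule=CronTriggerTimetable("0 5 * * *", timezone="UTC"),
    start_date=datetime(2022, 2, 28),
    catchup=False,
)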
I will enable catchup=True today to see how it goes, and if it doesn't help, tomorrow I will run with catchup=True and the old cron syntax.
The catchup=True helped and all DAGs have been running fine, two days in a row.
I see that some of the DAGs are scheduled 14 seconds after the required time. The same is also visible with the old Airflow versions, without catchup=True.
select dag_id, queued_at, start_date from dag_run
WHERE 1=1
and execution_date::date = CURRENT_DATE - INTERVAL '0 day'
ORDER BY queued_at;
dag_id | queued_at | start_date
--------------------------------------------+-------------------------------+-------------------------------
CLIENT_dag_name | 2023-06-21 05:00:00.429688+00 | 2023-06-21 05:00:00.600953+00
CLIENT_dag_name | 2023-06-21 05:00:00.441562+00 | 2023-06-21 05:00:00.60247+00
CLIENT_dag_name | 2023-06-21 05:00:00.457421+00 | 2023-06-21 05:00:00.604609+00
CLIENT_dag_name | 2023-06-21 05:00:00.469522+00 | 2023-06-21 05:00:00.606059+00
CLIENT_dag_name | 2023-06-21 05:00:00.48264+00 | 2023-06-21 05:00:00.612009+00
CLIENT_dag_name | 2023-06-21 05:00:00.493514+00 | 2023-06-21 05:00:00.592331+00
CLIENT_dag_name | 2023-06-21 05:00:00.522506+00 | 2023-06-21 05:00:00.599394+00
CLIENT_dag_name | 2023-06-21 05:00:00.53555+00 | 2023-06-21 05:00:00.60757+00
CLIENT_dag_name | 2023-06-21 05:00:00.550662+00 | 2023-06-21 05:00:00.609064+00
CLIENT_dag_name | 2023-06-21 05:00:00.56266+00 | 2023-06-21 05:00:00.610524+00
CLIENT_dag_name | 2023-06-21 05:00:02.051863+00 | 2023-06-21 05:00:02.228549+00
CLIENT_dag_name | 2023-06-21 05:00:02.064021+00 | 2023-06-21 05:00:02.243757+00
CLIENT_dag_name | 2023-06-21 05:00:02.079336+00 | 2023-06-21 05:00:02.230704+00
CLIENT_dag_name | 2023-06-21 05:00:02.101663+00 | 2023-06-21 05:00:02.235086+00
CLIENT_dag_name | 2023-06-21 05:00:02.115229+00 | 2023-06-21 05:00:02.233271+00
CLIENT_dag_name | 2023-06-21 05:00:02.129137+00 | 2023-06-21 05:00:02.236793+00
CLIENT_dag_name | 2023-06-21 05:00:02.14308+00 | 2023-06-21 05:00:02.238552+00
CLIENT_dag_name | 2023-06-21 05:00:02.156731+00 | 2023-06-21 05:00:02.226757+00
CLIENT_dag_name | 2023-06-21 05:00:02.175163+00 | 2023-06-21 05:00:02.240286+00
CLIENT_dag_name | 2023-06-21 05:00:02.189773+00 | 2023-06-21 05:00:02.242026+00
CLIENT_dag_name | 2023-06-21 05:00:04.341434+00 | 2023-06-21 05:00:04.539629+00
CLIENT_dag_name | 2023-06-21 05:00:04.355116+00 | 2023-06-21 05:00:04.545596+00
CLIENT_dag_name | 2023-06-21 05:00:04.369535+00 | 2023-06-21 05:00:04.547978+00
CLIENT_dag_name | 2023-06-21 05:00:04.383337+00 | 2023-06-21 05:00:04.55093+00
CLIENT_dag_name | 2023-06-21 05:00:04.39816+00 | 2023-06-21 05:00:04.553406+00
CLIENT_dag_name | 2023-06-21 05:00:04.413037+00 | 2023-06-21 05:00:04.558538+00
CLIENT_dag_name | 2023-06-21 05:00:04.437959+00 | 2023-06-21 05:00:04.561113+00
CLIENT_dag_name | 2023-06-21 05:00:04.466875+00 | 2023-06-21 05:00:04.563797+00
CLIENT_dag_name | 2023-06-21 05:00:04.482523+00 | 2023-06-21 05:00:04.566353+00
CLIENT_dag_name | 2023-06-21 05:00:04.500369+00 | 2023-06-21 05:00:04.568963+00
CLIENT_dag_name | 2023-06-21 05:00:14.449285+00 | 2023-06-21 05:00:14.836314+00
CLIENT_dag_name | 2023-06-21 05:00:14.556828+00 | 2023-06-21 05:00:14.841634+00
CLIENT_dag_name | 2023-06-21 05:00:14.633153+00 | 2023-06-21 05:00:14.845604+00
CLIENT_dag_name | 2023-06-21 05:00:14.725238+00 | 2023-06-21 05:00:14.847512+00
(34 rows)
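As a quick sanity check on the "14 seconds late" observation, the skew can be computed directly from the rows above; a trivial sketch with timestamps copied from the last batch in the query output:

from datetime import datetime, timezone

cron_boundary = datetime(2023, 6, 21, 5, 0, 0, tzinfo=timezone.utc)
queued_at = datetime(2023, 6, 21, 5, 0, 14, 449285, tzinfo=timezone.utc)   # last batch above
start_date = datetime(2023, 6, 21, 5, 0, 14, 836314, tzinfo=timezone.utc)

print((queued_at - cron_boundary).total_seconds())  # ~14.45 s after the 05:00:00 boundary
print((start_date - queued_at).total_seconds())     # ~0.39 s from queued to started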
Possibly related to #27399?
In this issue, Airflow doesn't create runs for whole days at all - it acts as if the DAG were disabled and skips those days/runs.
I think that might still be related. Some subtle bug (like evaluating the schedule precisely at the very moment it should fire) might trigger it. For some reason you seem to have an installation where this behaviour is easily reproducible, so maybe we can use it to narrow down the issue.
I think @hussein-awala was right: it would be great if you could try to reproduce it with the old expression and catchup=False. From what I understand above, catchup=True actually solves the problem. If we could know whether the old schedule also does (independently of catchup=True), it would help to narrow down the issue.
Also cc: @uranusjr -> It really looks like some edge case in CronTriggerTimetable, judging from the description and the helpful experiments done by @ibardarov-fms. The 14-second delay in queue time suggests there might be a race condition that gets triggered somewhere by the timetable.
cc: @uranusjr @hussein-awala .
I do not yet have the exact scenario in mind, but looking at how catchup fixes the problem and at the code of the triggerer, I have a possible candidate.
One of the best candidates is this use of the _align_to_next method, as it relies on exact equality AND on the fact that it uses utcnow(). It is executed when catchup is False, and since it uses utcnow(), I believe it might be susceptible to behaving wrongly when you are "just before", "just after", or even "just at" the interval edge.
if restriction.catchup:
    if last_automated_data_interval is not None:
        next_start_time = self._get_next(last_automated_data_interval.end)
    elif restriction.earliest is None:
        return None  # Don't know where to catch up from, give up.
    else:
        next_start_time = self._align_to_next(restriction.earliest)
else:
    start_time_candidates = [self._align_to_next(DateTime.utcnow())]  # !!!!! <--- likely trigger of the problem
    if last_automated_data_interval is not None:
        start_time_candidates.append(self._get_next(last_automated_data_interval.end))
    if restriction.earliest is not None:
        start_time_candidates.append(self._align_to_next(restriction.earliest))
    next_start_time = max(start_time_candidates)
if restriction.latest is not None and restriction.latest < next_start_time:
    return None
return DagRunInfo.interval(next_start_time - self._interval, next_start_time)
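To make the suspicion concrete, here is a minimal, self-contained illustration of the hypothesised edge case for a daily "0 5 * * *" schedule. This is not Airflow's actual implementation: align_to_next and next_start_time below are simplified stand-ins for _align_to_next and the candidate-selection logic above.

from datetime import datetime, timedelta

def align_to_next(dt: datetime) -> datetime:
    """Return dt if it falls exactly on 05:00:00, otherwise the next 05:00:00 after dt."""
    boundary = dt.replace(hour=5, minute=0, second=0, microsecond=0)
    if dt == boundary:
        return dt
    if dt < boundary:
        return boundary
    return boundary + timedelta(days=1)

def next_start_time(now: datetime, last_run: datetime, catchup: bool) -> datetime:
    """Simplified stand-in for the candidate selection above (zero-length trigger intervals)."""
    next_after_last = align_to_next(last_run + timedelta(microseconds=1))  # ~ self._get_next(...)
    if catchup:
        return next_after_last
    # catchup=False: utcnow() competes with the last run and wins once we are past the boundary.
    return max(align_to_next(now), next_after_last)

last_run = datetime(2023, 6, 20, 5, 0, 0)               # yesterday's 05:00 run
just_before = datetime(2023, 6, 21, 4, 59, 59, 500000)  # evaluated 0.5 s before the boundary
just_after = datetime(2023, 6, 21, 5, 0, 0, 500000)     # evaluated 0.5 s after the boundary

print(next_start_time(just_before, last_run, catchup=False))  # 2023-06-21 05:00 -> today's run is planned
print(next_start_time(just_after, last_run, catchup=False))   # 2023-06-22 05:00 -> today is skipped
print(next_start_time(just_after, last_run, catchup=True))    # 2023-06-21 05:00 -> catchup still plans today

The first two calls differ only by one second of wall-clock time, which would match the observation that whether a run is created depends on exactly when the scheduler evaluates the timetable relative to the cron boundary.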
I will run it with catchup=False and the old cron syntax.
I am using these environment variables/settings:
AIRFLOW__SCHEDULER_IDLE_SLEEP_TIME=1
AIRFLOW__SCHEDULER__MIN_FILE_PROCESS_INTERVAL=30
AIRFLOW__SCHEDULER__DAG_DIR_LIST_INTERVAL=30
With catchup=False and the old cron syntax, the problem is visible: out of 30 DAGs, only 5 were scheduled. When I restored the new cron syntax with catchup=True, Airflow started and processed the DAGs that had not been scheduled.
I think the two are indeed related. The alignment implementation guess seems plausible, but I'm not able to come up with an example where it actually triggers a problem right now (i.e. a failing test case).
I believe #32921 might fix that one. @ibardarov-fms - is it possible for you to apply the fix in your installation of Airflow and check it?
I have run it with the fix and it looks good, but I want to run it one more time without the fix to verify that the tasks fail again.
The fix is working. I ran it two times and all DAGs were scheduled correctly and ran.
I had to create new DAGs and make sure that the problem was visible with them. The 3rd run was without the fix. Now I am sure that the fix is working, because without it the new DAGs were not scheduled and run, as described in this ticket.
Apache Airflow version
2.6.1
What happened
We are running 40 DAGs of the same kind. All are started at 05:00. After upgrading to 2.6.1, sometimes DAGs are randomly not scheduled and no DAG runs are created.
What you think should happen instead
I would like to see a green column for the next period. If something is failing, I would expect to see an error or at least a warning message somewhere.
How to reproduce
It happens after the upgrade from 2.3.2.
Operating System
Ubuntu
Versions of Apache Airflow Providers
Deployment
Docker-Compose
Deployment details
I ran airflow from docker-compose.
Anything else
When I manually pause and unpause the DAG, nothing happens. In the audit log there is no information about anyone trying to run the DAG. In the Postgres tables there are no entries/rows created for the failing DAG for the missing dates. There are no logs created for the missing days, and there are no errors in the other log files.
I tried allocating a lot of memory to the container and it works. I added a swap file, but it looks like it has never been used. The tasks are running dbt.
For the DAG processor, I see some PIDs from time to time.
In the scheduler log I see that nothing is scheduled at the expected time.
Are you willing to submit PR?
Code of Conduct