Closed: dheerajturaga closed this issue 1 month ago
I am not sure whether this is really a bug; it might rather be a semantic gap, or something that could be (visually) improved by simply hiding the log panel when there are no logs.
In Airflow 2.10 the internal semantics were changed so that a task's try number is only incremented when the task is actually scheduled/started. As long as the task only exists as an instance (not scheduled), try=0 is the sign that it has never been tried. In this case the visual improvement might be to hide the log panel, since no logs exist.
Would you like to submit a small improvement?
@jscheffl, that's not really the case. Here the task has started and completed, and the log is captured under attempt-0.log; however, the webserver is looking for attempt-1.log. It appears there are cases where the try number isn't incremented on the log side even when the task has started.
$ du -sh attempt-0.log
64K     attempt-0.log
[2024-09-25T14:24:04.772+0000] {taskinstance.py:1402} DEBUG - previous_execution_date was called
[2024-09-25T14:24:04.822+0000] {taskinstance.py:1402} DEBUG - previous_execution_date was called
[2024-09-25T14:24:04.838+0000] {local_task_job_runner.py:123} INFO - ::group::Pre task execution logs
[2024-09-25T14:24:04.854+0000] {taskinstance.py:2612} INFO - Dependencies all met for dep_context=non-requeueable deps ti=<TaskInstance: [queued]>
[2024-09-25T14:24:04.867+0000] {taskinstance.py:2612} INFO - Dependencies all met for dep_context=requeueable deps ti=<TaskInstance: scheduled__2024-09-25T05:00:00+00:00 [queued]>
[2024-09-25T14:24:04.868+0000] {taskinstance.py:2865} INFO - Starting attempt 0 of 1
Thanks for the response. You say this happens intermittently - do you have any means to reproduce this?
let me see if I can isolate this
Might this be a side effect of #39336, @dstandish?
How was this task launched / created / scheduled?
@dstandish, this is essentially a BashOperator. The task was triggered on schedule and is part of a task_group. I suspect what's happening here is that log_filename_template is using attempt={{ try_number }}.log as its default, and {{ try_number }} and {{ ti.try_number }} are possibly diverging. Please review PR #42633.
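To make the suspected divergence concrete, here is a small sketch (plain Jinja2 with made-up values, not Airflow code) showing how a filename keyed on try_number and one keyed on ti.try_number point at different files once the two values disagree:

from types import SimpleNamespace
from jinja2 import Template

# Hypothetical values for illustration: the context-level try_number and the
# TaskInstance field have drifted apart.
context_try_number = 1
ti = SimpleNamespace(try_number=2)

stock = Template("attempt={{ try_number }}.log")
patched = Template("attempt={{ ti.try_number }}.log")

print(stock.render(try_number=context_try_number, ti=ti))    # attempt=1.log
print(patched.render(try_number=context_try_number, ti=ti))  # attempt=2.log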
@dstandish, @jscheffl it looks like my PR is exposing a bug with #39336.
Take a look at https://github.com/apache/airflow/actions/runs/11133289143/job/30939468337?pr=42633
It's certainly possible. But I have not seen this. So can you provide some repro steps? Like, give us the simplest dag possible which also produces this behavior.
@dstandish, I'm trying hard to reproduce the conditions under which this happens. It is intermittent. Take a look at this screenshot: in this scenario {{ try_number }} and {{ ti.try_number }} are different. I added both variables to the log_filename_template.
log_filename_template = dag-{{ ti.dag_id }}/run-{{ ts_nodash }}/task-{{ ti.task_id }}/{%% if ti.map_index >= 0 %%}mapid-{{ ti.map_index }}/{%% endif %%}attempt-{{ try_number }}-{{ ti.try_number }}.log
Also, there are many cases where the first task of the DAG starts with try_number 2.
@dstandish, @jscheffl here's how to reproduce this. Just run the simple DAG below. On completion, clear the first task and let it rerun. You will notice that {{ try_number }} and {{ ti.try_number }} diverge: {{ try_number }} correctly alternates between 1 and 2, but {{ ti.try_number }} is stuck at 2.
make sure you set this cfg option:
log_filename_template = dag-{{ ti.dag_id }}/run-{{ ts_nodash }}/task-{{ ti.task_id }}/{%% if ti.map_index >= 0 %%}mapid-{{ ti.map_index }}/{%% endif %%}attempt-{{ try_number }}-{{ ti.try_number }}.log
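For anyone following along, this setting lives in the [logging] section of airflow.cfg; a minimal sketch of the override (comments are mine, and the doubled % is needed because the file is parsed by ConfigParser):

[logging]
# Render both variables into the filename so any divergence between
# {{ try_number }} and {{ ti.try_number }} is visible directly on disk.
log_filename_template = dag-{{ ti.dag_id }}/run-{{ ts_nodash }}/task-{{ ti.task_id }}/{%% if ti.map_index >= 0 %%}mapid-{{ ti.map_index }}/{%% endif %%}attempt-{{ try_number }}-{{ ti.try_number }}.log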
from __future__ import annotations
import pendulum
import airflow
from airflow.decorators import dag, task_group
from airflow.operators.bash import BashOperator
from datetime import timedelta
import yaml
import os
@dag(
    dag_id="debug_me",
    start_date=pendulum.today('America/Chicago'),
    catchup=False,
    tags=["debug"],
)
def single_hello():
    hw = BashOperator(
        task_id="try_number_check",
        bash_command="""
        # echo try_number:
        echo ti.try_number: {{ti.try_number}}
        """,
    )
    hw1 = BashOperator(
        task_id="try_number_check_1",
        bash_command="""
        # echo try_number:
        echo ti.try_number: {{ti.try_number}}
        """,
    )
    hw >> hw1


single_hello()
I tried reproducing this on main by clearing the task, but could not.
https://github.com/user-attachments/assets/685ff05d-2b40-4019-8f3e-fef0da404b5e
Here's my code
import random
from datetime import timedelta
from time import sleep
import pendulum
from airflow import DAG
from airflow.decorators import task
from airflow.models.baseoperator import chain_linear

with DAG(dag_id="dag1", schedule="@daily", catchup=False, start_date=pendulum.now()) as dag1:
    @task(retries=2, retry_delay=timedelta(seconds=10))
    def the_task(try_number=None, ti=None):
        sleep(random.random() * 10)  # random delay of up to 10 seconds

    tasks = []
    for num in range(3):
        tasks.append(the_task.override(task_id=f"task_{num+1}")())
    chain_linear(*tasks)
Can you try it and see if you get a different result?
@dstandish, you missed changing attempt-{{ try_number }}.log to attempt-{{ ti.try_number }}.log in your airflow.cfg. Once you do that, try clicking between the logs of try 1 and try 2. Even better would be setting the following in your Airflow config:
log_filename_template = dag-{{ ti.dag_id }}/run-{{ ts_nodash }}/task-{{ ti.task_id }}/{%% if ti.map_index >= 0 %%}mapid-{{ ti.map_index }}/{%% endif %%}attempt-{{ try_number }}-{{ ti.try_number }}.log
@dstandish, the issue is that it happens at random. I have a cluster of Celery workers that keep listening for jobs from the scheduler. There may be a case where the try number isn't incremented even when the job is running on the worker. I can't isolate this with airflow standalone, and it seems to affect all tasks (there doesn't seem to be a pattern here).
Clearing this task increments the try_number to 2
A workaround here is to find every attempt-0.log and symlink it to attempt-1.log to fix the GUI. I have a DAG doing this every 20 minutes, since there are many tasks scattered all over that are randomly impacted by this; a rough sketch of that cleanup is below.
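A minimal sketch of that cleanup, assuming logs sit on local disk with one attempt-N.log per try (the base path here is an assumption; adjust it to your [logging] base_log_folder):

import os
from pathlib import Path

# Assumed log location; adjust to your deployment's base_log_folder.
BASE_LOG_FOLDER = Path(os.environ.get("AIRFLOW_HOME", "~/airflow")).expanduser() / "logs"

def link_attempt_zero_logs(base: Path) -> None:
    """Create attempt-1.log as a symlink wherever only attempt-0.log exists."""
    for zero_log in base.rglob("attempt-0.log"):
        target = zero_log.with_name("attempt-1.log")
        if not target.exists():
            # Relative symlink so the log tree stays relocatable.
            target.symlink_to(zero_log.name)

if __name__ == "__main__":
    link_attempt_zero_logs(BASE_LOG_FOLDER)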
@dstandish, I figured out what's happening. I had a few workers which were still on Airflow 2.7.2 and never got recycled. The issue went away when I upgraded them to 2.10.2.
Thanks for your help on triaging this!
Awesome, great result
Apache Airflow version
Other Airflow 2 version (please specify below)
If "Other Airflow 2 version" selected, which one?
2.10.2
What happened?
Some tasks in Airflow 2.10.2 are being launched with try number 0; however, the webserver is looking for logs starting with try number 1. Because of this, we have cases where tasks are running and passing, but the webserver throws a "cannot read served logs" error.
Executor=CeleryExecutor
What you think should happen instead?
Airflow should start attempts with try number 1 instead.
How to reproduce
This is happening intermittently; not all tasks start with try number 0. I have checked to make sure both the webserver and the Celery worker are on the same version, 2.10.2.
Operating System
SUSE Linux Enterprise Server 12 SP5
Versions of Apache Airflow Providers
apache-airflow-providers-celery==3.8.1
celery==5.4.0
Deployment
Docker-Compose
Deployment details
No response
Anything else?
No response
Are you willing to submit PR?
Code of Conduct