apache / airflow

Apache Airflow - A platform to programmatically author, schedule, and monitor workflows
https://airflow.apache.org/
Apache License 2.0

Try number inconsistency between webserver and the actual log generated #42549

Closed dheerajturaga closed 1 month ago

dheerajturaga commented 1 month ago

Apache Airflow version

Other Airflow 2 version (please specify below)

If "Other Airflow 2 version" selected, which one?

2.10.2

What happened?

Some tasks in Airflow 2.10.2 are being launched with try number 0. However, the webserver looks for logs starting at try number 1. Because of this, we have cases where a task runs and passes, yet the webserver throws a "cannot read served logs" error.

Executor=CeleryExecutor

[screenshot: try number issue]

What you think should happen instead?

Airflow should start the attempt with try number 1 instead.

How to reproduce

This happens intermittently; not all tasks start with try number 0. I have checked that both the webserver and the Celery worker are on the same version, 2.10.2.

Operating System

SUSE Linux Enterprise Server 12 SP5

Versions of Apache Airflow Providers

apache-airflow-providers-celery==3.8.1
celery==5.4.0

Deployment

Docker-Compose

Deployment details

No response

Anything else?

No response

Are you willing to submit PR?

Code of Conduct

boring-cyborg[bot] commented 1 month ago

Thanks for opening your first issue here! Be sure to follow the issue template! If you are willing to raise PR to address this issue please do so, no need to wait for approval.

jscheffl commented 1 month ago

I am not sure whether this is really a bug; it might rather be a semantic gap, or something that could be (visually) improved by just hiding the logs when no logs are there.

In Airflow 2.10 the internal semantics were changed to increment the try number of a task only at the time the task is really scheduled/started. As long as it just exists as an instance (not scheduled), try=0 is the sign that it has never been tried. In this case the visual improvement might be to hide the log panel, since no logs exist yet.
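
A minimal sketch of the kind of check such a visual improvement could make, assuming the try=0 semantics described above (the helper name is hypothetical, not existing webserver code):

def should_show_log_panel(ti) -> bool:
    # Hypothetical UI-side helper: in Airflow >= 2.10 a TaskInstance keeps
    # try_number == 0 until it is actually scheduled/started, so no log file
    # can exist for it yet and the log panel could simply be hidden.
    return ti.try_number > 0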

Would you like to submit a small improvement?

dheerajturaga commented 1 month ago

@jscheffl, that's not really the case. Here the task has started and completed, and the log is captured under attempt-0.log. However, the webserver is looking for attempt-1.log. It appears that there are cases where the try number isn't incremented on the log side even when the task has started.

du -sh attempt-0.log
64K     attempt-0.log

[2024-09-25T14:24:04.772+0000] {taskinstance.py:1402} DEBUG - previous_execution_date was called
[2024-09-25T14:24:04.822+0000] {taskinstance.py:1402} DEBUG - previous_execution_date was called
[2024-09-25T14:24:04.838+0000] {local_task_job_runner.py:123} INFO - ::group::Pre task execution logs
[2024-09-25T14:24:04.854+0000] {taskinstance.py:2612} INFO - Dependencies all met for dep_context=non-requeueable deps ti=<TaskInstance: [queued]>
[2024-09-25T14:24:04.867+0000] {taskinstance.py:2612} INFO - Dependencies all met for dep_context=requeueable deps ti=<TaskInstance: scheduled__2024-09-25T05:00:00+00:00 [queued]>
[2024-09-25T14:24:04.868+0000] {taskinstance.py:2865} INFO - Starting attempt 0 of 1

jscheffl commented 1 month ago

Thanks for the response. You say this happens intermittently - do you have any means to reproduce this?

dheerajturaga commented 1 month ago

Let me see if I can isolate this.

jscheffl commented 1 month ago

Might this be a side effect of #39336, @dstandish?

dstandish commented 1 month ago

How was this task launched / created / scheduled?

dheerajturaga commented 1 month ago

@dstandish, this is essentially a BashOperator. The task was triggered on schedule and is part of a task_group. I suspect what's happening here is that the log_filename_template uses attempt={{ try_number }}.log as its default, and that {try_number} and {ti.try_number} are possibly diverging. Please review PR #42633.
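
To make the suspected divergence concrete, here is a minimal, self-contained sketch (FakeTI is a stand-in object with hand-picked numbers, not an Airflow TaskInstance) of how the same filename template can render two different values when the try_number placed in the Jinja context differs from ti.try_number:

from jinja2 import Template

class FakeTI:
    # Stand-in for a TaskInstance whose stored try_number has moved ahead
    # of the value injected into the template context.
    try_number = 2

template = Template("attempt-{{ try_number }}-{{ ti.try_number }}.log")
print(template.render(try_number=1, ti=FakeTI()))  # -> attempt-1-2.log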

dheerajturaga commented 1 month ago

@dstandish, @jscheffl looks like my PR is exposing a bug with #39336.

Take a look at https://github.com/apache/airflow/actions/runs/11133289143/job/30939468337?pr=42633

dstandish commented 1 month ago

> @dstandish, @jscheffl looks like my PR is exposing a bug with #39336.
>
> Take a look at https://github.com/apache/airflow/actions/runs/11133289143/job/30939468337?pr=42633

It's certainly possible, but I have not seen this. So can you provide some repro steps? Like, give us the simplest DAG possible that still produces this behavior.

dheerajturaga commented 1 month ago

@dstandish, I'm trying hard to reproduce the conditions under which this happens. It happens intermittently. Take a look at this screenshot; in this scenario {try_number} and {ti.try_number} are different.

I added both of these variables to the log_filename_template:

log_filename_template = dag-{{ ti.dag_id }}/run-{{ ts_nodash }}/task-{{ ti.task_id }}/{%% if ti.map_index >= 0 %%}mapid-{{ ti.map_index }}/{%% endif %%}attempt-{{ try_number }}-{{ ti.try_number }}.log

[screenshot]

Also, there are many cases where the first task of the DAG starts with try_number 2.

dheerajturaga commented 1 month ago

@dstandish, @jscheffl Here's how you reproduce this: just run the simple DAG below. On completion, clear the first task and let it rerun. You will notice {try_number} and {ti.try_number} diverge: {try_number} accurately changes between 1 and 2, but {ti.try_number} is stuck at 2.

Make sure you set this cfg option:

log_filename_template = dag-{{ ti.dag_id }}/run-{{ ts_nodash }}/task-{{ ti.task_id }}/{%% if ti.map_index >= 0 %%}mapid-{{ ti.map_index }}/{%% endif %%}attempt-{{ try_number }}-{{ ti.try_number }}.log

from __future__ import annotations

import pendulum

from airflow.decorators import dag
from airflow.operators.bash import BashOperator


@dag(
    dag_id="debug_me",
    start_date=pendulum.today('America/Chicago'),
    catchup=False,
    tags=["debug"],
)
def single_hello():
    # Two chained BashOperators that echo the templated try number so the
    # rendered value can be compared against the log filename on disk.
    hw = BashOperator(
        task_id="try_number_check",
        bash_command="""
        # echo try_number:
        echo ti.try_number: {{ti.try_number}}
        """,
    )
    hw1 = BashOperator(
        task_id="try_number_check_1",
        bash_command="""
        # echo try_number:
        echo ti.try_number: {{ti.try_number}}
        """,
    )

    hw >> hw1


single_hello()

dstandish commented 1 month ago

I tried reproducing this on main by clearing the task, but could not.

https://github.com/user-attachments/assets/685ff05d-2b40-4019-8f3e-fef0da404b5e

Here's my code

import pendulum
from datetime import timedelta
from random import random as rand  # alias so the rand() call below works
from time import sleep

from airflow import DAG
from airflow.decorators import task
from airflow.models.baseoperator import chain_linear  # available in Airflow >= 2.7

with DAG(dag_id="dag1", schedule="@daily", catchup=False, start_date=pendulum.now()) as dag1:

    @task(retries=2, retry_delay=timedelta(seconds=10))
    def the_task(try_number=None, ti=None):
        sleep(rand() * 10)  # sleep for a random fraction of 10 seconds

    tasks = []
    for num in range(3):
        tasks.append(the_task.override(task_id=f"task_{num+1}")())

    chain_linear(*tasks)

Can you try it and see if you get a different result?

dheerajturaga commented 1 month ago

@dstandish, you still need to change attempt-{{ try_number }}.log to attempt-{{ ti.try_number }}.log in your airflow.cfg.

Once you do that, try clicking between the logs of try 1 and try 2.

Even better would be setting the following in your airflow config:

log_filename_template = dag-{{ ti.dag_id }}/run-{{ ts_nodash }}/task-{{ ti.task_id }}/{%% if ti.map_index >= 0 %%}mapid-{{ ti.map_index }}/{%% endif %%}attempt-{{ try_number }}-{{ ti.try_number }}.log

dheerajturaga commented 1 month ago

@dstandish, the issue is that it happens at random. I have a cluster of Celery workers which keep listening for jobs from the scheduler. There may be a case where the try number isn't incremented even when the job is running on the worker. I can't isolate this with airflow standalone, and it seems to be affecting all tasks (there doesn't seem to be a pattern here).

[screenshot]

Clearing this task increments the try_number to 2

[screenshot]

A workaround here would be to find all attempt-0.log files and symlink them to attempt-1.log to fix the GUI. I have a DAG that does this every 20 minutes; there are many tasks scattered all over that are randomly impacted by this.
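
A minimal sketch of that kind of workaround, assuming per-attempt log files live under a base log folder (the default path and the attempt-N naming are assumptions; the reporter's actual DAG is not shown):

from pathlib import Path

def link_attempt0_logs(base_log_folder: str = "/opt/airflow/logs") -> None:
    # For every attempt-0.log without an attempt-1.log sibling, create a
    # relative symlink so the filename the webserver expects resolves to
    # the log file that was actually written.
    for attempt0 in Path(base_log_folder).rglob("attempt-0.log"):
        attempt1 = attempt0.with_name("attempt-1.log")
        if not attempt1.exists():
            attempt1.symlink_to(attempt0.name)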

dheerajturaga commented 1 month ago

@dstandish, I figured out what's happening. I had a few workers which were still on Airflow 2.7.2 and never got recycled. The issue went away when I upgraded them to 2.10.2.

Thanks for your help on triaging this!

dstandish commented 1 month ago

> @dstandish, I figured out what's happening. I had a few workers which were still on Airflow 2.7.2 and never got recycled. The issue went away when I upgraded them to 2.10.2.
>
> Thanks for your help on triaging this!

Awesome, great result