apache / airflow

Failed to fetch log file from worker. 404 Client Error: NOT FOUND #30743

Open Gollum999 opened 1 year ago

Gollum999 commented 1 year ago

Apache Airflow version

Other Airflow 2 version (please specify below)

What happened

If a task gets scheduled on a different worker host from a previous run, logs from that previous run will be unavailable from the webserver.

nkcairflow02.tradebot.com
*** Log file does not exist: /opt/tradebot/logs/airflow/dag_id=cluster_wiretap_download_mnj/run_id=scheduled__2023-04-18T05:00:00+00:00/task_id=chunk_1200_1230.download_chunk_1200_1230/attempt=2.log
*** Fetching from: http://nkcairflow02.tradebot.com:8793/log/dag_id=cluster_wiretap_download_mnj/run_id=scheduled__2023-04-18T05:00:00+00:00/task_id=chunk_1200_1230.download_chunk_1200_1230/attempt=2.log
*** Failed to fetch log file from worker. 404 Client Error: NOT FOUND for url: http://nkcairflow02.tradebot.com:8793/log/dag_id=cluster_wiretap_download_mnj/run_id=scheduled__2023-04-18T05:00:00+00:00/task_id=chunk_1200_1230.download_chunk_1200_1230/attempt=2.log
For more information check: https://httpstatuses.com/404

From my observation, it's pretty clear that this is happening because Airflow is looking for logs at the most recent hostname, not at the hostname that hosted the historical run. In the example above, attempt 2 actually ran on nkcairflow06, but the webserver is trying to fetch logs from nkcairflow02 because that is where attempt 3 ran. You can watch this happen "live" with the example DAG below - each time the task retries on a different worker, the URL that it attempts to use for historical logs changes to use the most recent hostname.
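
For illustration, here is a simplified sketch of the lookup the webserver performs when it streams worker logs (this is not the actual Airflow source; the helper name and signature are invented). The key point is that the URL is built from the task instance's single hostname field, which is overwritten on every attempt:

import requests

WORKER_LOG_SERVER_PORT = 8793  # default serve_logs port, matching the URLs above

def fetch_worker_log(ti, try_number):
    """Hypothetical helper mirroring the per-attempt fetch described above."""
    relative_path = (
        f"dag_id={ti.dag_id}/run_id={ti.run_id}/"
        f"task_id={ti.task_id}/attempt={try_number}.log"
    )
    # ti.hostname only records where the *latest* attempt ran, so older
    # attempts that executed on a different worker come back as 404.
    url = f"http://{ti.hostname}:{WORKER_LOG_SERVER_PORT}/log/{relative_path}"
    response = requests.get(url, timeout=5)
    response.raise_for_status()  # surfaces the "404 Client Error: NOT FOUND" above
    return response.text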

What you think should happen instead

All previous logs should be available from the webserver (as long as the log file still exists on disk).

How to reproduce

#!/usr/bin/env python3
import datetime
import logging
import socket

from airflow.decorators import dag, task

logger = logging.getLogger(__name__)

@dag(
    schedule_interval='@daily',
    start_date=datetime.datetime(2023, 4, 19),
    default_args={
        'retries': 9,
        'retry_delay': 10.0,
    },
)
def test_logs():
    @task
    def log():
        logger.info(f'Running from {socket.gethostname()}')
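        # Always fail so the task keeps retrying; with CeleryExecutor each retry
        # may land on a different worker host, reproducing the broken log URL.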
        raise RuntimeError('force fail')

    log()

test_logs()

Operating System

CentOS Stream 8

Versions of Apache Airflow Providers

N/A

Deployment

Other

Deployment details

Airflow 2.5.1
Self-hosted
Postgres DB backend
CeleryExecutor

Anything else

Related issues:

Looks like #23178 attempted to fix the issue, but it never got merged.

Are you willing to submit PR?

Code of Conduct

potiuk commented 1 year ago

Yes. I think this is one of the problems that could be addressed by #30669 if we decide to add a separate DB entity to keep try_nums - @uranusjr @hussein-awala -> if we move hostname to the new entity, this exact problem would be fixed.
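
To make that direction concrete, a rough sketch follows (purely illustrative; `TaskTryRecord` and its columns are invented here and are not the schema proposed in #30669) of a per-attempt record that would let the webserver resolve the correct host for each try_number:

from sqlalchemy import Column, Integer, String
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class TaskTryRecord(Base):  # hypothetical name, for illustration only
    __tablename__ = "task_try_record"

    id = Column(Integer, primary_key=True)
    dag_id = Column(String(250), nullable=False)
    task_id = Column(String(250), nullable=False)
    run_id = Column(String(250), nullable=False)
    try_number = Column(Integer, nullable=False)
    hostname = Column(String(1000))  # host that executed this particular attempt

# The webserver could then build the fetch URL from the hostname stored for the
# requested try_number instead of TaskInstance.hostname, which only reflects the
# most recent attempt.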

jose-l-goncalves-alb commented 4 months ago

This error still happens in version 2.7.3.

potiuk commented 4 months ago

Of course. It's not fixed. But if you would like to help fix it, it's up for grabs, @jose-l-goncalves-alb. Would you like to take a stab at it? The linked PR shows the direction, so the best way to help solve it is to fix it. Or maybe you or your company would like to dedicate an engineer to work on it?