apache / airflow

Apache Airflow - A platform to programmatically author, schedule, and monitor workflows
https://airflow.apache.org/
Apache License 2.0

airflowignore is not hiding DAGs #42476

Open progressive-scaler opened 1 month ago

progressive-scaler commented 1 month ago

Apache Airflow version

Other Airflow 2 version (please specify below)

If "Other Airflow 2 version" selected, which one?

2.10.2

What happened?

After upgrading from 2.10.0 to 2.10.2 to fix the issue described in #41699, Airflow no longer hides my legacy pipelines.

Inside the dags/ folder, I have a subfolder called archived/ that holds pipelines I no longer use. The filenames of these pipelines end with the suffix _legacy. The archived/ folder contains an .airflowignore file with the pattern _legacy, so that pipelines whose filename ends in _legacy are ignored.

After the upgrade, this file seems to have no effect: none of the pipelines whose filename ends in _legacy are hidden.

I also tested moving .airflowignore to the dags/ folder and appending the pattern archived/ to it, in order to hide every pipeline in that folder. The legacy pipelines were still not hidden.
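For reference, here is a minimal sketch of how the two patterns above would be expected to behave. It assumes the default regexp syntax for .airflowignore, where each line is a regular expression searched for within the file path. The file names and the helper functions `is_ignored` and `visible_files` are hypothetical and only illustrate the matching rule; this is not Airflow's own implementation.

```python
import re


def is_ignored(rel_path: str, patterns: list[str]) -> bool:
    """Return True if any regexp pattern is found anywhere in the path."""
    return any(re.search(pattern, rel_path) for pattern in patterns)


def visible_files(files: list[str], patterns: list[str]) -> list[str]:
    """Keep only the files that no ignore pattern matches."""
    return [f for f in files if not is_ignored(f, patterns)]


# Hypothetical layout, with paths relative to the dags/ folder.
files = [
    "daily_report.py",
    "archived/orders_legacy.py",
    "archived/users_legacy.py",
    "archived/helper.py",
]

# Pattern from the .airflowignore inside archived/.
print(visible_files(files, ["_legacy"]))
# Pattern appended after moving .airflowignore to dags/.
print(visible_files(files, ["archived/"]))
```

Under these assumptions, the first pattern hides only the files ending in _legacy, and the second hides everything under archived/. Neither result was observed after the upgrade.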

Can you help me check what is going on here?

What you think should happen instead?

The pipelines inside the archived/ folder, or any pipeline whose filename contains _legacy.py, should not appear in the Airflow DAGs list.

How to reproduce

Operating System

Debian GNU/Linux 12

Versions of Apache Airflow Providers

apache-airflow-providers-amazon==8.28.0
apache-airflow-providers-celery==3.8.1
apache-airflow-providers-cncf-kubernetes==8.4.1
apache-airflow-providers-common-compat==1.2.0
apache-airflow-providers-common-io==1.4.0
apache-airflow-providers-common-sql==1.16.0
apache-airflow-providers-docker==3.13.0
apache-airflow-providers-elasticsearch==5.5.0
apache-airflow-providers-fab==1.3.0
apache-airflow-providers-ftp==3.11.0
apache-airflow-providers-google==10.23.0
apache-airflow-providers-grpc==3.6.0
apache-airflow-providers-hashicorp==3.8.0
apache-airflow-providers-http==4.13.0
apache-airflow-providers-imap==3.7.0
apache-airflow-providers-microsoft-azure==10.4.0
apache-airflow-providers-mongo==4.2.1
apache-airflow-providers-mysql==5.7.0
apache-airflow-providers-odbc==4.7.0
apache-airflow-providers-openlineage==1.11.0
apache-airflow-providers-postgres==5.12.0
apache-airflow-providers-redis==3.8.0
apache-airflow-providers-sendgrid==3.6.0
apache-airflow-providers-sftp==4.11.0
apache-airflow-providers-slack==8.9.0
apache-airflow-providers-smtp==1.8.0
apache-airflow-providers-snowflake==5.7.0
apache-airflow-providers-sqlite==3.9.0
apache-airflow-providers-ssh==3.13.1

Deployment

Official Apache Airflow Helm Chart

Deployment details

Dockerfile used to build the image from the official base image:

FROM apache/airflow:2.10.2

RUN pip install --no-cache-dir --upgrade "pip"
RUN pip install --no-cache-dir --upgrade \
    "apache-airflow-providers-google" \
    "apache-airflow-providers-mongo" \
    "google-ads" \
    "google-api-core" \
    "google-api-python-client" \
    "google-auth" \
    "google-auth-httplib2" \
    "google-auth-oauthlib" \
    "google-cloud-aiplatform" \
    "google-cloud-appengine-logging" \
    "google-cloud-audit-log" \
    "google-cloud-automl" \
    "google-cloud-bigquery" \
    "google-cloud-bigquery-datatransfer" \
    "google-cloud-bigquery-storage" \
    "google-cloud-bigtable" \
    "google-cloud-build" \
    "google-cloud-container" \
    "google-cloud-core" \
    "google-cloud-datacatalog" \
    "google-cloud-dataform" \
    "google-cloud-dataplex" \
    "google-cloud-dataproc" \
    "google-cloud-dataproc-metastore" \
    "google-cloud-dlp" \
    "google-cloud-kms" \
    "google-cloud-language" \
    "google-cloud-logging" \
    "google-cloud-memcache" \
    "google-cloud-monitoring" \
    "google-cloud-orchestration-airflow" \
    "google-cloud-os-login" \
    "google-cloud-pubsub" \
    "google-cloud-redis" \
    "google-cloud-secret-manager" \
    "google-cloud-spanner" \
    "google-cloud-speech" \
    "google-cloud-storage" \
    "google-cloud-tasks" \
    "google-cloud-texttospeech" \
    "google-cloud-translate" \
    "google-cloud-videointelligence" \
    "google-cloud-vision" \
    "google-cloud-workflows" \
    "mailchimp-marketing"

Anything else?

More context:

Airflow is installed in a test environment and a production environment. The test environment runs version 2.10.2; the production environment runs version 2.10.0.

The issue occurs in the test environment but not in the production environment.

Are you willing to submit PR?

Code of Conduct

boring-cyborg[bot] commented 1 month ago

Thanks for opening your first issue here! Be sure to follow the issue template! If you are willing to raise PR to address this issue please do so, no need to wait for approval.

Lee2532 commented 1 month ago

I think this happens because the scheduler had already exposed the DAGs to the web server.

If a DAG has already been exposed, will it still be exposed if I add it to .airflowignore and then delete that DAG some time later?

c-thiel commented 3 weeks ago

We encountered a related issue: Airflow was parsing files it wasn't supposed to parse and showed "Broken DAG" in the UI. We then added the files to .airflowignore. Airflow still showed the Broken DAGs. We checked the code and found that currently no process is using .airflowignore to clear or filter files from the import_error table. So we truncated the table manually and now everything is OK again.

progressive-scaler commented 3 days ago

While trying to solve this issue, I ran the following checks:

After these results, I simply removed the DAGs through the Airflow GUI. They have not reappeared in the GUI since.

One more note: this happened after upgrading Airflow in the test environment. When we later upgraded the production environment, the issue did not occur. I could not determine why it happened in the test environment but not in production.