airflow-helm / charts

The User-Community Airflow Helm Chart is the standard way to deploy Apache Airflow on Kubernetes with Helm. Originally created in 2017, it has since helped thousands of companies create production-ready deployments of Airflow on Kubernetes.
https://github.com/airflow-helm/charts/tree/main/charts/airflow
Apache License 2.0

pods stuck in wait-for-airflow-migrations #868

Closed: espenthaem closed this issue 3 weeks ago

espenthaem commented 3 weeks ago

Chart Version

1.13.1

Kubernetes Version

Client Version: v1.30.0
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.27.13-eks-3af4770
WARNING: version difference between client (1.30) and server (1.27) exceeds the supported minor version skew of +/-1

Helm Version

version.BuildInfo{Version:"v3.15.1", GitCommit:"e211f2aa62992bd72586b395de50979e31231829", GitTreeState:"clean", GoVersion:"go1.22.3"}

Description

I'm trying to deploy Airflow, but my scheduler, triggerer, and webserver pods are stuck forever in the wait-for-airflow-migrations init container. However, a db-migrations job is never actually started.
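
The stuck init container can be inspected directly with something like this (pod name taken from the events below):

kubectl logs airflow-webserver-6cdb66595f-pr7xc \
  -n airflow-dags -c wait-for-airflow-migrations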

I'm using a custom Docker image to include my package requirements:

# Use the official Apache Airflow image from Docker Hub
FROM apache/airflow:2.8.3
USER root

# Set environment variables
ENV AIRFLOW_HOME=/opt/airflow

# Install system packages needed to build the Python requirements
RUN apt-get update \
    && apt-get install -y --no-install-recommends \
        libsasl2-dev \
        gcc \
        build-essential \
        unzip \
        python3-dev \
        default-libmysqlclient-dev \
        libpq-dev \
        jq \
    && apt-get clean \
    && rm -rf /var/lib/apt/lists/*

USER airflow

# Pin pip, then install the project requirements
ENV PIP_ENV_VERSION=24.0
RUN python -m pip install --no-cache-dir --upgrade pip==${PIP_ENV_VERSION}

COPY requirements/requirements.txt /tmp/tmp-pip/
RUN python -m pip install --no-cache-dir -r /tmp/tmp-pip/requirements.txt
RUN pip list

# Note: this replaces the base image's entrypoint, so every container
# started from this image (init containers included) runs this command
# unless the chart sets an explicit command
ENTRYPOINT ["bash", "-c", "airflow db init && airflow webserver & airflow scheduler"]
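
The image is built and pushed to GCR along these lines (the tag here is the placeholder used in the events below; CI substitutes the real one):

docker build -t eu.gcr.io/my-project/airflow-dags:my-image-tag .
docker push eu.gcr.io/my-project/airflow-dags:my-image-tag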

Relevant Logs

pod/airflow-postgresql-0                 1/1     Running    0              2m14s
pod/airflow-scheduler-79f69f58cf-75g6j   0/3     Init:0/2   0              2m14s
pod/airflow-statsd-7c56d8b68-5qntz       1/1     Running    0              2m14s
pod/airflow-triggerer-0                  0/3     Init:0/2   1 (119s ago)   2m14s
pod/airflow-webserver-6cdb66595f-pr7xc   0/1     Init:0/1   0              2m14s

Events:
  Type    Reason     Age    From               Message
  ----    ------     ----   ----               -------
  Normal  Scheduled  2m26s  default-scheduler  Successfully assigned airflow-dags/airflow-webserver-6cdb66595f-pr7xc to ip-10-30-196-195.eu-west-1.compute.internal
  Normal  Pulling    2m26s  kubelet            Pulling image "eu.gcr.io/my-project/airflow-dags:my-image-tag"
  Normal  Pulled     116s   kubelet            Successfully pulled image "eu.gcr.io/my-project/airflow-dags:my-image-tag" in 30.190834116s (30.190864716s including waiting)
  Normal  Created    115s   kubelet            Created container wait-for-airflow-migrations
  Normal  Started    115s   kubelet            Started container wait-for-airflow-migrations

Custom Helm Values

images:
  airflow:
    repository: eu.gcr.io/my-project/airflow-dags
    tag: latest
airflow:
  dbMigrations:
    enabled: true
    runAsJob: true
dags:
  gitSync:
    branch: main
    enabled: true
    repo: 'git@github.com:my-repo/airflow-dags.git'
    subPath: 'dags'
    rev: HEAD
    sshKeySecret: airflow-git-key
  persistence:
    accessMode: ReadWriteOnce
    annotations: {}
    enabled: false
    existingClaim: null
    size: 1Gi
    storageClassName: null
    subPath: null
config:
  webserver:
    expose_config: 'True'
executor: KubernetesExecutor
extraEnv: |
    - name: "AIRFLOW__CORE__PLUGINS_FOLDER"
      value: "/opt/airflow/dags/repo/plugins"
    - name: AIRFLOW__CORE__LOAD_EXAMPLES
      value: "True"
    - name: PYTHONPATH
      value: "/opt/airflow/dags/repo"
registry:
  secretName: my-project-gcr-secret-basic-service
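
The registry secret referenced above is a plain docker-registry secret; it was created roughly like this (using GCR's _json_key convention; the key file name is illustrative and the credentials are elided):

kubectl create secret docker-registry my-project-gcr-secret-basic-service \
  --namespace airflow-dags \
  --docker-server=eu.gcr.io \
  --docker-username=_json_key \
  --docker-password="$(cat gcr-key.json)"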

I'm not using the --wait flag, and I'm not deploying with ArgoCD. Here's my deploy command:

helm upgrade airflow apache-airflow/airflow --namespace airflow-dags -f helm/values.yaml --set 'images.airflow.tag=image-tag' --atomic --install --timeout=5m
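
To double-check what that release would actually create, the chart can be rendered locally and searched for the migrations job:

helm template airflow apache-airflow/airflow --namespace airflow-dags -f helm/values.yaml | grep -i migrations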
espenthaem commented 3 weeks ago

I've also discovered I can force the migration job to run by disabling the Helm hooks on the migrateDatabaseJob:

migrateDatabaseJob:
  useHelmHooks: false
  enabled: true
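
With useHelmHooks disabled, the job is created as a regular resource, so it can be watched like any other (namespace from the deploy command above):

kubectl get jobs -n airflow-dags --watch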

The migration job seems to complete the init and migration itself, but never shuts down:

WARNING:root:OSError while attempting to symlink the latest log directory
DB: postgresql://postgres:***@airflow-postgresql.airflow-dags:5432/postgres?sslmode=disable
/home/airflow/.local/lib/python3.8/site-packages/airflow/cli/commands/db_command.py:47 DeprecationWarning: `db init` is deprecated.  Use `db migrate` instead to migrate the db and/or airflow connections create-default-connections to create the default connections
[2024-06-07T10:38:47.561+0000] {migration.py:216} INFO - Context impl PostgresqlImpl.
[2024-06-07T10:38:47.571+0000] {migration.py:219} INFO - Will assume transactional DDL.
[2024-06-07T10:38:48.137+0000] {migration.py:216} INFO - Context impl PostgresqlImpl.
[2024-06-07T10:38:48.137+0000] {migration.py:219} INFO - Will assume transactional DDL.
[2024-06-07T10:38:48.166+0000] {db.py:1623} INFO - Creating tables
  ____________       _____________
 ____    |__( )_________  __/__  /________      __
____  /| |_  /__  ___/_  /_ __  /_  __ \_ | /| / /
___  ___ |  / _  /   _  __/ _  / / /_/ /_ |/ |/ /
 _/_/  |_/_/  /_/    /_/    /_/  \____/____/|__/
INFO  [alembic.runtime.migration] Context impl PostgresqlImpl.
INFO  [alembic.runtime.migration] Will assume transactional DDL.
[2024-06-07T10:38:49.226+0000] {task_context_logger.py:63} INFO - Task context logging is enabled
[2024-06-07T10:38:49.227+0000] {executor_loader.py:115} INFO - Loaded executor: KubernetesExecutor
/home/airflow/.local/lib/python3.8/site-packages/airflow/providers/cncf/kubernetes/executors/kubernetes_executor.py:165 FutureWarning: The config section [kubernetes] has been renamed to [kubernetes_executor]. Please update your `conf.get*` call to use the new name
[2024-06-07T10:38:49.433+0000] {scheduler_job_runner.py:808} INFO - Starting the scheduler
[2024-06-07T10:38:49.434+0000] {scheduler_job_runner.py:815} INFO - Processing each file at most -1 times
[2024-06-07T10:38:49.436+0000] {kubernetes_executor.py:318} INFO - Start Kubernetes executor
[2024-06-07T10:38:49.514+0000] {kubernetes_executor_utils.py:157} INFO - Event: and now my watch begins starting at resource_version: 0
[2024-06-07T10:38:49.520+0000] {kubernetes_executor.py:239} INFO - Found 0 queued task instances
[2024-06-07T10:38:49.535+0000] {manager.py:169} INFO - Launched DagFileProcessorManager with pid: 37
[2024-06-07T10:38:49.548+0000] {scheduler_job_runner.py:1608} INFO - Adopting or resetting orphaned tasks for active dag runs
[2024-06-07T10:38:49.586+0000] {settings.py:60} INFO - Configured default timezone UTC
[2024-06-07T10:38:49.682+0000] {settings.py:541} INFO - Loaded airflow_local_settings from /opt/airflow/config/airflow_local_settings.py .
[2024-06-07T10:38:49.713+0000] {scheduler_job_runner.py:1631} INFO - Marked 3 SchedulerJob instances as failed
Initialization done
[2024-06-07T10:39:07.533+0000] {configuration.py:2066} INFO - Creating new FAB webserver config file in: /opt/airflow/webserver_config.py
  ____________       _____________
 ____    |__( )_________  __/__  /________      __
____  /| |_  /__  ___/_  /_ __  /_  __ \_ | /| / /
___  ___ |  / _  /   _  __/ _  / / /_/ /_ |/ |/ /
 _/_/  |_/_/  /_/    /_/    /_/  \____/____/|__/
Running the Gunicorn Server with:
Workers: 4 sync
Host: 0.0.0.0:8080
Timeout: 120
Logfiles: - -
Access Logformat: 
espenthaem commented 3 weeks ago

I've just realized I'm actually not using the community edition of the Airflow Helm chart. My bad. I'll close this.
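
For reference, the official chart I was actually deploying comes from a different repo than the user-community chart this tracker covers:

# Official Apache Airflow chart (what apache-airflow/airflow resolves to)
helm repo add apache-airflow https://airflow.apache.org
# User-Community chart (this repository)
helm repo add airflow-stable https://airflow-helm.github.io/charts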