apache / airflow

Apache Airflow - A platform to programmatically author, schedule, and monitor workflows
https://airflow.apache.org/
Apache License 2.0

Add Production-ready docker compose for the production image #8605

Closed potiuk closed 3 years ago

potiuk commented 4 years ago

Description

We are already working on a Helm chart for the production image, but we might also want to add a production-ready docker-compose setup that can run an Airflow installation.

Use case / motivation

For local tests and small deployments, having such a docker-compose environment would be really nice.

We seem to have reached consensus that we need several docker-compose "sets" of files:

They should come in variants and make it possible to specify a number of parameters:

Depending on the setup, those docker-compose files should do proper DB initialisation.
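
For illustration, here is a minimal sketch (not part of this issue's plan; the variable names such as AIRFLOW_IMAGE_NAME are hypothetical) of how such a compose file could be parametrized through environment variables or an .env file placed next to it:

version: '3'
services:
  webserver:
    # image tag and executor are taken from the environment / .env file,
    # with defaults applied when the variables are not set
    image: ${AIRFLOW_IMAGE_NAME:-apache/airflow:1.10.10}
    environment:
      - AIRFLOW__CORE__EXECUTOR=${AIRFLOW_EXECUTOR:-CeleryExecutor}
    command: webserver
    ports:
      - 8080:8080

A matching .env file would then only need lines like AIRFLOW_IMAGE_NAME=apache/airflow:1.10.10 to switch between variants.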


Example Docker Compose (from https://apache-airflow.slack.com/archives/CQAMHKWSJ/p1587748008106000) that we might use as a base, together with #8548. This is just an example, so this issue will not implement all of it; we will likely split the docker-compose files into separate postgres/sqlite/mysql variants, similarly to what we do in the CI scripts, which is why I wanted to keep this as a separate issue. User creation will be dealt with in #8606.

version: '3'
services:
  postgres:
    image: postgres:latest
    environment:
      - POSTGRES_USER=postgres
      - POSTGRES_PASSWORD=postgres
      - POSTGRES_DB=airflow
      - POSTGRES_PORT=5432
    ports:
      - 5432:5432
  redis:
    image: redis:latest
    ports:
      - 6379:6379
  flower:
    image: apache/airflow:1.10.10
    volumes:
      - ./airflow-data/dags:/opt/airflow/dags
    environment:
      - AIRFLOW__CORE__EXECUTOR=CeleryExecutor
      - AIRFLOW__CELERY__BROKER_URL=redis://:@redis:6379/0
      - AIRFLOW__CELERY__RESULT_BACKEND=db+postgresql://postgres:postgres@postgres:5432/airflow
      - AIRFLOW__CORE__SQL_ALCHEMY_CONN=postgresql+psycopg2://postgres:postgres@postgres:5432/airflow
      - AIRFLOW__CORE__FERNET_KEY=FB0o_zt4e3Ziq3LdUUO7F2Z95cvFFx16hU8jTeR1ASM=
      - AIRFLOW__CORE__LOAD_EXAMPLES=False
      - AIRFLOW__WEBSERVER__RBAC=True
    command: flower
    ports:
      - 5555:5555
  airflow:
    image: apache/airflow:1.10.10
    environment:
      - AIRFLOW__CORE__EXECUTOR=CeleryExecutor
      - AIRFLOW__CELERY__BROKER_URL=redis://:@redis:6379/0
      - AIRFLOW__CELERY__RESULT_BACKEND=db+postgresql://postgres:postgres@postgres:5432/airflow
      - AIRFLOW__CORE__SQL_ALCHEMY_CONN=postgresql+psycopg2://postgres:postgres@postgres:5432/airflow
      - AIRFLOW__CORE__FERNET_KEY=FB0o_zt4e3Ziq3LdUUO7F2Z95cvFFx16hU8jTeR1ASM=
      - AIRFLOW__CORE__LOAD_EXAMPLES=False
      - AIRFLOW__WEBSERVER__RBAC=True
    command: webserver
    ports:
      - 8080:8080
    volumes:
      - ./airflow-data/dags:/opt/airflow/dags
      - ./airflow-data/logs:/opt/airflow/logs
      - ./airflow-data/plugins:/opt/airflow/plugins
  airflow-scheduler:
    image: apache/airflow:1.10.10
    container_name: airflow_scheduler_cont
    environment:
      - AIRFLOW__CORE__EXECUTOR=CeleryExecutor
      - AIRFLOW__CELERY__BROKER_URL=redis://:@redis:6379/0
      - AIRFLOW__CELERY__RESULT_BACKEND=db+postgresql://postgres:postgres@postgres:5432/airflow
      - AIRFLOW__CORE__SQL_ALCHEMY_CONN=postgresql+psycopg2://postgres:postgres@postgres:5432/airflow
      - AIRFLOW__CORE__FERNET_KEY=FB0o_zt4e3Ziq3LdUUO7F2Z95cvFFx16hU8jTeR1ASM=
      - AIRFLOW__CORE__LOAD_EXAMPLES=False
      - AIRFLOW__WEBSERVER__RBAC=True
    command: scheduler
    volumes:
      - ./airflow-data/dags:/opt/airflow/dags
      - ./airflow-data/logs:/opt/airflow/logs
      - ./airflow-data/plugins:/opt/airflow/plugins
  airflow-worker1:
    image: apache/airflow:1.10.10
    container_name: airflow_worker1_cont
    environment:
      - AIRFLOW__CORE__EXECUTOR=CeleryExecutor
      - AIRFLOW__CELERY__BROKER_URL=redis://:@redis:6379/0
      - AIRFLOW__CELERY__RESULT_BACKEND=db+postgresql://postgres:postgres@postgres:5432/airflow
      - AIRFLOW__CORE__SQL_ALCHEMY_CONN=postgresql+psycopg2://postgres:postgres@postgres:5432/airflow
      - AIRFLOW__CORE__FERNET_KEY=FB0o_zt4e3Ziq3LdUUO7F2Z95cvFFx16hU8jTeR1ASM=
      - AIRFLOW__CORE__LOAD_EXAMPLES=False
      - AIRFLOW__WEBSERVER__RBAC=True
    command: worker
    volumes:
      - ./airflow-data/dags:/opt/airflow/dags
      - ./airflow-data/logs:/opt/airflow/logs
      - ./airflow-data/plugins:/opt/airflow/plugins
  airflow-worker2:
    image: apache/airflow:1.10.10
    container_name: airflow_worker2_cont
    environment:
      - AIRFLOW__CORE__EXECUTOR=CeleryExecutor
      - AIRFLOW__CELERY__BROKER_URL=redis://:@redis:6379/0
      - AIRFLOW__CELERY__RESULT_BACKEND=db+postgresql://postgres:postgres@postgres:5432/airflow
      - AIRFLOW__CORE__SQL_ALCHEMY_CONN=postgresql+psycopg2://postgres:postgres@postgres:5432/airflow
      - AIRFLOW__CORE__FERNET_KEY=FB0o_zt4e3Ziq3LdUUO7F2Z95cvFFx16hU8jTeR1ASM=
      - AIRFLOW__CORE__LOAD_EXAMPLES=False
      - AIRFLOW__WEBSERVER__RBAC=True
    command: worker
    volumes:
      - ./airflow-data/dags:/opt/airflow/dags
      - ./airflow-data/logs:/opt/airflow/logs
      - ./airflow-data/plugins:/opt/airflow/plugins
  airflow-worker3:
    image: apache/airflow:1.10.10
    container_name: airflow_worker3_cont
    environment:
      - AIRFLOW__CORE__EXECUTOR=CeleryExecutor
      - AIRFLOW__CELERY__BROKER_URL=redis://:@redis:6379/0
      - AIRFLOW__CELERY__RESULT_BACKEND=db+postgresql://postgres:postgres@postgres:5432/airflow
      - AIRFLOW__CORE__SQL_ALCHEMY_CONN=postgresql+psycopg2://postgres:postgres@postgres:5432/airflow
      - AIRFLOW__CORE__FERNET_KEY=FB0o_zt4e3Ziq3LdUUO7F2Z95cvFFx16hU8jTeR1ASM=
      - AIRFLOW__CORE__LOAD_EXAMPLES=False
      - AIRFLOW__WEBSERVER__RBAC=True
    command: worker
    volumes:
      - ./airflow-data/dags:/opt/airflow/dags
      - ./airflow-data/logs:/opt/airflow/logs
      - ./airflow-data/plugins:/opt/airflow/plugins

Another example from https://apache-airflow.slack.com/archives/CQAMHKWSJ/p1587679356095400:

version: '3.7'
networks:
  airflow:
    name: airflow
    attachable: true
volumes:
  logs:
x-database-env: 
  &database-env
  POSTGRES_USER: airflow
  POSTGRES_DB: airflow
  POSTGRES_PASSWORD: airflow
x-airflow-env: 
  &airflow-env
  AIRFLOW__CORE__EXECUTOR: CeleryExecutor
  AIRFLOW__WEBSERVER__RBAC: 'True'
  AIRFLOW__CORE__CHECK_SLAS: 'False'
  AIRFLOW__CORE__STORE_SERIALIZED_DAGS: 'False'
  AIRFLOW__CORE__PARALLELISM: 50
  AIRFLOW__CORE__LOAD_EXAMPLES: 'False'
  AIRFLOW__CORE__LOAD_DEFAULT_CONNECTIONS: 'False'
  AIRFLOW__SCHEDULER__SCHEDULER_HEARTBEAT_SEC: 10

services:
  postgres:
    image: postgres:11.5
    environment:
      <<: *database-env
      PGDATA: /var/lib/postgresql/data/pgdata
    ports:
      - 5432:5432
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
      - ./database/data:/var/lib/postgresql/data/pgdata
      - ./database/logs:/var/lib/postgresql/data/log
    command: >
     postgres
       -c listen_addresses=*
       -c logging_collector=on
       -c log_destination=stderr
       -c max_connections=200
    networks:
      - airflow
  redis:
    image: redis:5.0.5
    environment:
      REDIS_HOST: redis
      REDIS_PORT: 6379
    ports:
      - 6379:6379
    networks:
      - airflow
  webserver:
    image: airflow:1.10.10
    user: airflow
    ports:
      - 8090:8080
    volumes:
      - ./dags:/opt/airflow/dags
      - logs:/opt/airflow/logs
      - ./files:/opt/airflow/files
      - /var/run/docker.sock:/var/run/docker.sock
    environment:
      <<: *database-env
      <<: *airflow-env
      ADMIN_PASSWORD: airflow
    depends_on:
      - postgres
      - redis
    command: webserver
    healthcheck:
      test: ["CMD-SHELL", "[ -f /opt/airflow/airflow-webserver.pid ]"]
      interval: 30s
      timeout: 30s
      retries: 3
    networks:
      - airflow
  flower:
    image: airflow:1.10.10
    user: airflow
    ports:
      - 5555:5555
    depends_on:
      - redis
    volumes:
      - logs:/opt/airflow/logs
    command: flower
    networks:
      - airflow
  scheduler:
    image: airflow:1.10.10
    volumes:
      - ./dags:/opt/airflow/dags
      - logs:/opt/airflow/logs
      - ./files:/opt/airflow/files
      - /var/run/docker.sock:/var/run/docker.sock
    environment:
      <<: *database-env
    command: scheduler
    networks:
      - airflow
  worker:
    image: airflow:1.10.10
    user: airflow
    volumes:
      - ./dags:/opt/airflow/dags
      - logs:/opt/airflow/logs
      - ./files:/opt/airflow/files
      - /var/run/docker.sock:/var/run/docker.sock
    environment:
      <<: *database-env
    command: worker
    depends_on:
      - scheduler

Related issues: the initial user creation #8606 and #8548; quick start documentation planned in #8542.

potiuk commented 4 years ago

This one duplicates #8548 a bit - but I want to leave it for a while as I wanted to split it into smaller functional pieces.

kaxil commented 4 years ago

It would be nice to have this in the "Quick Start Guide when using the Docker Image" too. WDYT?

potiuk commented 4 years ago

Absolutely. It's already planned in #8542 :)

potiuk commented 4 years ago

Added missing label :)

habibdhif commented 4 years ago

Here is another example of a Docker Compose setup that I've been working on. The Compose file defines multiple services to run Airflow. There is an init service, an ephemeral container that initializes the database and creates a user if necessary. The init service command tries to run airflow list_users and, if that fails, it initializes the database and creates a user. Different approaches were considered, but this one is simple enough and only involves airflow commands (no database-specific commands).

Extension fields are used for airflow environment variables to reduce code duplication.

I added a Makefile alongside the docker-compose.yml in my repo, so all you have to do to bring the stack up is run make run.

version: "3.7"
x-airflow-environment: &airflow-environment
  AIRFLOW__CORE__EXECUTOR: CeleryExecutor
  AIRFLOW__WEBSERVER__RBAC: "True"
  AIRFLOW__CORE__LOAD_EXAMPLES: "False"
  AIRFLOW__CELERY__BROKER_URL: "redis://:@redis:6379/0"
  AIRFLOW__CORE__SQL_ALCHEMY_CONN: postgresql+psycopg2://airflow:airflow@postgres:5432/airflow

services:
  postgres:
    image: postgres:11.5
    environment:
      POSTGRES_USER: airflow
      POSTGRES_DB: airflow
      POSTGRES_PASSWORD: airflow
  redis:
    image: redis:5
    environment:
      REDIS_HOST: redis
      REDIS_PORT: 6379
    ports:
      - 6379:6379
  init:
    image: apache/airflow:1.10.10
    environment:
      <<: *airflow-environment
    depends_on:
      - redis
      - postgres
    volumes:
      - ./dags:/opt/airflow/dags
    entrypoint: /bin/bash
    command: >
      -c "airflow list_users || (airflow initdb
      && airflow create_user --role Admin --username airflow --password airflow -e airflow@airflow.com -f airflow -l airflow)"
    restart: on-failure
  webserver:
    image: apache/airflow:1.10.10
    ports:
      - 8080:8080
    environment:
      <<: *airflow-environment
    depends_on:
      - init
    volumes:
      - ./dags:/opt/airflow/dags
    command: "webserver"
    restart: always
  flower:
    image: apache/airflow:1.10.10
    ports:
      - 5555:5555
    environment:
      <<: *airflow-environment
    depends_on:
      - redis
    command: flower
    restart: always
  scheduler:
    image: apache/airflow:1.10.10
    environment:
      <<: *airflow-environment
    depends_on:
      - webserver
    volumes:
      - ./dags:/opt/airflow/dags
    command: scheduler
    restart: always
  worker:
    image: apache/airflow:1.10.10
    environment:
      <<: *airflow-environment
    depends_on:
      - scheduler
    volumes:
      - ./dags:/opt/airflow/dags
    command: worker
    restart: always
infused-kim commented 4 years ago

Here's my docker-compose config using LocalExecutor...

docker-compose.airflow.yml:

version: '2.1'
services:
    airflow:
        # image: apache/airflow:1.10.10
        build:
            context: .
            args:
                - DOCKER_UID=${DOCKER_UID-1000} 
            dockerfile: Dockerfile
        restart: always
        environment:
            - AIRFLOW__CORE__SQL_ALCHEMY_CONN=postgres://airflow:${POSTGRES_PW-airflow}@postgres:5432/airflow
            - AIRFLOW__CORE__FERNET_KEY=${AF_FERNET_KEY-GUYoGcG5xdn5K3ysGG3LQzOt3cc0UBOEibEPxugDwas=}
            - AIRFLOW__CORE__EXECUTOR=LocalExecutor
            - AIRFLOW__CORE__AIRFLOW_HOME=/opt/airflow/
            - AIRFLOW__CORE__LOAD_EXAMPLES=False
            - AIRFLOW__CORE__LOAD_DEFAULT_CONNECTIONS=False
            - AIRFLOW__CORE__LOGGING_LEVEL=${AF_LOGGING_LEVEL-info}
        volumes:
            - ../airflow/dags:/opt/airflow/dags:z
            - ../airflow/plugins:/opt/airflow/plugins:z
            - ./volumes/airflow_data_dump:/opt/airflow/data_dump:z
            - ./volumes/airflow_logs:/opt/airflow/logs:z
        healthcheck:
            test: ["CMD-SHELL", "[ -f /opt/airflow/airflow-webserver.pid ]"]
            interval: 30s
            timeout: 30s
            retries: 3

docker-compose.yml:

version: '2.1'
services:
    postgres:
        image: postgres:9.6
        container_name: af_postgres
        environment:
            - POSTGRES_USER=airflow
            - POSTGRES_PASSWORD=${POSTGRES_PW-airflow}
            - POSTGRES_DB=airflow
            - PGDATA=/var/lib/postgresql/data/pgdata
        volumes:
            - ./volumes/postgres_data:/var/lib/postgresql/data/pgdata:Z
        ports:
            -  127.0.0.1:5432:5432

    webserver:
        extends:
            file: docker-compose.airflow.yml
            service: airflow
        container_name: af_webserver
        command: webserver
        depends_on:
            - postgres
        ports:
            - ${DOCKER_PORTS-8080}
        networks:
            - proxy
            - default
        environment:
            # Web Server Config
            - AIRFLOW__WEBSERVER__DAG_DEFAULT_VIEW=graph
            - AIRFLOW__WEBSERVER__HIDE_PAUSED_DAGS_BY_DEFAULT=true
            - AIRFLOW__WEBSERVER__RBAC=true

            # Web Server Performance tweaks
            # 2 * NUM_CPU_CORES + 1
            - AIRFLOW__WEBSERVER__WORKERS=${AF_WORKERS-2}
            # Restart workers every 30min instead of 30seconds
            - AIRFLOW__WEBSERVER__WORKER_REFRESH_INTERVAL=1800
        labels:
            - "traefik.enable=true"
            - "traefik.http.routers.airflow.rule=Host(`af.example.com`)"
            - "traefik.http.routers.airflow.middlewares=admin-auth@file"

    scheduler:
        extends:
            file: docker-compose.airflow.yml
            service: airflow
        container_name: af_scheduler
        command: scheduler
        depends_on:
            - postgres
        environment:
            # Performance Tweaks
            # Reduce how often DAGs are reloaded to dramatically reduce CPU use
            - AIRFLOW__SCHEDULER__MIN_FILE_PROCESS_INTERVAL=${AF_MIN_FILE_PROCESS_INTERVAL-60} 
            - AIRFLOW__SCHEDULER__MAX_THREADS=${AF_THREADS-1}

networks:
    proxy:
        external: true

Dockerfile:

# Custom Dockerfile
FROM apache/airflow:1.10.10

# Install mssql support & dag dependencies
USER root
RUN apt-get update -yqq \
    && apt-get install -y gcc freetds-dev \
    && apt-get install -y git procps \ 
    && apt-get install -y vim
RUN pip install apache-airflow[mssql,ssh,s3,slack]
RUN pip install azure-storage-blob sshtunnel google-api-python-client oauth2client \
    && pip install git+https://github.com/infusionsoft/Official-API-Python-Library.git \
    && pip install rocketchat_API

# This fixes permission issues on linux. 
# The airflow user should have the same UID as the user running docker on the host system.
# make build adjusts this value automatically
ARG DOCKER_UID
RUN \
    : "${DOCKER_UID:?Build argument DOCKER_UID needs to be set and non-empty. Use 'make build' to set it automatically.}" \
    && usermod -u ${DOCKER_UID} airflow \
    && find / -path /proc -prune -o -user 50000 -exec chown -h airflow {} \; \
    && echo "Set airflow's uid to ${DOCKER_UID}"

USER airflow

Makefile

And here's my Makefile to control the containers, e.g. with make run:

SERVICE = "scheduler"
TITLE = "airflow containers"
ACCESS = "http://af.example.com"

.PHONY: run

build:
    docker-compose build

run:
    @echo "Starting $(TITLE)"
    docker-compose up -d
    @echo "$(TITLE) running on $(ACCESS)"

runf:
    @echo "Starting $(TITLE)"
    docker-compose up

stop:
    @echo "Stopping $(TITLE)"
    docker-compose down

restart: stop print-newline run

tty:
    docker-compose run --rm --entrypoint='' $(SERVICE) bash

ttyr:
    docker-compose run --rm --entrypoint='' -u root $(SERVICE) bash

attach:
    docker-compose exec $(SERVICE) bash

attachr:
    docker-compose exec -u root $(SERVICE) bash

logs:
    docker-compose logs --tail 50 --follow $(SERVICE)

conf:
    docker-compose config

initdb:
    docker-compose run --rm $(SERVICE) initdb

upgradedb:
    docker-compose run --rm $(SERVICE) upgradedb

print-newline:
    @echo ""
    @echo ""
wittfabian commented 4 years ago

@potiuk Is this the preferred way to add dependencies (airflow-mssql)?

# Custom Dockerfile
FROM apache/airflow:1.10.10

# Install mssql support & dag dependencies
USER root
RUN apt-get update -yqq \
    && apt-get install -y gcc freetds-dev \
    && apt-get install -y git procps \ 
    && apt-get install -y vim
RUN pip install apache-airflow[mssql,ssh,s3,slack]
RUN pip install azure-storage-blob sshtunnel google-api-python-client oauth2client \
    && pip install git+https://github.com/infusionsoft/Official-API-Python-Library.git \
    && pip install rocketchat_API

# This fixes permission issues on linux. 
# The airflow user should have the same UID as the user running docker on the host system.
# make build adjusts this value automatically
ARG DOCKER_UID
RUN \
    : "${DOCKER_UID:?Build argument DOCKER_UID needs to be set and non-empty. Use 'make build' to set it automatically.}" \
    && usermod -u ${DOCKER_UID} airflow \
    && find / -path /proc -prune -o -user 50000 -exec chown -h airflow {} \; \
    && echo "Set airflow's uid to ${DOCKER_UID}"

USER airflow
potiuk commented 4 years ago

I think the preferred way will be to set the AIRFLOW_EXTRAS variable properly and pass it as a --build-arg.

They are defined like that in the Dockerfile:

ARG AIRFLOW_EXTRAS="async,aws,azure,celery,dask,elasticsearch,gcp,kubernetes,mysql,postgres,redis,slack,ssh,statsd,virtualenv"

and when building the Dockerfile you can override them with --build-arg AIRFLOW_EXTRAS="...."

I think it might be worth having "additional extras" that get appended, though.
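
For illustration only (the extras listed below are arbitrary, not a recommendation), overriding the extras when building the image might look like this:

docker build . \
  --build-arg AIRFLOW_EXTRAS="async,celery,postgres,redis,ssh,statsd" \
  --tag my-airflow-image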

infused-kim commented 4 years ago

Oh, that's super cool. But do you have to rebuild the entire Airflow image for that? Or can you just add the build arg in the docker-compose file and have it propagate through to the published Airflow image?

potiuk commented 4 years ago

You should also be able to build a new image using the ONBUILD feature - for building images that depend on the base one. I added a separate issue here: #8872

wittfabian commented 4 years ago

The same applies to additional Python packages. https://github.com/puckel/docker-airflow/blob/master/Dockerfile#L64

if [ -n "${PYTHON_DEPS}" ]; then pip install ${PYTHON_DEPS}; fi

feluelle commented 4 years ago

My Apache Airflow docker-compose file for running LocalExecutor with postgres using official production Dockerfile

Moved to gist

xnuinside commented 4 years ago

my two cents: https://github.com/xnuinside/airflow_in_docker_compose/blob/master/docker-compose-with-celery-executor.yml and .env file for it https://github.com/xnuinside/airflow_in_docker_compose/blob/master/.env

Ready to bring up and run. For production, though, you need to turn on RBAC.

JavierLopezT commented 4 years ago

Hello. I made a mix of the examples here to build my own set of Docker files, ending up with a docker-compose file, a Dockerfile and a Makefile. Using the docker-compose and Makefile from this post as a starting point, I have already solved some of the problems we encountered while adapting it to our needs. However, as a Docker and Airflow noob, I would have liked these needs to have been addressed by an agreed-upon best-practice solution, so I'll mention them in case you can include them in the future files (or cover how to address them in a tutorial or something):

Regarding the docker-compose file, I would like to see an explanation of why the webserver and the scheduler run as separate containers and how that works. For instance, I don't understand whether, in some cases, a command could be added to just one of the containers.

The code I currently have is:

Dockerfile

FROM apache/airflow:1.10.10

COPY plugins/aws_secrets_manager_backend.py /home/airflow/.local/lib/python3.6/site-packages/airflow/contrib/secrets/aws_secrets_manager.py
COPY plugins/aws_secrets_manager_hook.py /home/airflow/.local/lib/python3.6/site-packages/airflow/hooks/aws_secrets_manager_hook.py

COPY hooks_init.py /home/airflow/.local/lib/python3.6/site-packages/airflow/hooks/__init__.py
COPY aws_config /home/airflow/.aws/config
COPY aws_credentials /home/airflow/.aws/credentials
COPY requirements.txt requirements.txt

RUN pip3 install -r requirements.txt --user

docker-compose

version: "3.7"
x-airflow-environment: &airflow-environment
  AIRFLOW__CORE__EXECUTOR: LocalExecutor
  AIRFLOW__CORE__LOAD_EXAMPLES: "False"
  AIRFLOW__CORE__SQL_ALCHEMY_CONN: postgresql+psycopg2://airflow:airflow@postgres:5432/airflow
  AIRFLOW__CORE__FERNET_KEY: FB0o_zt4e3Ziq3LdUUO7F2Z95cvFFx16hU8jTeR1ASM=
  AIRFLOW__WEBSERVER__DAG_DEFAULT_VIEW: graph
  AIRFLOW__SECRETS__BACKEND: airflow.contrib.secrets.aws_secrets_manager.SecretsManagerBackend
  AIRFLOW__OPERATORS__DEFAULT_RAM: 2048

services:
  postgres:
    image: postgres:11.5
    environment:
      POSTGRES_USER: airflow
      POSTGRES_DB: airflow
      POSTGRES_PASSWORD: airflow
  init:
    build: .
    environment:
      <<: *airflow-environment
    depends_on:
      - postgres
    volumes:
      - ./dags:/opt/airflow/dags
      - ./plugins:/opt/airflow/plugins
      - ./logs:/opt/airflow/logs
    entrypoint: /bin/bash
    command: >
      -c "airflow list_users || (airflow initdb
      && airflow create_user --role Admin --username airflow --password airflow -e airflow@airflow.com -f airflow -l airflow)"
    restart: on-failure
  webserver:
    build: .
    ports:
      - 8080:8080
    environment:
      <<: *airflow-environment
    depends_on:
      - init
    volumes:
      - ./dags:/opt/airflow/dags
      - ./plugins:/opt/airflow/plugins
      - ./variables_secret.json:/opt/airflow/variables_secret.json
      - ./logs:/opt/airflow/logs
      - ./utilities:/opt/airflow/utilities
    entrypoint: /bin/bash
    command: -c "airflow variables -i /opt/airflow/variables_secret.json && airflow webserver"
    restart: always
  scheduler:
    build: .
    environment:
      <<: *airflow-environment
    depends_on:
      - webserver
    volumes:
      - ./dags:/opt/airflow/dags
      - ./plugins:/opt/airflow/plugins
      - ./variables_secret.json:/opt/airflow/variables_secret.json
      - ./logs:/opt/airflow/logs
      - ./utilities:/opt/airflow/utilities
    entrypoint: /bin/bash
    command: -c "airflow variables -i /opt/airflow/variables_secret.json && airflow scheduler"
    restart: always

Makefile

.PHONY: run stop rm

run:
    docker-compose -f docker-compose.yml up -d --remove-orphans --build --force-recreate
    @echo "Airflow running on http://localhost:8080"

stop:
    docker-compose -f docker-compose.yml stop

rm: stop
    docker-compose -f docker-compose.yml rm

Kind regards

JavierLopezT commented 4 years ago

Hello. I have encountered another issue. I want to use Sphinx's make html command within the container, but I have found that the make command is not available. So I have added the following lines to the Dockerfile:

USER root
RUN sudo apt-get update && sudo apt-get install build-essential -y
USER airflow

Maybe this is not the best approach, and it may be a good idea to address this somewhere else. Kind regards

potiuk commented 4 years ago

Hello. I have encountered another issue. I want to use Sphinx's make html command within the container, but I have found that the make command is not available. So I have added the following lines to the Dockerfile:

USER root
RUN sudo apt-get update && sudo apt-get install build-essential -y
USER airflow

Maybe this is not the best approach, and it may be a good idea to address this somewhere else. Kind regards

This is already addressed - see https://github.com/apache/airflow/blob/master/IMAGES.rst#production-images, where there are examples of how to manually build the images.

The production image is highly optimized for size, so it is a multi-stage one - the first stage is used to add "build dependencies" (build-essential is there), but then only the compiled libraries and Python code are copied to the "main" image - which makes it around 200MB instead of at least 400MB.

Your best bet, in this case, is to add commands to the Dockerfile in the "build" stage and copy whatever it produces via COPY --from if you are using the production image.
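
To illustrate the multi-stage pattern described above (a generic sketch only, not the actual Airflow Dockerfile; the stage names and the pymssql package are just examples):

# build stage: contains compilers and -dev headers
FROM python:3.7-slim-buster AS builder
RUN apt-get update \
    && apt-get install -y --no-install-recommends build-essential freetds-dev \
    && rm -rf /var/lib/apt/lists/*
# install into ~/.local so the result is easy to copy out of this stage
RUN pip install --user pymssql

# final stage: only the installed artifacts are copied, the build deps stay behind
FROM python:3.7-slim-buster
# note: any runtime shared libraries (e.g. FreeTDS' libsybdb for pymssql)
# would still have to be installed here for the package to import
COPY --from=builder /root/.local /root/.local
ENV PATH=/root/.local/bin:$PATH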

Also, you can watch my talk about it from the Airflow Summit, where I discuss the image: https://s.apache.org/airflow-prod-image

infused-kim commented 4 years ago

Instead of modifying the existing image, you can also build from the finished image and add your own stuff.

Here's my image for example:

FROM apache/airflow:1.10.12

USER root

# This fixes permission issues on linux. 
# The airflow user should have the same UID as the user running docker on the host system.
# make build adjusts this value automatically
ARG DOCKER_UID
RUN \
    : "${DOCKER_UID:?Build argument DOCKER_UID needs to be set and non-empty. Use 'make build' to set it automatically.}" \
    && usermod -u ${DOCKER_UID} airflow \
    && groupmod -g ${DOCKER_UID} airflow \
    && chown -Rhc --from=50000 ${DOCKER_UID} / || true \
    && chown -Rhc --from=:50000 :${DOCKER_UID} / || true \
    && echo "Set airflow's uid and gid to ${DOCKER_UID}"

# Install cmd utils
RUN apt-get update -yqq \
    && apt-get install -y git \
                          procps \ 
                          vim

# Install MS SQL Support (ODBC Driver)
RUN apt-get update && apt-get install -y gnupg curl && curl https://packages.microsoft.com/keys/microsoft.asc | apt-key add --no-tty - && curl https://packages.microsoft.com/config/debian/10/prod.list > /etc/apt/sources.list.d/mssql-release.list && apt-get update && ACCEPT_EULA=Y apt-get install -y msodbcsql17 unixodbc-dev g++

USER airflow

# Install Apache 2.0 backports for mssql
RUN pip install --user apache-airflow-backport-providers-odbc \
                       apache-airflow-backport-providers-microsoft-mssql

# Install airflow packages
RUN pip install --user apache-airflow[slack]

# Install plugin dependencies
RUN pip install --user azure-storage-blob \
                       sshtunnel \
                       google-api-python-client \
                       oauth2client \
                       beautifulsoup4 \
                       dateparser \
                       rocketchat_API \
                       typeform
potiuk commented 4 years ago

Instead of modifying the existing image, you can also build from the finished image and add your own stuff.

That's true - you can do that. The drawback of this solution, though, is that the image will be much bigger: in this case it will contain unixodbc-dev and g++, which on their own drag in a number of dependencies (most of build-essential), likely adding several hundred MB of stuff that is not needed in the final image.

I am actually thinking about how to make it even easier to accommodate such cases; I will think a bit about how this can be done and try to address it in #10856.

infused-kim commented 4 years ago

I just watched your presentation on Docker and it's really amazing how much thought and effort you put into the image and its optimization.

Building the custom size-optimized image from source is a great option for many people, especially if they are working in a corporate environment and need the security review.

But for many others, especially smaller businesses, ease of getting started, setup, and maintenance can be more important.

So after reviewing both options, I will stick with the extension option.

I am also looking forward to seeing official docker-compose files. I think right now getting started with airflow in docker is a bit difficult.

It used to be very easy with puckel's image, but it's outdated now. And if someone wants to run airflow in docker, they have to come up with their own compose files and hunt down examples all over the place.

It can be difficult without an official example, especially considering you have to run multiple containers for the scheduler and webserver.

infused-kim commented 4 years ago

One more thought:

I think there are multiple purposes for a docker image. The most important is, of course, running in production, which justifies a more complicated process.

But I think many people also use docker to quickly test-drive software.

I know that from my own experience. Whenever I consider a new open source software, the first thing I check is whether they have a docker image and ideally a docker-compose example.

This way I can get a decent example setup in just a few minutes to evaluate the software.

So even if using docker-compose is not recommended in production, I think creating official examples would be great for the future adoption of Airflow.

potiuk commented 4 years ago

One more thought:

I think there are multiple purposes for a docker image. The most important is, of course, running in production, which justifies a more complicated process.

But I think many people also use docker to quickly test-drive software.

I know that from my own experience. Whenever I consider a new open source software, the first thing I check is whether they have a docker image and ideally a docker-compose example.

This way I can get a decent example setup in just a few minutes to evaluate the software.

So even if using docker-compose is not recommended in production, I think creating official examples would be great for the future adoption of Airflow.

Agree!!

potiuk commented 4 years ago

Those are all valid points, and I was kind of waiting for someone to come up with them. When I looked at Puckel's image, it was not really "production ready" and it did pretty much "everything but the kitchen sink" ;). So I figured the best option would be to start from something well engineered that does only one thing well - being optimised for production - then listen to people complaining about what they miss from Puckel's setup, and implement that "well" without breaking the optimisations.

Now, since I have already heard this several times, it seems this is a super-valid use case that people want to use the image for, and that's a lot of great information that might help me design it well :).

I will take a look at that shortly!

infused-kim commented 4 years ago

Totally valid to wait for someone else to come up with that.

I have posted my setup here.

If it meets the AF quality standard, I would be happy to create a PR if that makes things easier for you.

kaxil commented 4 years ago

Since docker-compose is just used for dev, please go for it @KimchaC. We can iterate on it if needed.

potiuk commented 4 years ago

absolutely!

potiuk commented 3 years ago

@KimchaC FYI. I have just submitted PR #11176, which should make it possible to build even complex images like the one you described in https://github.com/apache/airflow/issues/8605#issuecomment-690065621 by passing appropriate build args. In fact, I even made an example of how to build such an image based on yours.

There are a few more changes coming (we've implemented quite a few extensions to the build process while working on a customer project, and I am just contributing them back). The final version of the Dockerfile/Breeze/Docker build process we come up with will produce super-optimized (for size) images, with very high customizability of all the components of the image build - to the point that you can even use it to build an Airflow image on an air-gapped system.

The nice thing about it is that the customer can fully rely on the Airflow Dockerfile process, keep up with future changes, add their own customisations as needed, and have full control over the build process.

This is all the result of our project with a really big customer who was very concerned about the security of the images, the binaries, and their whole build process. In the end we are going to have (I hope) a super-flexible image that we can develop further and that will be equally easy to use in a CI/OSS environment and in a stricter corporate environment. A few more PRs are coming (we already have them in the customer's fork, but we are bringing them in one by one). Everything is under the #11171 umbrella.

It would be great, @KimchaC, if you tried it out with your setup, configuration, and maybe other customizations - this way we could possibly implement anything we missed. Looking forward to it!

potiuk commented 3 years ago

@KimchaC ^ PR merged. You could try to build your image now using the command-line parameters and compare the size vs. your previous image. I bet it will be quite a bit smaller.

kvenkat88 commented 3 years ago

Hi @potiuk ,

At this link (https://hub.docker.com/r/apache/airflow/dockerfile), there is no production-level Dockerfile anymore. I saw the production version yesterday.

I am finding it difficult to understand the production-level image creation (mainly customizing the image).

I am following the two links below for reference:

https://airflow.readthedocs.io/en/latest/production-deployment.html#production-image-build-arguments

https://github.com/apache/airflow/blob/master/IMAGES.rst#production-images

This builds the production image for Python 3.7 with additional airflow extras from the 1.10.10 PyPI package and additional apt dev and runtime dependencies (as per production-deployment.html):

docker build . \
  --build-arg PYTHON_BASE_IMAGE="python:3.7-slim-buster" \
  --build-arg PYTHON_MAJOR_MINOR_VERSION=3.7 \
  --build-arg AIRFLOW_INSTALL_SOURCES="apache-airflow" \
  --build-arg AIRFLOW_INSTALL_VERSION="==1.10.12" \
  --build-arg AIRFLOW_CONSTRAINTS_REFERENCE="constraints-1-10" \
  --build-arg AIRFLOW_SOURCES_FROM="empty" \
  --build-arg AIRFLOW_SOURCES_TO="/empty" \
  --build-arg ADDITIONAL_AIRFLOW_EXTRAS="jdbc" \
  --build-arg ADDITIONAL_PYTHON_DEPS="pandas" \
  --build-arg ADDITIONAL_DEV_APT_DEPS="gcc g++" \
  --build-arg ADDITIONAL_RUNTIME_APT_DEPS="default-jre-headless" \
  --tag my-image

  1. Do we need to pass the Dockerfile with the -f flag? I have tried the above command and observed that it is looking for a Dockerfile.

  2. If AIRFLOW_INSTALL_SOURCES=".", it installs from local sources (as per the documentation). How does that work?

  3. When I use the above command with -f Dockerfile, during the build process I get this error while running the step COPY scripts/docker scripts/docker: "COPY failed: stat /var/lib/docker/tmp/docker-builder125807076/scripts/docker: no such file or directory". Do I have to clone the git repo and then use the docker build command?

potiuk commented 3 years ago

At this link (https://hub.docker.com/r/apache/airflow/dockerfile), there is no production-level Dockerfile anymore. I saw the production version yesterday.

I see it there (even in incognito mode). Must have been a temporary glitch of DockerHub.

  1. Do we need to pass the Dockerfile with the -f flag? I have tried the above command and observed that it is looking for a Dockerfile.

As mentioned in the docs above, if you want to customize the image you need to check out the Airflow sources and run the docker command inside them. As is the case with most Dockerfiles, the build needs a context ("." in the command) and some extra files (for example the entrypoint scripts) that have to be available in that context, and the easiest way is to check out the Airflow sources at the right version and customize the image from there.

You can find a nice description here: https://airflow.readthedocs.io/en/latest/production-deployment.html - we moved the documentation to "docs" and it has not yet been released (it will be in 1.10.13), but you can use the "latest" version. It contains a detailed description of customizing vs. extending, and even a nice table showing the differences - one point there is that you need to use the Airflow sources to customize the image.

  2. If AIRFLOW_INSTALL_SOURCES=".", it installs from local sources (as per the documentation). How does that work?

See above - you need to run it inside the checked-out sources of Airflow.

  3. When I use the above command with -f Dockerfile, during the build process I get this error while running the step COPY scripts/docker scripts/docker: "COPY failed: stat /var/lib/docker/tmp/docker-builder125807076/scripts/docker: no such file or directory". Do I have to clone the git repo and then use the docker build command?

Yes. That's the whole point - customisation only works if you have sources of Airflow checked out.
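
A hedged sketch of that workflow (the branch name and build args below are only illustrative, borrowed from the build-arg examples earlier in this thread):

# check out the Airflow sources so the Dockerfile has its expected build context
git clone https://github.com/apache/airflow.git
cd airflow
git checkout v1-10-stable
# customize the production image via build args
docker build . \
  --build-arg AIRFLOW_INSTALL_SOURCES="apache-airflow" \
  --build-arg AIRFLOW_INSTALL_VERSION="==1.10.12" \
  --build-arg ADDITIONAL_PYTHON_DEPS="pandas" \
  --tag my-image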

kaxil commented 3 years ago

I think we should get this one in before 2.0.0rc1 - is someone willing to work on it?

kaxil commented 3 years ago

Also, I don't think the docker-compose files need to be production-ready. They should just be meant for local development, or for quickly starting and working on Airflow locally with different executors.

potiuk commented 3 years ago

Also, I don't think the docker-compose files need to be production-ready. They should just be meant for local development, or for quickly starting and working on Airflow locally with different executors.

Agree. Starting small is good.

ryw commented 3 years ago

@potiuk should we move milestone to 2.1 for this?

potiuk commented 3 years ago

Yep. Just did :).

mik-laj commented 3 years ago

My docker compose:

version: '3'
x-airflow-common:
  &airflow-common
  image: apache/airflow:1.10.12
  environment:
    - AIRFLOW__CORE__EXECUTOR=CeleryExecutor
    - AIRFLOW__CORE__SQL_ALCHEMY_CONN=mysql://root@mysql/airflow?charset=utf8mb4
    - AIRFLOW__CORE__SQL_ENGINE_COLLATION_FOR_IDS=utf8mb3_general_ci
    - AIRFLOW__CELERY__BROKER_URL=redis://:@redis:6379/0
    - AIRFLOW__CELERY__RESULT_BACKEND=redis://:@redis:6379/0
    - AIRFLOW__CORE__FERNET_KEY=FB0o_zt4e3Ziq3LdUUO7F2Z95cvFFx16hU8jTeR1ASM=
    - AIRFLOW__CORE__LOAD_EXAMPLES=False
    - AIRFLOW__CORE__LOGGING_LEVEL=Debug
    - AIRFLOW__CORE__DAGS_ARE_PAUSED_AT_CREATION=False
    - AIRFLOW__WEBSERVER__RBAC=True
    - AIRFLOW__CORE__STORE_SERIALIZED_DAGS=True
    - AIRFLOW__CORE__STORE_DAG_CODE=True
  volumes:
    - ./dags:/opt/airflow/dags
    - ./airflow-data/logs:/opt/airflow/logs
    - ./airflow-data/plugins:/opt/airflow/plugins
  depends_on:
    - redis
    - mysql

services:
  mysql:
    image: mysql:5.7
    environment:
      - MYSQL_ALLOW_EMPTY_PASSWORD=true
      - MYSQL_ROOT_HOST=%
      - MYSQL_DATABASE=airflow
    volumes:
      - ./mysql/conf.d:/etc/mysql/conf.d:ro
      - /dev/urandom:/dev/random   # Required to get non-blocking entropy source
      - ./airflow-data/mysql-db-volume:/var/lib/mysql
    ports:
      - "3306:3306"
    command:
      - mysqld
      - --character-set-server=utf8mb4
      - --collation-server=utf8mb4_unicode_ci

  redis:
    image: redis:latest
    ports:
      - 6379:6379

  flower:
    << : *airflow-common
    command: flower
    ports:
      - 5555:5555

  airflow-init:
    << : *airflow-common
    container_name: airflow_init
    entrypoint: /bin/bash
    command:
      - -c
      - airflow list_users || (
          airflow initdb &&
          airflow create_user
            --role Admin
            --username airflow
            --password airflow
            --email airflow@airflow.com
            --firstname airflow
            --lastname airflow
        )
    restart: on-failure

  airflow-webserver:
    << : *airflow-common
    command: webserver
    ports:
      - 8080:8080
    restart: always

  airflow-scheduler:
    << : *airflow-common
    container_name: airflow_scheduler
    command:
      - scheduler
      - --run-duration
      - '30'
    restart: always

  airflow-worker:
    << : *airflow-common
    container_name: airflow_worker1
    command: worker
    restart: always
mik-laj commented 3 years ago

@BasPH shared on Slack: one-line command to start Airflow in docker:

In case you’ve ever wondered how to get the Airflow image to work in a one-liner (for demo purposes), here’s how:

docker run -ti -p 8080:8080 -v yourdag.py:/opt/airflow/dags/yourdag.py --entrypoint=/bin/bash apache/airflow:2.0.0b3-python3.8 -c '(airflow db init && airflow users create --username admin --password admin --firstname Anonymous --lastname Admin --role Admin --email admin@example.org); airflow webserver & airflow scheduler'

Creates a user admin/admin and runs a SQLite metastore in the container

https://apache-airflow.slack.com/archives/CQAMHKWSJ/p1608152276070500

mik-laj commented 3 years ago

I have prepared docker-compose files for some common configurations.

Postgres - Redis - Airflow 2.0

version: '3'
x-airflow-common:
  &airflow-common
  image: apache/airflow:1.10.14
  environment:
    - AIRFLOW__CORE__EXECUTOR=CeleryExecutor
    - AIRFLOW__CORE__SQL_ALCHEMY_CONN=postgresql+psycopg2://airflow:airflow@postgres/airflow
    - AIRFLOW__CELERY__RESULT_BACKEND=db+postgresql://airflow:airflow@postgres/airflow
    #- AIRFLOW__CELERY__RESULT_BACKEND=redis://:@redis:6379/0
    - AIRFLOW__CELERY__BROKER_URL=redis://:@redis:6379/0
    - AIRFLOW__WEBSERVER__RBAC=True
    - AIRFLOW__CORE__FERNET_KEY=FB0o_zt4e3Ziq3LdUUO7F2Z95cvFFx16hU8jTeR1ASM=
    - AIRFLOW__CORE__DAGS_ARE_PAUSED_AT_CREATION=True
  volumes:
    - ./dags:/opt/airflow/dags
    - ./airflow-data/logs:/opt/airflow/logs
    - ./airflow-data/plugins:/opt/airflow/plugins
  depends_on:
    redis:
      condition: service_healthy
    postgres:
      condition: service_healthy

services:
  postgres:
    image: postgres:9.5
    environment:
      POSTGRES_USER: airflow
      POSTGRES_PASSWORD: airflow
      POSTGRES_DB: airflow
    volumes:
      - ./airflow-data/postgres-db-volume:/var/lib/postgresql/data
    healthcheck:
      test: ["CMD", "pg_isready", "-U", "airflow"]
      interval: 30s
      retries: 5
    restart: always
  redis:
    image: redis:latest
    ports:
      - 6379:6379
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      interval: 5s
      timeout: 30s
      retries: 50
    restart: always
  airflow-webserver:
    << : *airflow-common
    command: webserver
    ports:
      - 8080:8080
    healthcheck:
      test: ["CMD", "curl", "--fail", "http://localhost:8080/health"]
      interval: 10s
      timeout: 10s
      retries: 5
    restart: always
  airflow-scheduler:
    << : *airflow-common
    command: scheduler
    restart: always
  airflow-worker:
    << : *airflow-common
    command: celery worker
    restart: always
  airflow-init:
    << : *airflow-common
    entrypoint: /bin/bash
    command:
      - -c
      - airflow users list || ( airflow db init && airflow users create --role Admin --username airflow --password airflow --email airflow@airflow.com --firstname airflow --lastname airflow )
    restart: on-failure
  flower:
    << : *airflow-common
    command: celery flower
    ports:
      - 5555:5555
    healthcheck:
      test: ["CMD", "curl", "--fail", "http://localhost:5555/"]
      interval: 10s
      timeout: 10s
      retries: 5
    restart: always
Postgres - Redis - Airflow 1.10.14

version: '3'
x-airflow-common:
  &airflow-common
  image: apache/airflow:1.10.14
  environment:
    - AIRFLOW__CORE__EXECUTOR=CeleryExecutor
    - AIRFLOW__CORE__SQL_ALCHEMY_CONN=postgresql+psycopg2://airflow:airflow@postgres/airflow
    - AIRFLOW__CELERY__RESULT_BACKEND=db+postgresql://airflow:airflow@postgres/airflow
    - AIRFLOW__CELERY__BROKER_URL=redis://:@redis:6379/0
    - AIRFLOW__CORE__FERNET_KEY=FB0o_zt4e3Ziq3LdUUO7F2Z95cvFFx16hU8jTeR1ASM=
    - AIRFLOW__CORE__DAGS_ARE_PAUSED_AT_CREATION=True
  volumes:
    - ./dags:/opt/airflow/dags
    - ./airflow-data/logs:/opt/airflow/logs
    - ./airflow-data/plugins:/opt/airflow/plugins
  depends_on:
    redis:
      condition: service_healthy
    postgres:
      condition: service_healthy

services:
  postgres:
    image: postgres:9.5
    environment:
      POSTGRES_USER: airflow
      POSTGRES_PASSWORD: airflow
      POSTGRES_DB: airflow
    volumes:
      - ./airflow-data/postgres-db-volume:/var/lib/postgresql/data
    healthcheck:
      test: ["CMD", "pg_isready", "-U", "airflow"]
      interval: 30s
      retries: 5
    restart: always
  redis:
    image: redis:latest
    ports:
      - 6379:6379
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      interval: 5s
      timeout: 30s
      retries: 50
    restart: always
  airflow-webserver:
    << : *airflow-common
    command: webserver
    ports:
      - 8080:8080
    healthcheck:
      test: ["CMD", "curl", "--fail", "http://localhost:8080/health"]
      interval: 10s
      timeout: 10s
      retries: 5
    restart: always
  airflow-scheduler:
    << : *airflow-common
    command: scheduler
    restart: always
  airflow-worker:
    << : *airflow-common
    command: worker
    restart: always
  airflow-init:
    << : *airflow-common
    entrypoint: /bin/bash
    command:
      - -c
      - airflow list_users || ( airflow initdb && airflow create_user --role Admin --username airflow --password airflow --email airflow@airflow.com --firstname airflow --lastname airflow )
    restart: on-failure
  flower:
    << : *airflow-common
    command: flower
    ports:
      - 5555:5555
    healthcheck:
      test: ["CMD", "curl", "--fail", "http://localhost:5555/healthcheck"]
      interval: 10s
      timeout: 10s
      retries: 5
    restart: always
Mysql 8.0 - Redis - Airflow 2.0

# Migrations are broken.
Mysql 8.0 - Redis - Airflow 1.10.14

version: '3'
x-airflow-common:
  &airflow-common
  image: apache/airflow:1.10.14
  environment:
    - AIRFLOW__CORE__EXECUTOR=CeleryExecutor
    - AIRFLOW__CORE__SQL_ALCHEMY_CONN=mysql://root:airflow@mysql/airflow?charset=utf8mb4
    - AIRFLOW__CORE__SQL_ENGINE_COLLATION_FOR_IDS=utf8mb3_general_ci
    - AIRFLOW__CELERY__BROKER_URL=redis://:@redis:6379/0
    - AIRFLOW__CORE__FERNET_KEY=FB0o_zt4e3Ziq3LdUUO7F2Z95cvFFx16hU8jTeR1ASM=
    - AIRFLOW__CORE__DAGS_ARE_PAUSED_AT_CREATION=True
  volumes:
    - ./dags:/opt/airflow/dags
    - ./airflow-data/logs:/opt/airflow/logs
    - ./airflow-data/plugins:/opt/airflow/plugins
  depends_on:
    redis:
      condition: service_healthy
    mysql:
      condition: service_healthy

services:
  mysql:
    image: mysql:8.0
    environment:
      - MYSQL_ROOT_PASSWORD=airflow
      - MYSQL_ROOT_HOST=%
      - MYSQL_DATABASE=airflow
    volumes:
      - ./airflow-data/mysql-db-volume:/var/lib/mysql
    ports:
      - "3306:3306"
    command:
      - mysqld
      - --explicit-defaults-for-timestamp
      - --default-authentication-plugin=mysql_native_password
      - --character-set-server=utf8mb4
      - --collation-server=utf8mb4_unicode_ci
    healthcheck:
      test: ["CMD-SHELL", "mysql -h localhost -P 3306 -u root -pairflow -e 'SELECT 1'"]
      interval: 10s
      timeout: 10s
      retries: 5
    restart: always
  redis:
    image: redis:latest
    ports:
      - 6379:6379
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      interval: 5s
      timeout: 30s
      retries: 50
    restart: always
  airflow-webserver:
    << : *airflow-common
    command: webserver
    ports:
      - 8080:8080
    healthcheck:
      test: ["CMD", "curl", "--fail", "http://localhost:8080/health"]
      interval: 10s
      timeout: 10s
      retries: 5
    restart: always
  airflow-scheduler:
    << : *airflow-common
    command: scheduler
    restart: always
  airflow-worker:
    << : *airflow-common
    command: worker
    restart: always
  airflow-init:
    << : *airflow-common
    entrypoint: /bin/bash
    command:
      - -c
      - airflow list_users || ( airflow initdb && airflow create_user --role Admin --username airflow --password airflow --email airflow@airflow.com --firstname airflow --lastname airflow )
    restart: on-failure
  flower:
    << : *airflow-common
    command: flower
    ports:
      - 5555:5555
    healthcheck:
      test: ["CMD", "curl", "--fail", "http://localhost:5555/"]
      interval: 10s
      timeout: 10s
      retries: 5
    restart: always

I added health checks where it was simple. Anyone have an idea for health-checks for airflow-scheduler/airflow-worker? This will improve stability.
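
One possibility for the worker (an assumption on my part, using Airflow 2.0 module paths; not something settled in this thread) is to ping the Celery worker from inside its own container, for example:

  airflow-worker:
    << : *airflow-common
    command: celery worker
    restart: always
    healthcheck:
      # ask the local Celery worker to answer a ping; $$ escapes $ for docker-compose
      test: ["CMD-SHELL", 'celery --app airflow.executors.celery_executor.app inspect ping -d "celery@$$HOSTNAME"']
      interval: 30s
      timeout: 30s
      retries: 5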

Besides, I am planning to prepare a tool that is used to generate docker-compose files using a simple wizard. I am thinking of something similar to the Pytorch project. https://pytorch.org/get-started/locally/

(screenshot: the PyTorch "get started" configuration wizard)
potiuk commented 3 years ago

Besides, I am planning to prepare a tool that is used to generate docker-compose files using a simple wizard. I am thinking of something similar to the Pytorch project.

Very good idea! ❤️

ldacey commented 3 years ago

Has anyone successfully gotten turbodbc installed using pip? I have had to install miniconda and use conda-forge to get turbodbc + pyarrow working correctly. This adds a little complication to my Dockerfile, although I do kind of like the conda-env.yml file approach.

@mik-laj wow, I knew I could use common environment variables but I had no idea you could also do the volumes and images, that is super clean. Any reason why you have the scheduler restart every 30 seconds like that?

ldealmei commented 3 years ago

Thank you all for the docker-compose files :) I'm sharing mine as it addresses some aspects that I couldn't find in this thread and that took me some time to get working. These are:

@mik-laj I also have a working healthcheck on the scheduler. Not the most expressive but works.

This configuration relies on an existing and initialized database.

External database - LocalExecutor - Airflow 2.0.0 - Traefik - Dags mostly based on DockerOperator.

version: "3.7"
x-airflow-environment: &airflow-environment
  AIRFLOW__CORE__EXECUTOR: LocalExecutor
  AIRFLOW__CORE__LOAD_EXAMPLES: "False"
  AIRFLOW__CORE__LOAD_DEFAULT_CONNECTIONS: "False"
  AIRFLOW__CORE__SQL_ALCHEMY_CONN: ${DB_CONNECTION_STRING}
  AIRFLOW__CORE__FERNET_KEY: ${ENCRYPTION_KEY}
  AIRFLOW__CORE__DAGS_FOLDER: /opt/airflow/sync/git/dags
  AIRFLOW__CORE__ENABLE_XCOM_PICKLING: "True"  # because of https://github.com/apache/airflow/issues/13487
  AIRFLOW__WEBSERVER__BASE_URL: https://airflow.example.com
  AIRFLOW__WEBSERVER__ENABLE_PROXY_FIX: "True"
  AIRFLOW__WEBSERVER__RBAC: "True"

services:
  traefik:
    image: traefik:v2.4
    container_name: traefik
    command:
      - --ping=true
      - --providers.docker=true
      - --providers.docker.exposedbydefault=false
      - --entrypoints.web.address=:80
      - --entrypoints.websecure.address=:443
      # HTTP -> HTTPS redirect
      - --entrypoints.web.http.redirections.entrypoint.to=websecure
      - --entrypoints.web.http.redirections.entrypoint.scheme=https
      # TLS config
      - --certificatesresolvers.myresolver.acme.dnschallenge=true
      - --certificatesresolvers.myresolver.acme.storage=/letsencrypt/acme.json
      ## Comment following line for a production deployment
      - --certificatesresolvers.myresolver.acme.caserver=https://acme-staging-v02.api.letsencrypt.org/directory
      ## See https://doc.traefik.io/traefik/https/acme/#providers for other providers
      - --certificatesresolvers.myresolver.acme.dnschallenge.provider=digitalocean
      - --certificatesresolvers.myresolver.acme.email=user@example.com
    ports:
      - 80:80
      - 443:443
    environment:
      # See https://doc.traefik.io/traefik/https/acme/#providers for other providers
      DO_AUTH_TOKEN:
    restart: always
    healthcheck:
      test: ["CMD", "traefik", "healthcheck", "--ping"]
      interval: 10s
      timeout: 10s
      retries: 5
    volumes:
      - certs:/letsencrypt
      - /var/run/docker.sock:/var/run/docker.sock:ro

  # Required because of DockerOperator. For secure access and handling permissions.
  docker-socket-proxy:
    image: tecnativa/docker-socket-proxy:0.1.1
    environment:
      CONTAINERS: 1
      IMAGES: 1
      AUTH: 1
      POST: 1
    privileged: true
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock:ro
    restart: always

  # Allows to deploy Dags on pushes to master
  git-sync:
    image: k8s.gcr.io/git-sync/git-sync:v3.2.2
    container_name: dags-sync
    environment:
      GIT_SYNC_USERNAME:
      GIT_SYNC_PASSWORD:
      GIT_SYNC_REPO: https://example.com/my/repo.git
      GIT_SYNC_DEST: dags
      GIT_SYNC_BRANCH: master
      GIT_SYNC_WAIT: 60
    volumes:
      - dags:/tmp:rw
    restart: always

  webserver:
    image: apache/airflow:2.0.0
    container_name: airflow_webserver
    environment:
      <<: *airflow-environment
    command: webserver
    healthcheck:
      test: ["CMD", "curl", "--fail", "http://localhost:8080/health"]
      interval: 10s
      timeout: 10s
      retries: 5
    restart: always
    volumes:
      - dags:/opt/airflow/sync
      - logs:/opt/airflow/logs
    depends_on:
      - git-sync
      - traefik
    labels:
      - traefik.enable=true
      - traefik.http.routers.webserver.rule=Host(`airflow.example.com`)
      - traefik.http.routers.webserver.entrypoints=websecure
      - traefik.http.routers.webserver.tls.certresolver=myresolver
      - traefik.http.services.webserver.loadbalancer.server.port=8080

  scheduler:
    image: apache/airflow:2.0.0
    container_name: airflow_scheduler
    environment:
      <<: *airflow-environment
    command: scheduler
    restart: always
    healthcheck:
      test: ["CMD-SHELL", 'curl --silent http://airflow_webserver:8080/health | grep -A 1 scheduler | grep \"healthy\"']
      interval: 10s
      timeout: 10s
      retries: 5
    volumes:
      - dags:/opt/airflow/sync
      - logs:/opt/airflow/logs
    depends_on:
      - git-sync
      - webserver

volumes:
  dags:
  logs:
  certs:

I have an extra container (not shown) to handle rotating the logs that are written directly to files. It is based on logrotate. I am not sharing it here because it is a custom image and is beyond the scope of this thread, but if anybody is interested, message me.

Hope it helps!

mik-laj commented 3 years ago

I added some improvements to the docker-compose file to make it more stable. https://github.com/apache/airflow/pull/14519 https://github.com/apache/airflow/pull/14522 Now we have health-checks for all components.

kaxil commented 3 years ago

@mik-laj Can we close this one since we already added the docker-compose files?

potiuk commented 3 years ago

@kaxil -> I believe so. I do not think "production-ready" docker-compose is even a thing :)