asandeep / airflow-ecr-plugin

Airflow AWS ECR integration
Apache License 2.0

Airflow AWS ECR Plugin

This plugin exposes an operator that refreshes the ECR login token at regular intervals.

About

Amazon ECR is an AWS-managed Docker registry for hosting private Docker container images. Access to Docker repositories hosted on ECR can be controlled with resource-based permissions using AWS IAM.

To push or pull images, the Docker client must authenticate to the ECR registry as an AWS user. An authorization token can be generated with the AWS CLI get-login-password command and passed to the docker login command to authenticate to the registry. For instructions on setting up ECR and obtaining a login token to authenticate the Docker client, see the Amazon ECR documentation.
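As a rough illustration of what such a token contains, the sketch below decodes an ECR authorization token in Python. It assumes the token comes from the ECR GetAuthorizationToken API (get_authorization_token in boto3), which returns the base64 encoding of "<username>:<password>". The helper names decode_ecr_token and fetch_ecr_credentials are hypothetical and are not part of this plugin.

```python
import base64


def decode_ecr_token(authorization_token):
    """Split a base64-encoded ECR token into (username, password)."""
    decoded = base64.b64decode(authorization_token).decode("utf-8")
    # The decoded value has the form "<username>:<password>"; split only once
    # so that a colon inside the password is preserved.
    username, password = decoded.split(":", 1)
    return username, password


def fetch_ecr_credentials(region):
    """Request a token for one region (assumes boto3 and AWS credentials)."""
    import boto3  # imported lazily so decode_ecr_token works without boto3

    client = boto3.client("ecr", region_name=region)
    auth = client.get_authorization_token()["authorizationData"][0]
    username, password = decode_ecr_token(auth["authorizationToken"])
    # proxyEndpoint is the registry URL to pass to `docker login`.
    return username, password, auth["proxyEndpoint"]
```

The decoded username and password are the values that docker login expects for the registry.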

The authorization token obtained with the get-login-password command is valid for only 12 hours, so the Docker client must re-authenticate with a fresh token at least every 12 hours to keep access to images hosted on ECR. Moreover, ECR registries are region-specific, and a separate token must be obtained to authenticate to each registry.
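The 12-hour lifetime suggests refreshing some time before the token expires. A minimal sketch of that check follows, assuming the time at which the token was issued is known. The function token_needs_refresh and the one-hour safety margin are illustrative choices, not part of the plugin.

```python
from datetime import datetime, timedelta

TOKEN_LIFETIME = timedelta(hours=12)  # validity period described above


def token_needs_refresh(issued_at, now, safety_margin=timedelta(hours=1)):
    """Return True once the token is within safety_margin of expiring."""
    expires_at = issued_at + TOKEN_LIFETIME
    return now >= expires_at - safety_margin


issued = datetime(2020, 1, 1, 0, 0)
print(token_needs_refresh(issued, datetime(2020, 1, 1, 6, 0)))    # False
print(token_needs_refresh(issued, datetime(2020, 1, 1, 11, 30)))  # True
```

Because registries are region-specific, a check like this would apply separately to the token of each region.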

The whole process can be quite cumbersome when combined with Apache Airflow. Airflow's DockerOperator accepts a docker_conn_id parameter that it uses to authenticate and pull images from private repositories. If the private registry is ECR, a connection can be created with the login token obtained from the get-login-password command, and the corresponding ID can be passed to DockerOperator. However, since the token is valid for only 12 hours, DockerOperator will fail to fetch images from ECR once the token has expired.

This plugin implements RefreshEcrDockerConnectionOperator, an Airflow operator that automatically updates the ECR login token at regular intervals.

Installation

PyPI

pip install airflow-ecr-plugin

Poetry

poetry add airflow-ecr-plugin@latest

Getting Started

Once installed, the plugin can be loaded via the setuptools entry point mechanism.

Update your package's setup.py as below:

from setuptools import setup

setup(
    name="my-package",
    ...
    entry_points = {
        'airflow.plugins': [
            'aws_ecr = airflow_ecr_plugin:AwsEcrPlugin'
        ]
    }
)

If you are using Poetry, the plugin can be loaded by adding it under the [tool.poetry.plugins."airflow.plugins"] section, as below:

[tool.poetry.plugins."airflow.plugins"]
"aws_ecr" = "airflow_ecr_plugin:AwsEcrPlugin"
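To check that the entry point was registered after the package is installed, the installed entry points can be inspected with the standard library. This is a sketch that assumes Python 3.8 or later; list_entry_points is a hypothetical helper and not part of the plugin or of Airflow.

```python
from importlib.metadata import entry_points


def list_entry_points(group="airflow.plugins"):
    """Return sorted (name, target) pairs registered under an entry point group."""
    eps = entry_points()
    if hasattr(eps, "select"):
        # Python 3.10+: the result supports select(group=...).
        selected = eps.select(group=group)
    else:
        # Python 3.8/3.9: the result is a plain dict keyed by group name.
        selected = eps.get(group, [])
    return sorted((ep.name, ep.value) for ep in selected)


for name, target in list_entry_points():
    print(name, "->", target)
```

If the registration worked, the output should include an aws_ecr entry pointing at airflow_ecr_plugin:AwsEcrPlugin.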

Once the plugin is loaded, it is available for import in Python modules.

Now create a DAG to refresh the ECR tokens:

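from datetime import timedelta

import airflow
# Import the submodule explicitly so that airflow.utils.dates.days_ago below
# does not rely on it being loaded as a side effect of "import airflow".
import airflow.utils.dates
from airflow.operators import aws_ecr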

DEFAULT_ARGS = {
    "depends_on_past": False,
    "retries": 0,
    "owner": "airflow",
}

REFRESH_ECR_TOKEN_DAG = airflow.DAG(
    dag_id="Refresh_ECR_Login_Token",
    description=(
        "Fetches the latest token from ECR and updates the docker "
        "connection info."
    ),
    default_args=DEFAULT_ARGS,
    schedule_interval=<token_refresh_interval>,
    # Set start_date to past date to make sure airflow picks up the tasks for
    # execution.
    start_date=airflow.utils.dates.days_ago(2),
    catchup=False,
)

# Add below operator for each ECR connection to be refreshed.
aws_ecr.RefreshEcrDockerConnectionOperator(
    task_id=<task_id>,
    ecr_docker_conn_id=<docker_conn_id>,
    ecr_region=<ecr_region>,
    aws_conn_id=<aws_conn_id>,
    dag=REFRESH_ECR_TOKEN_DAG,
)

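The placeholder parameters in the above code snippet are:

<token_refresh_interval>: how often the DAG runs. It should be shorter than the 12-hour token lifetime.
<task_id>: ID of the refresh task within the DAG.
<docker_conn_id>: ID of the Airflow Docker connection to update. The same ID is passed to DockerOperator as docker_conn_id.
<ecr_region>: AWS region of the ECR registry.
<aws_conn_id>: ID of the Airflow connection holding the AWS credentials used to obtain the token.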

Known Issues

If you are running Airflow v1.10.7 or earlier, the operator will fail due to AIRFLOW-3014.

The workaround is to increase the length of the password column in the Airflow connection table to 5000 characters.

Acknowledgements

The operator is inspired by Brian Campbell's post on Using Airflow's Docker operator with ECR.