MWAA Disaster Recovery



Introduction

Amazon Managed Workflows for Apache Airflow (MWAA) is a managed orchestration service for Apache Airflow. An MWAA deployment comes with meaningful defaults, such as multi-availability-zone (AZ) deployment of Airflow schedulers and auto-scaling of Airflow workers across multiple AZs, all of which help minimize the impact of an AZ failure. However, a regional large scale event (LSE) can still adversely affect the business continuity of critical workflows running on an MWAA environment. To minimize the impact of LSEs, a multi-region architecture is needed that automatically detects service disruption in the primary region and automates cut-over to the secondary region. This project offers an automated solution for two key disaster recovery strategies for MWAA: Backup and Restore and Warm Standby. Let's review the solution architectures and dive deep into the two strategies next.

This solution is a part of an AWS blog series on MWAA Disaster Recovery. Please review both Part 1 and Part 2 of the blog series before diving into the details of the solution.

[!NOTE] The project currently supports the following versions of MWAA:

  • 2.8.1
  • 2.7.2
  • 2.6.3
  • 2.5.1
  • 2.4.3

Architecture

In this section, we discuss two highly resilient, multi-region deployment architectures for MWAA. These architectures can achieve recovery time and recovery point objectives (RTO/RPO) ranging from minutes (Warm Standby) to about an hour (Backup and Restore), depending on the volume of historical data to be backed up and restored. Let's discuss the two strategies in detail next.

Backup and Restore

The general idea behind the backup and restore approach is to have the MWAA environment running in the primary region periodically back up its metadata to an S3 bucket in that region, sync the metadata to the secondary region's S3 bucket, and eventually use the backed-up metadata to recreate an identical environment in the secondary region when the primary region fails. This approach can afford an RTO of 30+ minutes depending on the size of the metadata to be restored. We assume that you start with a running MWAA environment and its associated S3 bucket for hosting DAGs. There are two key workflows to consider in this architecture, as shown in the diagram below:

Backup and Restore

BR Flow 1: Periodic Backup

To recreate a new environment in the secondary region when the primary environment fails, you have to maintain a backup of the primary metadata store. Flow 1 involves an Airflow DAG that takes a backup of the metadata tables and stores it in an S3 bucket, which is used to restore the MWAA state in the secondary region when needed.
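
To illustrate the shape of such a backup DAG, here is a minimal sketch, not the project's actual DAG: it exports a single metadata table (variable) to CSV and writes it to a backup bucket. The bucket name and object key are hypothetical, and the project's real DAG covers many more tables.

import csv
import io

import boto3
import pendulum
from airflow import settings
from airflow.decorators import dag, task
from sqlalchemy import text


@dag(schedule="0 * * * *", start_date=pendulum.datetime(2024, 1, 1), catchup=False)
def backup_metadata_sketch():
    @task
    def export_variable_table():
        # Dump one metadata table to CSV in memory.
        session = settings.Session()
        rows = session.execute(text("SELECT key, val FROM variable")).fetchall()
        buffer = io.StringIO()
        csv.writer(buffer).writerows(rows)
        # Push the dump to the backup bucket (hypothetical bucket/key).
        boto3.client("s3").put_object(
            Bucket="mwaa-backup-bucket",
            Key="data/variable.csv",
            Body=buffer.getvalue(),
        )

    export_variable_table()


backup_metadata_sketch()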

BR Flow 2: Recovery

BR Flow 1 backs up the state of the primary MWAA environment. Flow 2 detects a failure in the primary environment and triggers the recovery of the MWAA environment in the secondary region. The recovery involves creating a new MWAA environment from the stored configuration of the primary environment (as a part of Flow 2) and eventually rehydrating the new environment with the metadata backed up from the primary environment (Flow 1).
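
For intuition about the recreation step, here is a hedged boto3 sketch that reads a stored environment configuration from the secondary backup bucket and creates a new environment from it. The bucket name, configuration keys, and resource names are illustrative, patterned on this README's examples, not the solution's exact schema.

import json

import boto3

s3 = boto3.client("s3", region_name="us-east-2")
mwaa = boto3.client("mwaa", region_name="us-east-2")

# environment.json mirrors MWAA_BACKUP_FILE_NAME; the bucket name is hypothetical.
obj = s3.get_object(Bucket="mwaa-secondary-backup", Key="environment.json")
config = json.loads(obj["Body"].read())

# The config keys below are assumptions about the stored file's schema.
mwaa.create_environment(
    Name="mwaa-2-5-1-secondary",
    AirflowVersion=config["AirflowVersion"],
    SourceBucketArn="arn:aws:s3:::mwaa-2-5-1-secondary-source",
    DagS3Path=config["DagS3Path"],
    ExecutionRoleArn="arn:aws:iam::123456789101:role/service-role/mwaa-2-5-1-secondary-role",
    NetworkConfiguration={
        "SecurityGroupIds": ["sg-111222333444"],
        "SubnetIds": ["subnet-2222222", "subnet-3333333"],
    },
)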

Warm Standby

In the warm standby approach, we start with two identical MWAA environments, one in the primary and the other in the secondary region. The metadata in the primary region is backed up to an S3 bucket with cross-region replication to a secondary region bucket. If the primary MWAA environment fails, the backed-up metadata is restored into the secondary MWAA environment to restart the DAG workflows in the secondary region. Since the MWAA environment is already created (warm) in the secondary region, this approach can achieve a recovery time objective of 5+ minutes depending on the amount of metadata to be restored. There are two key workflows in this architecture, as shown in the diagram below:

Warm Standby

WS Flow 1: Periodic Backup

To restore the primary MWAA environment in the secondary region, you have to maintain a backup of the primary metadata store. Flow 1 involves an Airflow DAG that takes a backup of the metadata tables and stores it in an S3 bucket.

WS Flow 2: Recovery

As discussed in the previous sections, WS Flow 1 backs up the metadata of the primary MWAA environment. Flow 2, on the other hand, detects a failure in the primary environment and triggers the recovery of the MWAA environment in the secondary region. The recovery involves rehydrating the standby secondary environment with the metadata backed up from the primary environment.
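
The gist of the failure detection in both strategies can be sketched with boto3 as below. Note that the solution's actual health check is richer (it follows a Check Heartbeat state, as mentioned in the FAQ), so this simple status poll is only an approximation; the environment name is reused from this README's examples.

import boto3


def primary_is_healthy() -> bool:
    """Return True while the primary MWAA environment reports AVAILABLE."""
    mwaa = boto3.client("mwaa", region_name="us-east-1")
    environment = mwaa.get_environment(Name="mwaa-2-5-1-primary")["Environment"]
    return environment["Status"] == "AVAILABLE"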

Solution

The lib folder hosts the deployment code for the project. The project performs a multi-region deployment of two stacks: one in the primary region and one in the secondary region.

Prerequisites

Software Requirements

  • Python
  • NodeJS >= v14
  • AWS CDK v2
  • Docker (latest)

AWS Resources Needed Pre-Deployment

You will need an MWAA execution role for each environment (see the deployment tutorials below). Also add the following trust policy to the role:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Service": [
                    "airflow.amazonaws.com",
                    "airflow-env.amazonaws.com"
                ]
            },
            "Action": "sts:AssumeRole"
        }
    ]
}
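
If you prefer to script the role creation, here is a hedged boto3 sketch; the role name and path are illustrative, and the permission policies you attach afterwards depend on your environment's needs.

import json

import boto3

trust_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Service": ["airflow.amazonaws.com", "airflow-env.amazonaws.com"]
            },
            "Action": "sts:AssumeRole",
        }
    ],
}

boto3.client("iam").create_role(
    RoleName="mwaa-2-5-1-primary-role",  # illustrative name
    Path="/service-role/",
    AssumeRolePolicyDocument=json.dumps(trust_policy),
)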

Stack Parameters

The parameters for the solution are externalized as environment variables. You can specify these parameters as environment variables in your CI/CD pipeline, or create a .env file with the appropriate keys and values at the root of this project for a deployment from your machine. You can find more details in the implementation sections BR-3: Setup Environment Variables and WS-3: Setup Environment Variables. Let's review the required parameters first, followed by the optional ones.

Required Parameters

Here are the required parameters that apply to both the primary and secondary region stacks:

| Variable Name | Example Values | Description |
| --- | --- | --- |
| AWS_ACCOUNT_ID | 111222333444 | Your AWS account id. |
| DR_TYPE | BACKUP_RESTORE, WARM_STANDBY | The disaster recovery strategy to be deployed. |
| MWAA_UPDATE_EXECUTION_ROLE | YES or NO | Flag to denote whether to update the existing MWAA execution role with new policies that allow task token return calls from the Step Functions workflow in the secondary stack. See Automated Updates to the Execution Role for details. |
| MWAA_VERSION | 2.4.3, 2.5.1, 2.6.3, 2.7.2, 2.8.1 | The deployed version of MWAA. |
| PRIMARY_DAGS_BUCKET_NAME | mwaa-2-5-1-primary-bucket | The name of the DAGs S3 bucket used by the environment in the primary region. |
| PRIMARY_MWAA_ENVIRONMENT_NAME | mwaa-2-5-1-primary | The name of the MWAA environment in the primary region. |
| PRIMARY_MWAA_ROLE_ARN | arn:aws:...:role/service-role/primary-role | The ARN of the execution role used by the primary MWAA environment. |
| PRIMARY_REGION | us-east-1, us-east-2, ... | The primary AWS region. |
| PRIMARY_SECURITY_GROUP_IDS | '["sg-0123456789"]' | The IDs of the security groups used by the primary MWAA environment. Note that the brackets, [], are necessary to denote a list, even for a single-element list. |
| PRIMARY_SUBNET_IDS | '["subnet-1234567", "subnet-987654321"]' | The IDs of the VPC subnets where the primary MWAA environment is deployed. Note that the brackets, [], are necessary to denote a list, even for a single-element list. |
| PRIMARY_VPC_ID | vpc-012ab34c56d789101 | The ID of the VPC where the primary MWAA environment is deployed. |
| SECONDARY_CREATE_SFN_VPCE | YES or NO | Flag to denote whether to create a VPC endpoint for Step Functions. The VPCE is particularly important for MWAA running in private mode, where workers may not have internet access to send task token responses to the Step Functions workflow orchestrating the restore. If NO is chosen, you will need to manually create the VPC endpoint. Enabling this flag may modify your VPC's security group. See the Automated Update to the VPC Security Group for details. |
| SECONDARY_DAGS_BUCKET_NAME | mwaa-2-5-1-secondary-bucket | The name of the DAGs S3 bucket used by the environment in the secondary region. |
| SECONDARY_MWAA_ENVIRONMENT_NAME | mwaa-2-5-1-secondary | The name of the MWAA environment in the secondary region. |
| SECONDARY_MWAA_ROLE_ARN | arn:aws:...:role/service-role/secondary-role | The ARN of the execution role used by the secondary MWAA environment. |
| SECONDARY_REGION | us-west-1, us-west-2, ... | The secondary AWS region for disaster recovery. |
| SECONDARY_SECURITY_GROUP_IDS | '["sg-0123456789"]' | The IDs of the security groups used by the secondary MWAA environment. Note that the brackets, [], are necessary to denote a list, even for a single-element list. |
| SECONDARY_SUBNET_IDS | '["subnet-1234567", "subnet-987654321"]' | The IDs of the VPC subnets in the secondary region where the MWAA environment is deployed. Note that the brackets, [], are necessary to denote a list, even for a single-element list. |
| SECONDARY_VPC_ID | vpc-012ab34c56d789101 | The ID of the VPC where the secondary MWAA environment is deployed. |
| STACK_NAME_PREFIX | mwaa-2-5-1-data-team | A name prefix for the deployment stacks, used for the primary and secondary stacks as well as their resources. |

Optional Parameters

Here are the optional parameters that apply to both the primary and secondary region stacks:

| Variable Name | Default Value | Example Values | Description |
| --- | --- | --- | --- |
| DR_CONNECTION_RESTORE_STRATEGY | APPEND | DO_NOTHING, APPEND, or REPLACE | The strategy used to restore the connection table during the recovery workflow. Review Special Handling of Variable and Connection Tables for details. |
| DR_VARIABLE_RESTORE_STRATEGY | APPEND | DO_NOTHING, APPEND, or REPLACE | The strategy used to restore the variable table during the recovery workflow. Review Special Handling of Variable and Connection Tables for details. |
| HEALTH_CHECK_ENABLED | YES | YES or NO | Whether to enable a periodic health check of the primary MWAA environment from the secondary region. If set to NO, a primary region failure will go undetected and the onus is on admins to manually trigger the recovery workflow. |
| HEALTH_CHECK_INTERVAL_MINS | 5 | time interval in minutes | Health check frequency for the primary MWAA environment, in minutes. |
| HEALTH_CHECK_MAX_RETRY | 2 | number | The maximum number of retries after the health check of the primary region MWAA fails before moving on to the disaster recovery flow. |
| HEALTH_CHECK_RETRY_BACKOFF_RATE | 2 | number | Health check retry exponential backoff rate (exponential backoff common ratio). |
| HEALTH_CHECK_RETRY_INTERVAL_SECS | 5 | time interval in seconds | Health check retry interval (exponential backoff coefficient) on failure. |
| METADATA_CLEANUP_DAG_NAME | cleanup_metadata | a dag name | Name of the DAG that cleans up the metadata store. |
| METADATA_EXPORT_DAG_NAME | backup_metadata | a dag name | Name of the DAG that exports metadata. |
| METADATA_IMPORT_DAG_NAME | restore_metadata | a dag name | Name of the DAG that imports metadata. |
| MWAA_BACKUP_FILE_NAME | environment.json | a json file name | Name of the JSON file used for storing environment details in the backup S3 bucket. |
| MWAA_CREATE_ENV_POLLING_INTERVAL_SECS | 60 | interval in seconds | Wait time between status checks of the MWAA environment in the polling loop during creation. |
| MWAA_DAGS_S3_PATH | dags | path/to/dags | Path to the folder in the DAGs S3 bucket where DAGs are deployed. |
| MWAA_NOTIFICATION_EMAILS | [] | '["ad@eg.com"]', '["ad@eg.com", "ops@eg.com"]' | List of emails for notifications. Note that the brackets, [], are necessary to denote a list, even for a single-element list. |
| MWAA_SIMULATE_DR | NO | YES or NO | Whether to simulate a DR event by artificially forcing a health check failure for the MWAA environment in the primary region. Use for testing only. |
| PRIMARY_BACKUP_SCHEDULE | '0 * * * *' | @hourly, @daily, or any cron expression | Cron schedule for taking backups of the metadata store. |
| PRIMARY_REPLICATION_POLLING_INTERVAL_SECS | 30 | wait time in seconds | The polling interval in seconds for checking the status of the one-time replication job during primary stack deployment. |
| SECONDARY_CLEANUP_COOL_OFF_SECS | 30 | wait time in seconds | The cool-off time in seconds between the metadata store cleanup operation and the restore operation in the recovery workflow. |
| STATE_MACHINE_TIMEOUT_MINS | 60 | timeout in minutes | The restore Step Functions workflow timeout in minutes. |
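
As a quick sanity check of how the health-check retry parameters combine: each retry waits the base interval multiplied by the backoff rate raised to the attempt number. With the defaults above:

# HEALTH_CHECK_RETRY_INTERVAL_SECS=5, HEALTH_CHECK_RETRY_BACKOFF_RATE=2,
# HEALTH_CHECK_MAX_RETRY=2
interval_secs, backoff_rate, max_retry = 5, 2, 2
waits = [interval_secs * backoff_rate**attempt for attempt in range(max_retry)]
print(waits)  # [5, 10]: retry after 5s, then 10s, before the recovery flow starts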

Automated Updates to the Execution Role

Note that the secondary region stack will add an additional policy statement to the MWAA execution role for the secondary region if the configuration parameter MWAA_UPDATE_EXECUTION_ROLE is set to YES. If you intend to set this parameter to NO, then please add the following policy entry to the secondary MWAA execution role:

{
    "Effect": "Allow",
    "Action": [
        "states:SendTaskFailure",
        "states:SendTaskHeartbeat",
        "states:SendTaskSuccess"
    ],
    "Resource": ["arn:aws:states:*:<account>:stateMachine:*"]
}
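
If you manage the role outside the stack, here is a hedged boto3 sketch of attaching the statement above as an inline policy; the role and policy names are illustrative, and you should substitute your own account id.

import json

import boto3

policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "states:SendTaskFailure",
                "states:SendTaskHeartbeat",
                "states:SendTaskSuccess",
            ],
            "Resource": ["arn:aws:states:*:123456789101:stateMachine:*"],
        }
    ],
}

boto3.client("iam").put_role_policy(
    RoleName="mwaa-2-5-1-secondary-role",  # illustrative name
    PolicyName="mwaa-dr-task-token",       # illustrative name
    PolicyDocument=json.dumps(policy),
)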

Automated Update to the VPC Security Group

Note that if you supplied a VPC security group for your MWAA environment and the security group does not allow inbound HTTPS traffic (port 443) originating from within the VPC CIDR range, then the stack will add a new rule to the security group to allow it. The HTTPS traffic is required for the Step Functions interface endpoint, which makes Step Functions accessible from your private network through AWS PrivateLink.
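
The equivalent manual change, sketched with boto3 (the security group ID and VPC CIDR are illustrative):

import boto3

boto3.client("ec2", region_name="us-east-2").authorize_security_group_ingress(
    GroupId="sg-111222333444",  # illustrative security group
    IpPermissions=[
        {
            "IpProtocol": "tcp",
            "FromPort": 443,
            "ToPort": 443,
            "IpRanges": [
                {"CidrIp": "10.0.0.0/16", "Description": "HTTPS from within the VPC"}
            ],
        }
    ],
)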

Step-By-Step Deployment Guide

The project uses the Cloud Development Kit (CDK) and is set up like a standard Python project. Assuming that you have AWS credentials for deploying the project set up in your command shell, follow these steps to build and deploy the solution to your AWS account.

Bootstrap Your AWS Account

If your account has not been set up to use CDK yet, you will need to perform a one-time CDK bootstrapping for both the primary and secondary regions using the command cdk bootstrap aws://<account>/<primary-region> aws://<account>/<secondary-region>. Here's an example:

cdk bootstrap aws://123456789999/us-east-1 aws://123456789999/us-east-2

Clone the Project

Let's clone the project to your local machine as follows:

git clone https://github.com/aws-samples/mwaa-disaster-recovery.git
cd mwaa-disaster-recovery

This deployment guide walks through first deploying the stack in backup and restore mode followed by warm standby.

Backup and Restore Tutorial

BR-1: Create Necessary AWS Resources

If you don't already have an MWAA environment, use the quickstart guide or follow these steps to create a new MWAA environment:

  1. Create an S3 bucket with versioning enabled on the AWS console in your primary region; let's call it mwaa-2-5-1-primary-source (you will probably need to specify a different name, as S3 bucket names must be globally unique).
  2. Assuming you will name your primary MWAA environment mwaa-2-5-1-primary, create an IAM role as documented in the AWS Resources prerequisites section.
  3. Create an MWAA environment on the AWS console using the S3 bucket and execution role that you created in steps 1 and 2. Choose the default VPC, subnets, and security group.
  4. Similarly, create another S3 bucket (with versioning enabled) and IAM role in your secondary region.

BR-2: Setup Local Virtual Environment

You will need a virtualenv created within the project, stored under the .venv directory. Creating the virtualenv assumes that there is a python3 executable in your path with access to the venv package. Create your virtualenv as follows:

python3 -m venv .venv

Next, you will need to activate your virtual environment.

MacOS / Linux:

source .venv/bin/activate

Windows:

.venv\Scripts\activate.bat

Once the virtualenv is activated, you will need to install the required dependencies:

pip install -r requirements.txt
pip install -r requirements-dev.txt

BR-3: Setup Environment Variables

Create a .env file at the root of the project by copying the following contents and making appropriate changes. The configuration parameters are explained in the stack parameters section.

STACK_NAME_PREFIX=mwaa-2-5-1
AWS_ACCOUNT_ID=123456789101
DR_TYPE=BACKUP_RESTORE

MWAA_VERSION=2.5.1
MWAA_UPDATE_EXECUTION_ROLE=YES

PRIMARY_REGION=us-east-1
PRIMARY_MWAA_ENVIRONMENT_NAME=mwaa-2-5-1-primary
PRIMARY_MWAA_ROLE_ARN=arn:aws:iam::123456789101:role/service-role/mwaa-2-5-1-primary-role
PRIMARY_DAGS_BUCKET_NAME=mwaa-2-5-1-primary-source
PRIMARY_VPC_ID=vpc-012ab34c56d789101
PRIMARY_SUBNET_IDS='["subnet-1234567", "subnet-987654321"]'
PRIMARY_SECURITY_GROUP_IDS='["sg-0123456789"]'

SECONDARY_REGION=us-east-2
SECONDARY_MWAA_ENVIRONMENT_NAME=mwaa-2-5-1-secondary
SECONDARY_MWAA_ROLE_ARN=arn:aws:iam::123456789101:role/service-role/mwaa-2-5-1-secondary-role
SECONDARY_DAGS_BUCKET_NAME=mwaa-2-5-1-secondary-source
SECONDARY_VPC_ID=vpc-1111222233334444
SECONDARY_SUBNET_IDS='["subnet-2222222", "subnet-3333333"]'
SECONDARY_SECURITY_GROUP_IDS='["sg-111222333444"]'
SECONDARY_CREATE_SFN_VPCE=YES
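
For a local deployment, the values above must be visible to the CDK app as environment variables. A loader such as python-dotenv achieves this; whether the project uses this exact package internally is an assumption, but the sketch below shows the intended effect:

import os

from dotenv import load_dotenv  # assumes the python-dotenv package is installed

load_dotenv()  # reads .env at the project root into os.environ
print(os.environ["PRIMARY_REGION"], os.environ["DR_TYPE"])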

BR-4: Build the Project

At this point, you can synthesize the CloudFormation template for this code:

cdk synth

You can also see what stacks and resources get created by typing:

cdk diff

BR-5: Deploy the Solution

Now you are ready to deploy the stacks. The following command deploys both the primary and secondary region stacks:

cdk deploy --all

BR-6: Explore the Airflow UI

From the MWAA console, explore the Airflow UI; it should have the DAGs deployed by the solution (such as backup_metadata) available.

Feel free to upload additional DAGs and play around to generate some metadata for the backup and restore process. Here is a sample DAG that you can upload to the dags folder of your DAGs S3 bucket.

BR-7: Manual Backup

While backups are taken automatically based on the supplied schedule, you can manually trigger the backup_metadata DAG to force-generate a backup in the corresponding backup (not DAGs) S3 bucket. Explore the data folder in the backup S3 bucket to review the CSV dump generated by the backup DAG. The stack automatically replicates both the DAGs and backup S3 buckets to the secondary region.
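
If you want to trigger the backup without opening the Airflow UI, MWAA's documented CLI-token endpoint can be called from Python. This sketch assumes the requests package is installed and reuses the primary environment name from this tutorial:

import boto3
import requests

mwaa = boto3.client("mwaa", region_name="us-east-1")
token = mwaa.create_cli_token(Name="mwaa-2-5-1-primary")

# POST an Airflow CLI command to the MWAA CLI endpoint.
response = requests.post(
    f"https://{token['WebServerHostname']}/aws_mwaa/cli",
    headers={
        "Authorization": f"Bearer {token['CliToken']}",
        "Content-Type": "text/plain",
    },
    data="dags trigger backup_metadata",
)
response.raise_for_status()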

BR-8: Simulate DR for Testing

You can simulate a DR situation by enabling the MWAA_SIMULATE_DR parameter in your .env file as follows:

MWAA_SIMULATE_DR=YES

STACK_NAME_PREFIX=mwaa-2-5-1
AWS_ACCOUNT_ID=123456789101
...

Now re-deploy the project:

cdk deploy --all

BR-9: Monitor the DR StepFunctions Workflow

On the AWS console, monitor the Step Functions workflow deployed as a part of the secondary region stack, which orchestrates creating a new environment in the secondary region and eventually restoring the backed-up data to the newly created environment.

The Airflow UI should show past DAG runs as well as logs, variables, and connections restored from the primary MWAA environment. All the active DAGs in the primary region should also be active in the secondary region.

Warm Standby Tutorial

WS-1: Create Necessary AWS Resources

The Warm Standby approach needs an MWAA environment in each of two AWS regions. If you don't already have MWAA environments, use the quickstart guide or follow these steps to create a new MWAA environment in each of the two regions:

  1. Create an S3 bucket with versioning enabled on the AWS console in your primary region; let's call it mwaa-2-5-1-primary-source (you will probably need to specify a different name, as S3 bucket names must be globally unique). Similarly, create another bucket with versioning enabled in the secondary region; let's call it mwaa-2-5-1-secondary-source.
  2. Assuming you will name your primary MWAA environment mwaa-2-5-1-primary, create an IAM role as documented in the AWS Resources prerequisites section. You will need two roles, one for each MWAA environment in the two regions.
  3. Create an MWAA environment on the AWS console using the S3 bucket and execution role that you created in steps 1 and 2. Choose the default VPC, subnets, and security group in the primary region. Similarly, create another environment in the secondary region.

WS-2: Setup Local Virtual Environment

You will need a virtualenv created within the project, stored under the .venv directory. Creating the virtualenv assumes that there is a python3 executable in your path with access to the venv package. Create your virtualenv as follows:

python3 -m venv .venv

Next, you will need to activate your virtual environment.

MacOS / Linux:

source .venv/bin/activate

Windows:

.venv\Scripts\activate.bat

Once the virtualenv is activated, you can install the required dependencies:

pip install -r requirements.txt
pip install -r requirements-dev.txt

WS-3: Setup Environment Variables

Create a .env file at the root of the project by copying the following contents and making appropriate changes. The configuration parameters are explained in the stack parameters section.

STACK_NAME_PREFIX=mwaa-2-5-1
AWS_ACCOUNT_ID=123456789101
DR_TYPE=WARM_STANDBY

MWAA_VERSION=2.5.1
MWAA_UPDATE_EXECUTION_ROLE=YES

PRIMARY_REGION=us-east-1
PRIMARY_MWAA_ENVIRONMENT_NAME=mwaa-2-5-1-primary
PRIMARY_MWAA_ROLE_ARN=arn:aws:iam::123456789101:role/service-role/mwaa-2-5-1-primary-role
PRIMARY_DAGS_BUCKET_NAME=mwaa-2-5-1-primary-source
PRIMARY_VPC_ID=vpc-012ab34c56d789101
PRIMARY_SUBNET_IDS='["subnet-1234567", "subnet-987654321"]'
PRIMARY_SECURITY_GROUP_IDS='["sg-0123456789"]'

SECONDARY_REGION=us-east-2
SECONDARY_MWAA_ENVIRONMENT_NAME=mwaa-2-5-1-secondary
SECONDARY_MWAA_ROLE_ARN=arn:aws:iam::123456789101:role/service-role/mwaa-2-5-1-secondary-role
SECONDARY_DAGS_BUCKET_NAME=mwaa-2-5-1-secondary-source
SECONDARY_VPC_ID=vpc-1111222233334444
SECONDARY_SUBNET_IDS='["subnet-2222222", "subnet-3333333"]'
SECONDARY_SECURITY_GROUP_IDS='["sg-111222333444"]'
SECONDARY_CREATE_SFN_VPCE=YES

WS-4: Build the Project

At this point, you can synthesize the CloudFormation template for this code:

cdk synth

You can also see what stacks and resources get created by typing:

cdk diff

WS-5: Deploy the Solution

Now you are ready to deploy the stacks. The following command deploys both the primary and secondary region stacks:

cdk deploy --all

WS-6: Explore the Airflow UI

From the MWAA console, explore the Airflow UI; it should have the DAGs deployed by the solution (such as backup_metadata) available.

Feel free to upload additional DAGs and play around to generate some metadata for the backup and restore process. Here is a sample DAG that you can upload to the dags folder of your DAGs S3 bucket.

WS-7: Manual Backup

While backups are taken automatically based on the supplied schedule, you can manually trigger the backup_metadata DAG to force-generate a backup in the corresponding backup (not DAGs) S3 bucket. Explore the data folder in the backup S3 bucket to review the CSV dump generated by the backup DAG. The stack automatically replicates both the DAGs and backup S3 buckets to the secondary region.

WS-8: Simulate DR for Testing

You can simulate a DR situation by setting the MWAA_SIMULATE_DR parameter in your .env file as follows:

MWAA_SIMULATE_DR=YES

STACK_NAME_PREFIX=mwaa-2-5-1
AWS_ACCOUNT_ID=123456789101
...

Now re-deploy the project:

cdk deploy --all

WS-9: Monitor the DR StepFunctions Workflow

On the AWS console, monitor the Step Functions workflow deployed as a part of the secondary region stack, which orchestrates restoring the backed-up data into the existing secondary MWAA environment.

The Airflow UI should show past DAG runs as well as logs, variables, and connections restored from the primary MWAA environment. All the active DAGs in the primary region should also be active in the secondary region.

Clean Up

You can clean up the resources deployed through this solution by simply deleting the stacks as follows:

cdk destroy --all

[!CAUTION] Destroying the stacks will also delete the backup S3 buckets in both the primary and secondary regions. The DAGs S3 buckets in both regions will remain intact, and the dags/mwaa_dr folder in both buckets will need to be manually deleted. For the backup and restore strategy, the environment created as a result of the restore workflow in the secondary region will also need to be manually deleted, either on the AWS Console or through the AWS CLI.

Limitations and Special Cases

The project offers a custom solution to address the disaster recovery needs for Amazon MWAA. Since it is a non-native solution, there are some important limitations to be aware of:

Data Loss Probability

The project only backs up metadata for tasks that are not actively running in the primary environment; i.e., it excludes task instances in any of the running, restarting, queued, scheduled, up_for_retry, and up_for_reschedule states. Hence, the solution cannot restore an actively running DAG in the secondary environment. If the primary environment fails while actively running some DAGs, then those DAGs will restart at their next scheduled times after cutover to the secondary environment. If those DAGs do not have schedules specified, then the admins will need to manually trigger them in the secondary region.

[!CAUTION] As a side effect of the aforementioned strategy, the metadata of the most recent backup_metadata DAG run will be excluded from the backup, as that DAG is in an active state while it is taking the backup of the metadata.

[!IMPORTANT] Note that, by default, the solution backs up only the variable, connection, slot_pool, log, job, dag_run, trigger, task_instance, task_fail, and xcom tables. The majority of the other tables are auto-generated by the scheduler or the web server and are thus excluded from the list of tables to be backed up. You can add or remove the tables to be backed up by simply returning a custom list from the dr_factory.setup_tables() method corresponding to your MWAA version in the codebase. By default, all DR factories are chained by class inheritance to the base class, DRFactory_2_5.
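
As a hedged illustration of that customization, a subclass could trim the list as below. The module path and helper-method names are assumptions patterned on the table list above, not verified API; consult the codebase for the actual signatures.

# Module path and helper-method names are assumptions; consult the codebase.
from mwaa_dr.v_2_5.dr_factory import DRFactory_2_5


class SlimDRFactory(DRFactory_2_5):
    def setup_tables(self, model):
        # Back up only variables, connections, and DAG run history.
        return [
            self.variable(model),
            self.connection(model),
            self.dag_run(model),
            self.task_instance(model),
        ]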

Special Handling of Variable and Connection Tables

The most recent backup of the primary environment will always overwrite the metadata of the secondary environment, except for the variable and connection tables. These tables may need to be handled specially, and the solution supports three different restore strategies for them:

  1. DO_NOTHING: As the name suggests, this strategy will not restore the variable and connection tables from the backup. This strategy is particularly useful if your MWAA environments have been configured to use AWS Secrets Manager for storing variables and connections, particularly, applicable for the warm standby deployment.

  2. APPEND: In many cases, the secondary Amazon MWAA environment will likely need to interact with different data sources and web services running in the secondary region. Hence, with this strategy, the restore workflow will not overwrite existing entries of the variable and connection tables in the secondary MWAA environment from the backup. This is the default strategy for the warm standby deployment.

  3. REPLACE: This strategy can be used to overwrite existing variable and connections from backup. This is the default strategy for the backup and restore deployment.

The solution automatically reads these configurations from your .env file or environment variables during deployment. To change the default restore behavior for the variable and connection tables, supply an appropriate value for DR_VARIABLE_RESTORE_STRATEGY and DR_CONNECTION_RESTORE_STRATEGY, respectively. Here is an example .env file for a warm standby deployment:

DR_VARIABLE_RESTORE_STRATEGY=DO_NOTHING
DR_CONNECTION_RESTORE_STRATEGY=DO_NOTHING

STACK_NAME_PREFIX=mwaa-2-5-1
AWS_ACCOUNT_ID=123456789101
...

[!NOTE] Please note that the backup and restore deployment only supports the DO_NOTHING and REPLACE strategies, whereas the warm standby deployment supports all three.

[!IMPORTANT] When using the mwaa-dr framework independently of the DR solution, you will need to similarly set the DR_VARIABLE_RESTORE_STRATEGY and DR_CONNECTION_RESTORE_STRATEGY Airflow variables. Note that these two Airflow variables are treated specially and are unaffected by the restore process. In their absence, the default value of APPEND is used during the restore workflow in independent use.
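
A minimal sketch of setting those two Airflow variables programmatically (run within the Airflow environment; the Airflow UI or a secrets backend works just as well):

from airflow.models import Variable

Variable.set("DR_VARIABLE_RESTORE_STRATEGY", "DO_NOTHING")
Variable.set("DR_CONNECTION_RESTORE_STRATEGY", "DO_NOTHING")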

Clean Metadata Tables Required for the Restore Workflow

The solution backs up the variable, connection, slot_pool, log, job, dag_run, trigger, task_instance, task_fail, and xcom tables by default during the backup workflow in the primary region. If any of these tables are non-empty during a recovery workflow in the secondary region, then you will encounter database key constraint violations in the metadata store. To avoid this issue, the Warm Standby workflow automatically cleans up the secondary region MWAA metadata using the cleanup_metadata DAG during execution.

Manually Triggering the Recovery Workflow

There might be an organizational need to manually trigger the recovery workflow rather than relying on the Amazon EventBridge schedule that runs the health check (and the recovery workflow when the health check fails) every 5 minutes by default. To disable this periodic health check and the automated recovery flow, set the HEALTH_CHECK_ENABLED environment variable to NO in the .env file locally or in the environment variable configuration of your CI/CD pipeline. Here is a sample snippet of the expected .env file:

HEALTH_CHECK_ENABLED=NO

STACK_NAME_PREFIX=mwaa-2-5-1
AWS_ACCOUNT_ID=123456789101
# ... elided for brevity

To manually trigger the recovery workflow, find the Step Functions workflow in the secondary region stack and start the new execution by supplying the following input:

{
  "simulate_dr": "YES"
}

This is also a great way to manually test your disaster recovery setup!
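
If you prefer to script the manual trigger, here is a hedged boto3 sketch; the state machine ARN is illustrative, so look up the actual ARN from the secondary region stack:

import boto3

boto3.client("stepfunctions", region_name="us-east-2").start_execution(
    stateMachineArn="arn:aws:states:us-east-2:123456789101:stateMachine:mwaa-dr",  # illustrative
    input='{"simulate_dr": "YES"}',
)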

Using the Metadata Backup and Restore DAGs Independently

You might only want to perform backup and restore operations without deploying the full DR solution. You can run the backup and restore independently in two modes:

For production use in a public web server mode, we recommend using the published mwaa_dr library to create the necessary DAG for backup and restore in your MWAA environment.

For a private web server mode, you can copy the assets/dags/mwaa_dr folder to the dags folder of your DAGs S3 bucket. Also, copy the contents of requirements.txt to the MWAA requirements file.

For both modes, please make sure of the following:

  1. Ensure you have an S3 bucket created to store the backup.
  2. Ensure that your MWAA execution role has read and write permissions on the bucket.
  3. Create an Airflow variable with the key named DR_BACKUP_BUCKET and the value containing the name (not ARN) of the S3 bucket.
  4. You are all set to manually trigger the backup and restore DAGs at any point. The metadata backup will be stored in <backup S3 bucket>/<path_prefix>.

For testing the mwaa_dr library itself, you can run the aws-mwaa-local-runner container locally by simply copying the assets/dags/mwaa_dr folder into the dags folder of the local runner codebase. Also, copy the contents of requirements.txt to the local runner's requirements file. Finally, export an Airflow variable in the startup_script/startup.sh file of the local runner as follows:

export AIRFLOW_VAR_DR_STORAGE_TYPE=LOCAL_FS

After the setup, you are all set to run the backup_metadata and restore_metadata DAGs. The metadata will be stored to and restored from the dags/data/ folder of the aws-mwaa-local-runner codebase.

[!IMPORTANT] Note that this is a great way to test support for a new version of MWAA.

May Need to Restart Environment for Plugins to Work

If you have plugins that rely on variables and connections, particularly with the Backup and Restore approach, you may need to manually restart the MWAA environment after the restore is complete for the solution to work. The plugins get loaded in the secondary MWAA environment immediately after it is created, before the variables and connections have been restored, thus breaking your plugins' dependencies. Restarting the environment will mitigate this issue.

Frequently Asked Questions

This section documents some of the frequently asked questions about the solution:

FAQ-1: Failure to Read Environment Backup

Question: I am trying to test the Backup and Restore DR solution. I have set MWAA_SIMULATE_DR=YES, but I am getting an S3.S3Exception with status code 403 - Access Denied in the Read Environment Backup state.

Answer: For the restore workflow to work, you must have one successful run of the workflow that follows the alternative path after the Check Heartbeat state, where it gets the primary environment details (Get Environment Details) and stores the configuration in S3 (Store Environment Details). In the absence of this configuration file, the Read Environment Backup state will fail with an error.

To resolve this issue, redeploy your stack with MWAA_SIMULATE_DR=NO and wait for the workflow to finish successfully. This run will store the primary environment configuration in the secondary backup S3 bucket. Now redeploy your stack with MWAA_SIMULATE_DR=YES.

FAQ-2: Failure to Create New Environment

Question: I am trying to test the Backup and Restore DR solution. I have set MWAA_SIMULATE_DR=YES, but I am getting the following ValidationException in the Create New Environment state:

An error occurred (ValidationException) when calling the CreateEnvironment operation: Unable to access version <version-string-secondary> of <secondary-region-dags-bucket>/requirements.txt

Answer: This issue occurs when the version of the requirements.txt file in the secondary region DAGs bucket does not match that of the primary region DAGs bucket.

To resolve this issue, please follow these steps:

  1. Modify the create_replication_job_custom_resource function in mwaa_primary_stack to replace on_create with on_update.
  2. Redeploy your stack with MWAA_SIMULATE_DR=NO and wait for the StepFunctions workflow in the secondary region stack to finish successfully at least once. This will ensure that the latest primary environment configuration is stored in the secondary region backup bucket for future use.
  3. If the EventBridge schedule is in a disabled state, enable it from your AWS console in the secondary region so the restore workflow can start again.
  4. Redeploy your stack with MWAA_SIMULATE_DR=YES, which should now pick up the right version of the requirements file from the secondary DAGs bucket.
  5. Revert the change you made to the mwaa_primary_stack by replacing on_update with on_create in the create_replication_job_custom_resource function.

The stack deployment triggers a Step Functions workflow that replicates existing objects from the primary DAGs S3 bucket to the secondary bucket:

ReplicationWorkflow

Development Notes

The contributing guide explains the process of forking the project before creating a pull request. After you have cloned your forked repository locally and made some code changes, please ensure that you have run the following commands, supplied in the build.sh script:

Lint and unit tests

python3 -m venv venv # Create venv
source ./venv/bin/activate # Activate venv
pip install -r requirements.txt # Install requirements.txt
pip install -r requirements-dev.txt # Install requirements-dev.txt
./build.sh lint # To run linting
./build.sh unit # To run unit tests

Please ensure that the code coverage has not decreased after your changes before creating a pull request.

Integration tests

Please clone the aws-mwaa-local-runner repository and run the setup script with:

git clone https://github.com/aws/aws-mwaa-local-runner.git
./build.sh setup <VERSION>

[!IMPORTANT]

  • You will need to have Docker running for both unit and integration tests to work.
  • You may need to enable Windows Subsystem for Linux to run the build scripts in a Windows OS.
  • Make sure port 8080 is not used by another process.
  • View the container startup with the docker logs command.
  • Monitor the containers with docker ps.
  • After the aws-mwaa-local-runner container is up and healthy, you can access Airflow by navigating to http://localhost:8080.
  • Username/password: admin/test

Please also review the Using the Metadata Backup and Restore DAGs Independently section on how to run the backup and restore locally for testing the mwaa_dr framework.