Amazon Managed Workflows for Apache Airflow (MWAA) is a managed orchestration service for Apache Airflow. An MWAA deployment comes with meaningful defaults, such as multi-availability-zone (AZ) deployment of Airflow schedulers and auto-scaling of Airflow workers across multiple AZs, all of which help customers minimize the impact of an AZ failure. However, a regional large-scale event (LSE) can still adversely affect the business continuity of critical workflows running on an MWAA environment. To minimize the impact of LSEs, you need a multi-region architecture that automatically detects service disruption in the primary region and automates cut-over to the secondary region. This project offers an automated solution for two key disaster recovery strategies for MWAA: Backup and Restore and Warm Standby. Let's review the solution architectures and dive deep into the two strategies next.
This solution is part of an AWS blog series on MWAA Disaster Recovery. Please review both Part 1 and Part 2 of the series before diving into the details of the solution.
[!NOTE] The project currently supports the following versions of MWAA:
- 2.8.1
- 2.7.2
- 2.6.3
- 2.5.1
- 2.4.3
In this section, we discuss two highly resilient, multi-region deployment architectures for MWAA. These architectures can achieve recovery time and recovery point objectives ranging from minutes (Warm Standby) to an hour (Backup and Restore), depending on the volume of historical data to be backed up and restored. Let's discuss the two strategies in detail next.
The general idea behind the backup and restore approach is to have the MWAA environment running in the primary region periodically backup its metadata to an S3 bucket in that region, sync the metadata to the secondary region's S3 bucket, and eventually use the backed up metadata to recreate an identical environment in the secondary region when the primary region fails. This approach can afford an RTO of 30+ minutes depending on the size of metadata to be restored. We assume that you have a running MWAA environment with the associated S3 bucket for hosting DAGs to start with. There are two key workflows to consider in this architecture as shown in the diagram below:
To recreate a new environment in the secondary region when the primary environment fails, you have to maintain a backup of the primary metadata store. Flow 1 involves an Airflow DAG that takes a backup of the metadata tables and stores them in an S3 bucket, which is used to restore the MWAA state in the secondary region when needed.
[1.a] Assuming you host your DAGs code in a source code repository, your CICD pipeline deploys the code changes to the S3 bucket configured to host DAGs, plugins, and the requirements file.
[1.b] For this architecture, we also assume that you have another S3 bucket deployed to the secondary region to host DAGs, plugins, and the requirements file with bucket versioning enabled. As a part of the CDK deployment of this project, we enable cross-region replication from primary to the secondary region buckets. Additionally, a StepFunctions workflow is triggered during the primary stack deployment to perform a one time replication of existing objects from primary DAGs bucket to the secondary DAGs bucket. Any new changes to the primary DAGs bucket are automatically replicated in the secondary region.
[1.c] The CDK deployment of the primary stack deploys the mwaa_dr framework package to the primary DAGs S3 bucket. This framework includes the backup_metadata DAG, which periodically takes a backup of the metadata store. The backup interval is configurable and should be based on the recovery point objective (RPO) -- the amount of data loss during a failure that the business can sustain.
[1.d] The CDK deployment of the primary stack also creates a backup S3 bucket to store backups of the metadata tables. The secondary stack creates another backup S3 bucket in the secondary region. Similar to the DAGs S3 bucket, the backup bucket is replicated to the corresponding secondary region bucket using S3 cross-region replication, ensuring that the backup of the Amazon MWAA metadata is available in the secondary region.
BR Flow 1 helps with backing up the state of the primary MWAA environment. Flow 2 detects a failure in the primary environment and triggers the recovery of the MWAA environment in the secondary region. The recovery involves creating a new MWAA environment from the stored configuration of the primary environment (as a part of Flow 2) and eventually rehydrating the new environment with the metadata backed up from the primary environment (Flow 1).
[2.a] The secondary CDK stack deploys a StepFunctions workflow in the secondary region, which is executed periodically based on the supplied EventBridge schedule. The schedule interval should be based on your target RTO, i.e., RTO >= EventBridgeScheduleInterval + EnvironmentCreationTime + MetadataRestoreTime. (For example, with the default 5-minute health-check schedule, roughly 20-30 minutes of MWAA environment creation, and a few minutes of metadata restore, plan for an RTO on the order of 30-40 minutes; actual times vary with environment size and metadata volume.)
[2.b] The workflow, using an AWS Lambda function, retrieves the SchedulerHeartBeat CloudWatch metrics from the primary MWAA environment.
[2.c.1] When the heartbeat signals from the primary MWAA environment are detected in the metrics, the workflow moves on to the subflow that stores the environment configuration, as follows:
[2.d] The workflow makes a GetEnvironment API call through a Lambda function. The API returns the status, among other configuration details, of the MWAA environment.
[2.e] The workflow stores the environment configuration in the backup bucket to be later used during recovery for creating a new MWAA environment in the secondary region (step 2.h) and ends.
[2.c.2] When the heartbeat signals from the primary MWAA environment are not detected in the metrics, the workflow moves on to the recovery subflow as follows (note that it can take up to 5 minutes to detect an environment failure):
[2.f] As a first step of the recovery subflow, the EventBridge schedule is disabled to prevent subsequent duplicate recovery flow.
[2.g] The environment configuration stored in the backup bucket during the previous successful heartbeat check (step 2.e) is read to recreate a new MWAA environment.
[2.h] The workflow then uses a Lambda function to create a new MWAA environment using the CreateEnvironment API and waits for the supplied polling interval. The API call uses the environment configuration from the backup bucket with some changes to incorporate the settings of the secondary region such as the different S3 DAGs bucket, VPC, subnets, and security groups that are supplied as secondary stack configuration.
[2.i] Using GetEnvironment API call in an AWS Lambda function, the workflow gets the status of the environment creation.
[2.j.1] Until the status of the newly created MWAA environment becomes available, the workflow waits for a supplied duration and keeps polling the status in a loop.
[2.j.2] When the MWAA environment becomes available the workflow comes out of the polling loop.
[2.k] The workflow then restores the metadata from the S3 backup bucket into the newly created environment by triggering the restore_metadata DAG. It uses task token integration, in which the workflow waits for the restore DAG to respond with a success or failure notification before completing.
[2.l] The DAG for restoring metadata hydrates the newly created MWAA environment with the metadata stored in the backup bucket and finally, returns a success token back to the StepFunctions workflow, which successfully ends the flow.
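As a rough illustration of the health check in steps 2.b through 2.e, the sketch below queries the primary environment's scheduler heartbeat metric and fetches the environment configuration when heartbeats are present. This is not the solution's actual Lambda code; the metric namespace, dimension names, and environment name below are assumptions for the example.

```python
import boto3
from datetime import datetime, timedelta, timezone

# Illustrative only -- the metric namespace, dimensions, and names are assumptions.
PRIMARY_REGION = "us-east-1"
PRIMARY_ENV_NAME = "mwaa-2-5-1-primary"

cloudwatch = boto3.client("cloudwatch", region_name=PRIMARY_REGION)
mwaa = boto3.client("mwaa", region_name=PRIMARY_REGION)

now = datetime.now(timezone.utc)
stats = cloudwatch.get_metric_statistics(
    Namespace="AmazonMWAA",
    MetricName="SchedulerHeartbeat",
    Dimensions=[
        {"Name": "Environment", "Value": PRIMARY_ENV_NAME},
        {"Name": "Function", "Value": "Scheduler"},
    ],
    StartTime=now - timedelta(minutes=5),
    EndTime=now,
    Period=60,
    Statistics=["Sum"],
)

if stats["Datapoints"]:
    # Heartbeats detected (2.c.1): capture the environment configuration (2.d-2.e).
    environment = mwaa.get_environment(Name=PRIMARY_ENV_NAME)["Environment"]
    print("Primary healthy; configuration captured for", environment["Arn"])
else:
    # No heartbeats (2.c.2): the state machine would branch into the recovery subflow.
    print("No scheduler heartbeat detected; recovery subflow would start.")
```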
In the warm standby approach, we start with two identical MWAA environments, one in the primary and the other in the secondary region. The metadata in the primary region is backed up in an S3 bucket with cross-region replication to a secondary region bucket. In case of a primary MWAA environment failure, the backed up metadata is restored in the secondary MWAA environment to restart the DAG workflows in the secondary region. Since the MWAA environment is already created/warm in the secondary region, this approach can achieve a recovery time objective of 5+ minutes depending on the amount of metadata to be restored. There are two key workflows in this architecture as shown in the diagram below:
In order to restore the primary MWAA environment in the secondary region, you have to maintain a backup of the primary metadata store. Flow 1 involves an Airflow DAG to take backup of the metadata tables and store them in an S3 bucket.
[1.a] Assuming you host your DAGs code in a source code repository, your CICD pipeline deploys the code changes to the S3 bucket configured to host DAGs, plugins, and the requirements file.
[1.b] For this architecture, we also assume that you have another S3 bucket deployed to the secondary region to host DAGs, plugins, and the requirements file with bucket versioning enabled. As a part of the CDK deployment of this project, we enable cross-region replication from primary to the secondary region buckets. Additionally, a StepFunctions workflow is triggered during the primary stack deployment to perform a one time replication of existing objects from primary DAGs bucket to the secondary DAGs bucket. Any new changes to the primary DAGs bucket are automatically replicated in the secondary region.
[1.c] The CDK deployment of the primary stack deploys the mwaa_dr framework package to the primary DAGs S3 bucket. This framework includes the backup_metadata DAG, which periodically takes a backup of the metadata store. The backup interval is configurable and should be based on the recovery point objective (RPO) -- the amount of data loss during a failure that the business can sustain.
[1.d] The CDK deployment of the primary stack also creates a backup S3 bucket to store backups of the metadata tables. The secondary stack creates another backup S3 bucket in the secondary region. Similar to the DAGs S3 bucket, the backup bucket is replicated to the corresponding secondary region bucket using S3 cross-region replication, ensuring that the backup of the Amazon MWAA metadata is available in the secondary region.
As discussed in the previous sections, WS Flow 1 helps back up the metadata of the primary MWAA environment. Flow 2, on the other hand, detects a failure in the primary environment and triggers the recovery of the MWAA environment in the secondary region. The recovery involves rehydrating the standby secondary environment with the metadata backed up from the primary environment.
[2.a] The secondary CDK stack deploys a StepFunctions workflow in the secondary region, which is executed periodically based on the supplied EventBridge schedule. The schedule interval should be based on your target RTO, i.e., RTO >= EventBridgeScheduleInterval + MetadataRestoreTime.
[2.b] The workflow, using an AWS Lambda function, retrieves the SchedulerHeartBeat CloudWatch metrics from the primary MWAA environment.
[2.c.1] When the heartbeat signals from the primary MWAA environment are detected in the metrics, the workflow ends as no further actions are needed.
[2.c.2] When the heartbeat signals from the primary MWAA environment are not detected in the metrics, the workflow executes a recovery subflow as follows:
[2.d] As a first step of the recovery subflow, the EventBridge schedule is disabled to prevent subsequent duplicate recovery flow.
[2.e] The workflow then cleans up the metadata database by triggering the cleanup_metadata DAG. It uses task token integration, in which the workflow waits for the cleanup DAG to respond with a success or failure notification before proceeding. The cleanup before restore is needed to avoid primary key constraint violations in the database.
[2.f] The workflow then restores the metadata from the S3 backup bucket into the standby secondary environment by triggering the restore_metadata DAG. It uses task token integration, in which the workflow waits for the restore DAG to respond with a success or failure notification before completing.
[2.g] The DAG for restoring metadata hydrates the standby MWAA environment with the metadata stored in the backup bucket and finally, returns a success token back to the StepFunctions workflow, which ends the flow successfully.
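To make the task token integration in steps 2.e through 2.g concrete, here is a minimal sketch of how a cleanup or restore DAG task might report back to the waiting Step Functions execution. How the token reaches the DAG (assumed here to arrive via the DAG run conf) and the function name are assumptions for illustration, not the framework's actual code.

```python
import json
import boto3


def notify_workflow(task_token: str, region: str, succeeded: bool, detail: str) -> None:
    """Report the outcome of a cleanup/restore DAG run to Step Functions (illustrative)."""
    sfn = boto3.client("stepfunctions", region_name=region)
    if succeeded:
        sfn.send_task_success(taskToken=task_token, output=json.dumps({"status": detail}))
    else:
        sfn.send_task_failure(taskToken=task_token, error="RestoreFailed", cause=detail)


# Example: the token is assumed to be supplied in the DAG run conf by the workflow.
# notify_workflow(dag_run.conf["task_token"], "us-east-2", True, "metadata restore complete")
```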
The lib folder hosts the deployment code for the project. The project performs multi-region deployment of two stacks:
An AWS account with an MWAA environment deployed to the primary region. If you don't have an environment deployed, you can create one using the quickstart guide.
For Warm Standby, another identical MWAA environment deployed to the secondary region. Note that Backup and Restore does not require a running MWAA environment in the secondary region.
DAGs S3 buckets with versioning enabled in both the primary and secondary regions. Copy the packages in assets/requirements.txt into the requirements files already in the DAGs S3 buckets, or upload the provided requirements file to the buckets and configure the MWAA environments to use it.
The security groups used by the MWAA environments. In the secondary region, this can be the default security group of the VPC if one is not already defined, i.e., in the case of backup and restore. You can find the VPC, security groups, and subnet information of an existing MWAA environment on your AWS console.
An MWAA execution role in each of the two regions with the permissions you need for your DAGs. At a minimum, please include the following permission policies (replace `<region>`, `<account>`, `<mwaa-env-name>`, and `<dags-s3-bucket-name>` with appropriate values):
```json
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "airflow:PublishMetrics",
            "Resource": "arn:aws:airflow:<region>:<account>:environment/<mwaa-env-name>"
        },
        {
            "Effect": "Allow",
            "Action": [
                "s3:GetObject*",
                "s3:GetBucket*",
                "s3:List*"
            ],
            "Resource": [
                "arn:aws:s3:::<dags-s3-bucket-name>",
                "arn:aws:s3:::<dags-s3-bucket-name>/*"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "logs:CreateLogStream",
                "logs:CreateLogGroup",
                "logs:PutLogEvents",
                "logs:GetLogEvents",
                "logs:GetLogRecord",
                "logs:GetLogGroupFields",
                "logs:GetQueryResults"
            ],
            "Resource": [
                "arn:aws:logs:<region>:<account>:log-group:airflow-*"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "logs:DescribeLogGroups",
                "cloudwatch:PutMetricData"
            ],
            "Resource": [
                "*"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "sqs:ChangeMessageVisibility",
                "sqs:DeleteMessage",
                "sqs:GetQueueAttributes",
                "sqs:GetQueueUrl",
                "sqs:ReceiveMessage",
                "sqs:SendMessage"
            ],
            "Resource": "arn:aws:sqs:<region>:*:airflow-celery-*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "kms:Decrypt",
                "kms:DescribeKey",
                "kms:GenerateDataKey*",
                "kms:Encrypt"
            ],
            "NotResource": "arn:aws:kms:*:<account>:key/*",
            "Condition": {
                "StringLike": {
                    "kms:ViaService": [
                        "sqs.<region>.amazonaws.com"
                    ]
                }
            }
        }
    ]
}
```
Also add the following trust policy to the role:
```json
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Service": [
                    "airflow.amazonaws.com",
                    "airflow-env.amazonaws.com"
                ]
            },
            "Action": "sts:AssumeRole"
        }
    ]
}
```
The parameters for the solution are externalized as environment variables. You can specify these parameters as environment variables in your CICD pipeline or create a `.env` file with appropriate keys and values at the root of this project for a deployment from your machine. You can find more details in the implementation sections BR-3: Setup Environment Variables and WS-3: Setup Environment Variables. Let's review the required parameters first, followed by the optional ones.
Here are the required parameters that apply to both the primary and secondary region stacks:
| Variable Name | Example Values | Description |
|---|---|---|
| `AWS_ACCOUNT_ID` | `111222333444` | Your AWS account id. |
| `DR_TYPE` | `BACKUP_RESTORE`, `WARM_STANDBY` | The disaster recovery strategy to be deployed. |
| `MWAA_UPDATE_EXECUTION_ROLE` | `YES` or `NO` | Flag to denote whether to update the existing MWAA execution role with new policies for allowing task token return calls from the StepFunctions workflow in the secondary stack. See the Automated Updates to the Execution Role for details. |
| `MWAA_VERSION` | `2.4.3`, `2.5.1`, `2.6.3`, `2.7.2`, `2.8.1` | The deployed version of MWAA. |
| `PRIMARY_DAGS_BUCKET_NAME` | `mwaa-2-5-1-primary-bucket` | The name of the DAGs S3 bucket used by the environment in the primary region. |
| `PRIMARY_MWAA_ENVIRONMENT_NAME` | `mwaa-2-5-1-primary` | The name of the MWAA environment in the primary region. |
| `PRIMARY_MWAA_ROLE_ARN` | `arn:aws:...:role/service-role/primary-role` | The ARN of the execution role used by the primary MWAA environment. |
| `PRIMARY_REGION` | `us-east-1`, `us-east-2`, ... | The primary AWS region. |
| `PRIMARY_SECURITY_GROUP_IDS` | `'["sg-0123456789"]'` | The IDs of the security groups used by the primary MWAA environment. Note that the brackets, `[]`, are necessary to denote a list even for a single-element list. |
| `PRIMARY_SUBNET_IDS` | `'["subnet-1234567", "subnet-987654321"]'` | The IDs of the VPC subnets where the primary MWAA environment is deployed. Note that the brackets, `[]`, are necessary to denote a list even for a single-element list. |
| `PRIMARY_VPC_ID` | `vpc-012ab34c56d789101` | The ID of the VPC where the primary MWAA environment is deployed. |
| `SECONDARY_CREATE_SFN_VPCE` | `YES` or `NO` | Flag to denote whether to create a VPC endpoint for Step Functions. The VPCE is particularly important for MWAA running in private mode, where workers may not have internet access to send the task token response to the Step Functions workflow orchestrating the restore. If `NO` is chosen, then you will need to manually create the VPC endpoint. Enabling this flag may modify your VPC's security group. See the Automated Update to the VPC Security Group for details. |
| `SECONDARY_DAGS_BUCKET_NAME` | `mwaa-2-5-1-secondary-bucket` | The name of the DAGs S3 bucket used by the environment in the secondary region. |
| `SECONDARY_MWAA_ENVIRONMENT_NAME` | `mwaa-2-5-1-secondary` | The name of the MWAA environment in the secondary region. |
| `SECONDARY_MWAA_ROLE_ARN` | `arn:aws:...:role/service-role/secondary-role` | The ARN of the execution role used by the secondary MWAA environment. |
| `SECONDARY_REGION` | `us-west-1`, `us-west-2`, ... | The secondary AWS region for disaster recovery. |
| `SECONDARY_SECURITY_GROUP_IDS` | `'["sg-0123456789"]'` | The IDs of the security groups used by the secondary MWAA environment. Note that the brackets, `[]`, are necessary to denote a list even for a single-element list. |
| `SECONDARY_SUBNET_IDS` | `'["subnet-1234567", "subnet-987654321"]'` | The IDs of the VPC subnets in the secondary region where the MWAA environment is deployed. Note that the brackets, `[]`, are necessary to denote a list even for a single-element list. |
| `SECONDARY_VPC_ID` | `vpc-012ab34c56d789101` | The ID of the VPC where the secondary MWAA environment is deployed. |
| `STACK_NAME_PREFIX` | `mwaa-2-5-1-data-team` | A name prefix for the deployment stacks. This prefix will be used for primary and secondary stacks as well as their resources. |
Here are the optional parameters that apply to both the primary and secondary region stacks:
| Variable Name | Default Value | Example Values | Description |
|---|---|---|---|
| `DR_CONNECTION_RESTORE_STRATEGY` | `APPEND` | `DO_NOTHING`, `APPEND`, or `REPLACE` | The strategy to use to restore the connection table during the recovery workflow. Review Special Handling of Variable and Connection Tables for details. |
| `DR_VARIABLE_RESTORE_STRATEGY` | `APPEND` | `DO_NOTHING`, `APPEND`, or `REPLACE` | The strategy to use to restore the variable table during the recovery workflow. Review Special Handling of Variable and Connection Tables for details. |
| `HEALTH_CHECK_ENABLED` | `YES` | `YES` or `NO` | Whether to enable periodic health checks of the primary MWAA environment from the secondary region. If set to `NO`, a primary region failure will go undetected and the onus is on admins to manually trigger the recovery workflow. |
| `HEALTH_CHECK_INTERVAL_MINS` | `5` | time interval in minutes | Health check frequency of the primary MWAA environment in minutes. |
| `HEALTH_CHECK_MAX_RETRY` | `2` | number | The maximum number of retries after the health check of the primary region MWAA fails before moving on to the disaster recovery flow. |
| `HEALTH_CHECK_RETRY_BACKOFF_RATE` | `2` | number | Health check retry exponential backoff rate (exponential backoff common ratio). |
| `HEALTH_CHECK_RETRY_INTERVAL_SECS` | `5` | time interval in seconds | Health check retry interval (exponential backoff coefficient) on failure. |
| `METADATA_CLEANUP_DAG_NAME` | `cleanup_metadata` | a DAG name | Name of the DAG that cleans up the metadata store. |
| `METADATA_EXPORT_DAG_NAME` | `backup_metadata` | a DAG name | Name of the DAG that exports metadata. |
| `METADATA_IMPORT_DAG_NAME` | `restore_metadata` | a DAG name | Name of the DAG that imports metadata. |
| `MWAA_BACKUP_FILE_NAME` | `environment.json` | a JSON file name | Name of the JSON file used for storing environment details in the backup S3 bucket. |
| `MWAA_CREATE_ENV_POLLING_INTERVAL_SECS` | `60` | interval in seconds | Wait time before checking the status of the MWAA environment in the polling loop during creation. |
| `MWAA_DAGS_S3_PATH` | `dags` | `path/to/dags` | Path to the folder in the DAGs S3 bucket where DAGs are deployed. |
| `MWAA_NOTIFICATION_EMAILS` | `[]` | `'["ad@eg.com"]'`, `'["ad@eg.com", "ops@eg.com"]'` | Comma-separated list of emails. Note that the brackets, `[]`, are necessary to denote a list even for a single-element list. |
| `MWAA_SIMULATE_DR` | `NO` | `YES` or `NO` | Whether to simulate a DR by artificially forcing a health check failure for the MWAA environment in the primary region. Only use for testing. |
| `PRIMARY_BACKUP_SCHEDULE` | `'0 * * * *'` | `@hourly`, `@daily`, or any cron expression | Cron schedule for taking a backup of the metadata store. |
| `PRIMARY_REPLICATION_POLLING_INTERVAL_SECS` | `30` | wait time in seconds | The polling interval in seconds for checking the status of the one-time replication job during primary stack deployment. |
| `SECONDARY_CLEANUP_COOL_OFF_SECS` | `30` | wait time in seconds | The cool-off time in seconds between the metadata store cleanup operation and the restore operation in the recovery workflow. |
| `STATE_MACHINE_TIMEOUT_MINS` | `60` | timeout in minutes | The restore Step Functions workflow timeout in minutes. |
Note that the secondary region stack will add an additional policy statement to the MWAA execution role for the secondary region if the configuration parameter `MWAA_UPDATE_EXECUTION_ROLE` is set to `YES`. If you intend to set this parameter to `NO`, then please add the following policy entry to the secondary MWAA execution role:
```json
{
    "Effect": "Allow",
    "Action": [
        "states:SendTaskFailure",
        "states:SendTaskHeartbeat",
        "states:SendTaskSuccess"
    ],
    "Resource": ["arn:aws:states:*:<account>:stateMachine:*"]
}
```
Note that if you supplied a VPC security group for your MWAA environment and the security group does not allow inbound HTTPS traffic (port 443) originating from within the VPC CIDR range, then the stack will add a new rule to the security group to allow it. The HTTPS traffic is required for the Step Functions interface endpoint, which makes Step Functions accessible to your private network through AWS PrivateLink.
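For reference, the rule the stack adds is equivalent to the following sketch; if you prefer to manage the security group yourself, you can add it manually. The security group ID, region, and CIDR below are placeholders, not values from the solution.

```python
import boto3

# Illustrative only -- replace the group ID and CIDR with your secondary VPC values.
ec2 = boto3.client("ec2", region_name="us-east-2")
ec2.authorize_security_group_ingress(
    GroupId="sg-0123456789",  # MWAA security group (placeholder)
    IpPermissions=[{
        "IpProtocol": "tcp",
        "FromPort": 443,
        "ToPort": 443,
        "IpRanges": [{
            "CidrIp": "10.0.0.0/16",  # VPC CIDR (placeholder)
            "Description": "HTTPS to the Step Functions interface endpoint",
        }],
    }],
)
```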
The project uses the Cloud Development Kit (CDK) and is set up like a standard Python project. Assuming that you have AWS credentials for deploying the project set up for your command shell, follow these steps to build and deploy the solution to your AWS account.
If your account has not been set up to use CDK yet, you will need to perform a one-time CDK bootstrapping for both the primary and secondary regions using the following command: `cdk bootstrap aws://<account>/<primary-region> aws://<account>/<secondary-region>`. Here's an example:
cdk bootstrap aws://123456789999/us-east-1 aws://123456789999/us-east-2
Let's clone the project in your local machine as follows:
git clone https://github.com/aws-samples/mwaa-disaster-recovery.git
cd mwaa-disaster-recovery
This deployment guide walks through first deploying the stack in backup and restore mode followed by warm standby.
If you don't already have an MWAA environment, use the quickstart guide or follow these steps to create a new MWAA environment:

- Create an S3 bucket with versioning enabled in the primary region to host your DAGs, let's call it `mwaa-2-5-1-primary-source` (you will probably need to specify a different name as S3 bucket names must be globally unique).
- For the primary MWAA environment, let's call it `mwaa-2-5-1-primary`, create an IAM role as documented in the AWS Resources pre-requisites section.

You will need a virtualenv created within the project, which is stored under the `.venv` directory. To create the virtualenv, it is assumed that there is a `python3` executable in your path with access to the `venv` package. Create your virtualenv as follows:
python3 -m venv .venv
Next, you will need to activate your virtual environment.
MacOS / Linux:
source .venv/bin/activate
Windows:
.venv\Scripts\activate.bat
Once the virtualenv is activated, you will need to install the required dependencies:
pip install -r requirements.txt
pip install -r requirements-dev.txt
Create a `.env` file at the root of the project by copying the following contents and making appropriate changes. The configuration parameters are explained in the stack parameters section.
STACK_NAME_PREFIX=mwaa-2-5-1
AWS_ACCOUNT_ID=123456789101
DR_TYPE=BACKUP_RESTORE
MWAA_VERSION=2.5.1
MWAA_UPDATE_EXECUTION_ROLE=YES
PRIMARY_REGION=us-east-1
PRIMARY_MWAA_ENVIRONMENT_NAME=mwaa-2-5-1-primary
PRIMARY_MWAA_ROLE_ARN=arn:aws:iam::123456789101:role/service-role/mwaa-2-5-1-primary-role
PRIMARY_DAGS_BUCKET_NAME=mwaa-2-5-1-primary-source
PRIMARY_VPC_ID=vpc-012ab34c56d789101
PRIMARY_SUBNET_IDS='["subnet-1234567", "subnet-987654321"]'
PRIMARY_SECURITY_GROUP_IDS='["sg-0123456789"]'
SECONDARY_REGION=us-east-2
SECONDARY_MWAA_ENVIRONMENT_NAME=mwaa-2-5-1-secondary
SECONDARY_MWAA_ROLE_ARN=arn:aws:iam::123456789101:role/service-role/mwaa-2-5-1-secondary-role
SECONDARY_DAGS_BUCKET_NAME=mwaa-2-5-1-secondary-source
SECONDARY_VPC_ID=vpc-1111222233334444
SECONDARY_SUBNET_IDS='["subnet-2222222", "subnet-3333333"]'
SECONDARY_SECURITY_GROUP_IDS='["sg-111222333444"]'
SECONDARY_CREATE_SFN_VPCE=YES
At this point, you can synthesize the CloudFormation template for this code:
cdk synth
You can also see what stacks and resources get created by typing:
cdk diff
Now you are ready to deploy the stacks. The following command deploys both the primary and the secondary region stacks:
cdk deploy --all
From the MWAA console, explore the Airflow UI; it should have the following DAGs available:

Feel free to upload additional DAGs and play around to generate some metadata for the backup and restore process. Here is a sample DAG that you can upload to the `dags` folder of your DAGs S3 bucket.
While the backups are taken automatically based on the supplied schedule, you can manually trigger the `backup_metadata` DAG to force-generate a backup in the corresponding backup (not DAGs) S3 bucket. Explore the `data` folder in the backup S3 bucket to review the CSV dumps generated by the backup DAG. The stack automatically replicates both the DAGs and backup S3 buckets to the secondary region.
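To verify that backups are landing where expected, you can list the objects under the `data/` prefix of the backup bucket; the bucket name below is a placeholder, so substitute the backup bucket created by your stack.

```python
import boto3

# Placeholder bucket name -- use the backup bucket created by the primary stack.
s3 = boto3.client("s3")
response = s3.list_objects_v2(Bucket="mwaa-2-5-1-primary-backup-bucket", Prefix="data/")
for obj in response.get("Contents", []):
    print(f"{obj['Key']}  ({obj['Size']} bytes, last modified {obj['LastModified']})")
```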
You can simulate a DR situation by enabling the `MWAA_SIMULATE_DR` parameter in your `.env` file as follows:
MWAA_SIMULATE_DR=YES
STACK_NAME_PREFIX=mwaa-2-5-1
AWS_ACCOUNT_ID=123456789101
...
Now re-deploy the project:
cdk deploy --all
On the AWS console, monitor the Step Functions workflow deployed as a part of the secondary region stack, which will orchestrate creating a new environment in the secondary region and eventually restoring the backup data to the newly created environment.
The Airflow UI should show past DAG runs as well as logs, variables, and connections restored from the primary MWAA environment. All the active DAGs in the primary region should also be active in the secondary region.
The Warm Standby approach needs an MWAA environment in each of the two AWS regions. If you don't already have MWAA environments, use the quickstart guide or follow these steps to create a new MWAA environment in each of the two AWS regions:
- Create an S3 bucket with versioning enabled in the primary region to host your DAGs, let's call it `mwaa-2-5-1-primary-source` (you will probably need to specify a different name as S3 bucket names must be globally unique). Similarly, create another bucket with versioning enabled in the secondary region, let's call it the `mwaa-2-5-1-secondary-source` bucket.
- For the primary MWAA environment, let's call it `mwaa-2-5-1-primary`, create an IAM role as documented in the AWS Resources pre-requisites section. You will need two roles, one for each MWAA environment in the two regions.

You will need a virtualenv created within the project, which is stored under the `.venv` directory. To create the virtualenv, it is assumed that there is a `python3` executable in your path with access to the `venv` package. Create your virtualenv as follows:
python3 -m venv .venv
Next, you will need to activate your virtual environment.
MacOS / Linux:
source .venv/bin/activate
Windows:
.venv\Scripts\activate.bat
Once the virtualenv is activated, you can install the required dependencies:
pip install -r requirements.txt
pip install -r requirements-dev.txt
Create a `.env` file at the root of the project by copying the following contents and making appropriate changes. The configuration parameters are explained in the stack parameters section.
STACK_NAME_PREFIX=mwaa-2-5-1
AWS_ACCOUNT_ID=123456789101
DR_TYPE=WARM_STANDBY
MWAA_VERSION=2.5.1
MWAA_UPDATE_EXECUTION_ROLE=YES
PRIMARY_REGION=us-east-1
PRIMARY_MWAA_ENVIRONMENT_NAME=mwaa-2-5-1-primary
PRIMARY_MWAA_ROLE_ARN=arn:aws:iam::123456789101:role/service-role/mwaa-2-5-1-primary-role
PRIMARY_DAGS_BUCKET_NAME=mwaa-2-5-1-primary-source
PRIMARY_VPC_ID=vpc-012ab34c56d789101
PRIMARY_SUBNET_IDS='["subnet-1234567", "subnet-987654321"]'
PRIMARY_SECURITY_GROUP_IDS='["sg-0123456789"]'
SECONDARY_REGION=us-east-2
SECONDARY_MWAA_ENVIRONMENT_NAME=mwaa-2-5-1-secondary
SECONDARY_MWAA_ROLE_ARN=arn:aws:iam::123456789101:role/service-role/mwaa-2-5-1-secondary-role
SECONDARY_DAGS_BUCKET_NAME=mwaa-2-5-1-secondary-source
SECONDARY_VPC_ID=vpc-1111222233334444
SECONDARY_SUBNET_IDS='["subnet-2222222", "subnet-3333333"]'
SECONDARY_SECURITY_GROUP_IDS='["sg-111222333444"]'
SECONDARY_CREATE_SFN_VPCE=YES
At this point, you can synthesize the CloudFormation template for this code:
cdk synth
You can also see what stacks and resources get created by typing:
cdk diff
Now you are ready to deploy the stacks. The following command deploys both the primary and the secondary region stacks:
cdk deploy --all
From the MWAA console, explore the Airflow UI; it should have the following DAGs available:

Feel free to upload additional DAGs and play around to generate some metadata for the backup and restore process. Here is a sample DAG that you can upload to the `dags` folder of your DAGs S3 bucket.
While the backups are taken automatically based on the supplied schedule, you can manually trigger the `backup_metadata` DAG to force-generate a backup in the corresponding backup (not DAGs) S3 bucket. Explore the `data` folder in the backup S3 bucket to review the CSV dumps generated by the backup DAG. The stack automatically replicates both the DAGs and backup S3 buckets to the secondary region.
You can simulate a DR situation by setting the `MWAA_SIMULATE_DR` parameter in your `.env` file as follows:
MWAA_SIMULATE_DR=YES
STACK_NAME_PREFIX=mwaa-2-5-1
AWS_ACCOUNT_ID=123456789101
...
Now re-deploy the project:
cdk deploy --all
On the AWS console, monitor the Step Functions workflow deployed as a part of the secondary region stack, which will orchestrate restoring the backup data to the existing secondary MWAA environment.
The Airflow UI should show past DAG runs as well as logs, variables, and connections restored from the primary MWAA environment. All the active DAGs in the primary region should also be active in the secondary region.
You can clean up the resources deployed through this solution by simply deleting the stacks as follows:
cdk destroy --all
[!CAUTION] Destroying the stacks will also delete the backup S3 buckets in both primary and secondary regions. The DAGs S3 buckets in both regions will remain intact, and the dags/mwaa_dr folder in both buckets will need to be manually deleted. For the backup and restore strategy, the environment created as a result of the restore workflow in the secondary region will also need to be manually deleted, either on the AWS Console or through the AWS CLI.
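If you want to script the leftover cleanup, a sketch along these lines removes the dags/mwaa_dr prefix from a DAGs bucket; the bucket name is a placeholder and the prefix assumes the default MWAA_DAGS_S3_PATH of dags.

```python
import boto3

# Placeholder bucket name; repeat for both the primary and secondary DAGs buckets.
s3 = boto3.resource("s3")
bucket = s3.Bucket("mwaa-2-5-1-primary-source")
bucket.objects.filter(Prefix="dags/mwaa_dr/").delete()
```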
The project offers a custom solution to address the disaster recovery needs of Amazon MWAA. Since it is a non-native solution, there are some important limitations to be aware of:
The project only takes a metadata backup of the tasks that are not actively running in the primary environment, i.e., it excludes task instances in any of the `running`, `restarting`, `queued`, `scheduled`, `up_for_retry`, and `up_for_reschedule` states. Hence, the solution cannot restore an actively running DAG in the secondary environment. If the primary environment fails while actively running some DAGs, then those DAGs will restart at their next specified schedules after cut-over to the secondary environment. If those DAGs do not have schedules specified, then the admins will need to manually trigger them in the secondary location.
[!CAUTION] As a side effect of the aforementioned strategy, the metadata of the most recent backup_metadata DAG run will be excluded from the backup, as that DAG is in an active state while it is taking the backup of the metadata.
[!IMPORTANT] Note that, by default, the solution backs up only the `variable`, `connection`, `slot_pool`, `log`, `job`, `dag_run`, `trigger`, `task_instance`, `task_fail`, and `xcom` tables. The majority of other tables are auto-generated by the scheduler or the webserver and are thus excluded from the list of tables to be backed up. You can add or remove the tables to be backed up by simply returning a custom list in the `dr_factory.setup_tables()` method corresponding to your MWAA version in the codebase. By default, all DR factories are chained by class inheritance with the base class, DRFactory_2_5.
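As a rough illustration only (the import path and `setup_tables()` signature below are assumptions, so check the factory classes in the codebase for the real API), a customized factory might subclass the version-specific factory and filter the table list:

```python
# Illustrative sketch: the module path, constructor, and setup_tables signature are assumptions --
# refer to the mwaa_dr factory classes in the codebase for the actual API.
from mwaa_dr.v_2_5.dr_factory import DRFactory_2_5  # hypothetical import path


class CustomDRFactory(DRFactory_2_5):
    def setup_tables(self, model):
        # Start from the default list of tables and drop the ones you do not need.
        tables = super().setup_tables(model)
        return [table for table in tables if table.name != "xcom"]  # attribute name assumed
```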
The most recent backup of the primary environment will always override the metadata of the secondary environment except for the `variable` and `connection` tables. These tables may need to be handled specially, and the solution supports three different restore strategies for them as follows:
DO_NOTHING: As the name suggests, this strategy will not restore the variable and connection tables from the backup. This strategy is particularly useful if your MWAA environments have been configured to use AWS Secrets Manager for storing variables and connections, and is especially applicable to the warm standby deployment.
APPEND: In many cases, the secondary Amazon MWAA environment will likely need to interact with different data sources and web services running in the secondary region. Hence, with this strategy, the restore workflow will not overwrite existing entries of the variable and connection tables in the secondary MWAA environment from the backup. This is the default strategy for the warm standby deployment.
REPLACE: This strategy can be used to overwrite existing variables and connections from the backup. This is the default strategy for the backup and restore deployment.
The solution automatically reads these configurations from your `.env` file or environment variables during deployment. To change the default restore behavior for the `variable` and `connection` tables, you will need to supply an appropriate value for `DR_VARIABLE_RESTORE_STRATEGY` and `DR_CONNECTION_RESTORE_STRATEGY`, respectively. Here is an example `.env` file for a warm standby deployment:
DR_VARIABLE_RESTORE_STRATEGY=DO_NOTHING
DR_CONNECTION_RESTORE_STRATEGY=DO_NOTHING
STACK_NAME_PREFIX=mwaa-2-5-1
AWS_ACCOUNT_ID=123456789101
...
[!NOTE] Please note that the backup and restore deployment only supports the `DO_NOTHING` and `REPLACE` strategies, whereas the warm standby deployment supports all three.

[!IMPORTANT] For using the mwaa_dr framework independently of the DR solution, you will need to similarly set the `DR_VARIABLE_RESTORE_STRATEGY` and `DR_CONNECTION_RESTORE_STRATEGY` Airflow variables. Note that these two Airflow variables are treated specially and are unaffected by the restore process. In their absence, the default value of `APPEND` is used during the restore workflow in the independent use.
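When running the DAGs independently, one way to set these two Airflow variables is from a small one-off script or DAG task using the standard Airflow API (you can equally create them in the Airflow UI):

```python
from airflow.models import Variable

# Pin the restore strategies for standalone use of the mwaa_dr DAGs.
Variable.set("DR_VARIABLE_RESTORE_STRATEGY", "DO_NOTHING")
Variable.set("DR_CONNECTION_RESTORE_STRATEGY", "APPEND")
```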
The solution backs up the `variable`, `connection`, `slot_pool`, `log`, `job`, `dag_run`, `trigger`, `task_instance`, `task_fail`, and `xcom` tables by default during the backup workflow in the primary region. If any of these tables are non-empty during a recovery workflow in the secondary region, then you will encounter database key constraint violations in the metadata store. To avoid this issue, the Warm Standby workflow automatically cleans up the secondary region MWAA metadata using the cleanup_metadata DAG during execution.
There might be an organizational need to manually trigger the recovery workflow rather than relying on the Amazon EventBridge schedule that runs the health check (and the recovery workflow when the health check fails) every 5 minutes by default. To disable this periodic health check and automated recovery flow, set the `HEALTH_CHECK_ENABLED` environment variable to `NO` in the `.env` file locally or in the environment variable configuration of your CI/CD pipeline. Here is a sample snippet of the expected `.env` file:
HEALTH_CHECK_ENABLED=NO
STACK_NAME_PREFIX=mwaa-2-5-1
AWS_ACCOUNT_ID=123456789101
# ... elided for brevity
To manually trigger the recovery workflow, find the Step Functions workflow in the secondary region stack and start a new execution by supplying the following input:
```json
{
  "simulate_dr": "YES"
}
```
This is also a great way to manually test your disaster recovery setup!
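If you prefer to script the manual trigger instead of using the console, a boto3 call along these lines starts the execution; the state machine ARN is a placeholder to be looked up from the secondary region stack outputs.

```python
import json
import boto3

# Placeholder ARN -- find the recovery state machine in the secondary region stack outputs.
sfn = boto3.client("stepfunctions", region_name="us-east-2")
sfn.start_execution(
    stateMachineArn="arn:aws:states:us-east-2:123456789101:stateMachine:<recovery-workflow-name>",
    input=json.dumps({"simulate_dr": "YES"}),
)
```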
You might need to perform only the backup and restore operations without the full DR solution. You can run the backup and restore DAGs independently in two modes:
For production use in a public web server mode, we recommend using the published mwaa_dr library to create the necessary DAG for backup and restore in your MWAA environment.
For a private webserver mode, you can copy the assets/dags/mwaa_dr folder to your S3 bucket's `dags` folder. Also, copy the contents of requirements.txt to the MWAA requirements file.
For both modes, please make sure of the following:

- Create an Airflow variable with the key `DR_BACKUP_BUCKET` and the value containing the name (not ARN) of the S3 bucket to use for storing backups.
- Note that backups will be stored at `<backup S3 bucket>/<path_prefix>`.
For testing the `mwaa_dr` library itself, you can run the aws-mwaa-local-runner container locally by simply copying the assets/dags/mwaa_dr folder into the `dags` folder of the local runner codebase. Also, copy the contents of requirements.txt to the local runner's requirements file. Finally, export an Airflow variable in the `startup_script/startup.sh` file of the local runner as follows:
export AIRFLOW_VAR_DR_STORAGE_TYPE=LOCAL_FS
After the setup, you are all set to run the backup_metadata and restore_metadata DAGs. The metadata will be stored to and restored from the `dags/data/` folder of the aws-mwaa-local-runner codebase.
[!IMPORTANT] Note that this is a great way to test support for a new version of MWAA.
If you have plugins that rely on variables and connections, particularly for the Backup and Restore approach, you may need to manually restart the MWAA environment after the restore is complete for the solution to work. The plugins get loaded in the secondary MWAA environment immediately after it is created, before the variables and connections can be restored, thus breaking your plugins' dependencies. Restarting the environment will help mitigate this issue.
This section documents some of the frequently asked questions about the solution:
Question:

I am trying to test the Backup and Restore DR solution. I have set `MWAA_SIMULATE_DR=YES`, but I am getting `S3.S3Exception` with status code `403 - Access Denied` in the `Read Environment Backup` state as follows:
Answer:
For the restore workflow to work, you must have one successful run of the workflow that follows the alternative path after the `Check Heartbeat` state, where it gets the primary environment details (`Get Environment Details`) and stores the configuration in S3 (`Store Environment Details`). In the absence of this configuration file, the `Read Environment Backup` state will fail with an error.
To resolve this issue, redeploy your stack with `MWAA_SIMULATE_DR=NO` and wait for the workflow to finish successfully. This run will store the primary environment configuration in the secondary backup S3 bucket. Now redeploy your stack with `MWAA_SIMULATE_DR=YES`.
Question:
I am trying to test the Backup and Restore DR solution. I have set `MWAA_SIMULATE_DR=YES`, but I am getting the following `ValidationException` in the `Create New Environment` state:
An error occurred (ValidationException) when calling the CreateEnvironment operation: Unable to access version <version-string-secondary> of <secondary-region-dags-bucket>/requirements.txt
Answer:

This issue occurs when the version of the `requirements.txt` file in the secondary region DAGs bucket does not match that of the primary region DAGs bucket.
To resolve this issue, please follow these steps:

1. Modify the `create_replication_job_custom_resource` function in mwaa_primary_stack to replace `on_create` with `on_update`.
2. Redeploy the stacks with `MWAA_SIMULATE_DR=NO` and wait for the StepFunctions workflow in the secondary region stack to finish successfully at least once. This will ensure that the latest primary environment configuration is stored in the secondary region backup bucket for future use.
3. Redeploy the stacks with `MWAA_SIMULATE_DR=YES`, which should now pick up the right version of the requirements file from the secondary DAGs bucket.
4. Finally, revert the change by replacing `on_update` with `on_create` in the `create_replication_job_custom_resource` function.

The stack deployment triggers a StepFunctions workflow that replicates existing objects from the primary S3 DAGs bucket to the secondary bucket:
The contributing guide explains the process of forking the project before creating a pull request. After you have cloned your forked repository locally and made some code changes, please ensure that you have run the following commands from the build.sh script:
python3 -m venv venv # Create venv
source ./venv/bin/activate # Activate venv
pip install -r requirements.txt # Install requirements.txt
pip install -r requirements-dev.txt # Install requirements-dev.txt
./build.sh lint # To run linting
./build.sh unit # To run unit tests
Please ensure that the code coverage has not decreased after your changes before creating a pull request.
Please clone the `aws-mwaa-local-runner` repository with:
git clone https://github.com/aws/aws-mwaa-local-runner.git
./build.sh setup <VERSION>
[!IMPORTANT]
- You will need to have Docker running for both unit and integration tests to work.
- You may need to enable Windows Subsystem for Linux to run the build scripts in a Windows OS.
- Make sure port 8080 is not used by another process.
- View the container startup with the docker logs command.
- Monitor the containers with docker ps.
- After the aws-mwaa-local-runner container is up and healthy, you can access Airflow by navigating to http://localhost:8080.
- Username/password: admin/test
Please also review the Using the Metadata Backup and Restore DAGs Independently section on how to run the backup and restore locally for testing the `mwaa_dr` framework.