Amazon Managed Workflows for Apache Airflow (MWAA) is a managed orchestration service for Apache Airflow. An MWAA deployment comes with meaningful defaults, such as multi-availability-zone (AZ) deployment of Airflow schedulers and auto-scaling of Airflow workers across multiple AZs, all of which help customers minimize the impact of an AZ failure. However, a regional large-scale event (LSE) can still adversely affect the business continuity of critical workflows running on an MWAA environment. To minimize the impact of LSEs, you need a multi-region architecture that automatically detects service disruption in the primary region and automates cut-over to the secondary region. This project offers an automated solution for two key disaster recovery strategies for MWAA: Backup and Restore and Warm Standby. Let's review the solution architectures and dive deep into the two strategies next.
This solution is part of an AWS blog series on MWAA Disaster Recovery. Please review both Part 1 and Part 2 of the series before diving into the details of the solution.
[!NOTE] The project currently supports the following versions of MWAA:
- 2.8.1
- 2.7.2
- 2.6.3
- 2.5.1
- 2.4.3
In this section, we discuss two highly resilient, multi-region deployment architectures for MWAA. These architectures can achieve recovery time and recovery point objectives ranging from minutes (Warm Standby) to an hour (Backup and Restore), depending on the volume of historical data to be backed up and restored. Let's discuss the two strategies in detail next.
The general idea behind the backup and restore approach is to have the MWAA environment running in the primary region periodically backup its metadata to an S3 bucket in that region, sync the metadata to the secondary region's S3 bucket, and eventually use the backed up metadata to recreate an identical environment in the secondary region when the primary region fails. This approach can afford an RTO of 30+ minutes depending on the size of metadata to be restored. We assume that you have a running MWAA environment with the associated S3 bucket for hosting DAGs to start with. There are two key workflows to consider in this architecture as shown in the diagram below:
To recreate a new environment in the secondary region when the primary environment fails, you have to maintain a backup of the primary metadata store. Flow 1 involves an Airflow DAG that takes a backup of the metadata tables and stores them in an S3 bucket, which is used to restore the MWAA state in the secondary region when needed.
[1.a] Assuming you host your DAGs code in a source code repository, your CICD pipeline deploys the code changes to the S3 bucket configured to host DAGs, plugins, and the requirements file.
[1.b] For this architecture, we also assume that you have another S3 bucket deployed to the secondary region to host DAGs, plugins, and the requirements file with bucket versioning enabled. As a part of the CDK deployment of this project, we enable cross-region replication from primary to the secondary region buckets. Additionally, a StepFunctions workflow is triggered during the primary stack deployment to perform a one time replication of existing objects from primary DAGs bucket to the secondary DAGs bucket. Any new changes to the primary DAGs bucket are automatically replicated in the secondary region.
[1.c] The CDK deployment of the primary stack deploys the mwaa_dr framework package to the primary DAGs S3 bucket. This framework includes the backup_metadata DAG, which periodically takes a backup of the metadata store. The backup interval is configurable and should be based on the recovery point objective (RPO) -- the amount of data loss during a failure that the business can sustain.
[1.d] The CDK deployment of the primary stack also creates a backup S3 bucket to store backups of the metadata tables. The secondary stack creates another backup S3 bucket in the secondary region. Similar to the DAGs S3 bucket, the backup bucket is replicated to the corresponding secondary region bucket using S3 cross-region replication, ensuring that the backup of the Amazon MWAA metadata is available in the secondary region.
BR Flow 1 helps with backing up the state of the primary MWAA environment. Flow 2 detects a failure in the primary environment and triggers the recovery of the MWAA environment in the secondary region. The recovery involves creating a new MWAA environment from the stored configuration of the primary environment (as a part of Flow 2) and eventually rehydrating the new environment with the metadata backed up from the primary environment (Flow 1).
[2.a] The secondary CDK stack deploys a StepFunctions workflow in the secondary region, which is executed periodically based on the supplied EventBridge schedule. The schedule interval should be based on your target RTO, i.e., RTO >= EventBridgeScheduleInterval + EnvironmentCreationTime + MetadataRestoreTime. (For example, with the default 5-minute health-check schedule, roughly 20-30 minutes of MWAA environment creation, and a few minutes of metadata restore, plan for an RTO on the order of 30-40 minutes; actual times vary with environment size and metadata volume.)
[2.b] The workflow, using an AWS Lambda function, retrieves the SchedulerHeartBeat CloudWatch metrics from the primary MWAA environment.
[2.c.1] When the heartbeat signals from the primary MWAA environment are detected in the metrics, the workflow moves on to the subflow that stores the environment configuration, as follows:
[2.d] The workflow makes a GetEnvironment API call through a Lambda function. The API returns the status, among other configuration details, of the MWAA environment.
[2.e] The workflow stores the environment configuration in the backup bucket to be later used during recovery for creating a new MWAA environment in the secondary region (step 2.h) and ends.
[2.c.2] When the heartbeat signals from the primary MWAA environment are not detected in the metrics, the workflow moves on to the recovery subflow as follows (note that it can take up to 5 minutes to detect an environment failure):
[2.f] As a first step of the recovery subflow, the EventBridge schedule is disabled to prevent subsequent duplicate recovery flow.
[2.g] The environment configuration stored in the backup bucket during the previous successful heartbeat check (step 2.e) is read to recreate a new MWAA environment.
[2.h] The workflow then uses a Lambda function to create a new MWAA environment using the CreateEnvironment API and waits for the supplied polling interval. The API call uses the environment configuration from the backup bucket with some changes to incorporate the settings of the secondary region such as the different S3 DAGs bucket, VPC, subnets, and security groups that are supplied as secondary stack configuration.
[2.i] Using GetEnvironment API call in an AWS Lambda function, the workflow gets the status of the environment creation.
[2.j.1] Until the status of the newly created MWAA environment becomes available, the workflow waits for a supplied duration and keeps polling the status in a loop.
[2.j.2] When the MWAA environment becomes available the workflow comes out of the polling loop.
[2.k] The workflow then restores the metadata from the S3 backup bucket into the newly created environment by triggering the restore_metadata DAG. It uses task token integration, in which the workflow waits for the restore DAG to respond with a success or failure notification before completing.
[2.l] The DAG for restoring metadata hydrates the newly created MWAA environment with the metadata stored in the backup bucket and finally, returns a success token back to the StepFunctions workflow, which successfully ends the flow.
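As a rough illustration of the health check in steps 2.b through 2.e, the sketch below queries the primary environment's scheduler heartbeat metric and fetches the environment configuration when heartbeats are present. This is not the solution's actual Lambda code; the metric namespace, dimension names, and environment name below are assumptions for the example.

```python
import boto3
from datetime import datetime, timedelta, timezone

# Illustrative only -- the metric namespace, dimensions, and names are assumptions.
PRIMARY_REGION = "us-east-1"
PRIMARY_ENV_NAME = "mwaa-2-5-1-primary"

cloudwatch = boto3.client("cloudwatch", region_name=PRIMARY_REGION)
mwaa = boto3.client("mwaa", region_name=PRIMARY_REGION)

now = datetime.now(timezone.utc)
stats = cloudwatch.get_metric_statistics(
    Namespace="AmazonMWAA",
    MetricName="SchedulerHeartbeat",
    Dimensions=[
        {"Name": "Environment", "Value": PRIMARY_ENV_NAME},
        {"Name": "Function", "Value": "Scheduler"},
    ],
    StartTime=now - timedelta(minutes=5),
    EndTime=now,
    Period=60,
    Statistics=["Sum"],
)

if stats["Datapoints"]:
    # Heartbeats detected (2.c.1): capture the environment configuration (2.d-2.e).
    environment = mwaa.get_environment(Name=PRIMARY_ENV_NAME)["Environment"]
    print("Primary healthy; configuration captured for", environment["Arn"])
else:
    # No heartbeats (2.c.2): the state machine would branch into the recovery subflow.
    print("No scheduler heartbeat detected; recovery subflow would start.")
```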
In the warm standby approach, we start with two identical MWAA environments, one in the primary and the other in the secondary region. The metadata in the primary region is backed up in an S3 bucket with cross-region replication to a secondary region bucket. In case of a primary MWAA environment failure, the backed up metadata is restored in the secondary MWAA environment to restart the DAG workflows in the secondary region. Since the MWAA environment is already created/warm in the secondary region, this approach can achieve a recovery time objective of 5+ minutes depending on the amount of metadata to be restored. There are two key workflows in this architecture as shown in the diagram below:
In order to restore the primary MWAA environment in the secondary region, you have to maintain a backup of the primary metadata store. Flow 1 involves an Airflow DAG to take backup of the metadata tables and store them in an S3 bucket.
[1.a] Assuming you host your DAGs code in a source code repository, your CICD pipeline deploys the code changes to the S3 bucket configured to host DAGs, plugins, and the requirements file.
[1.b] For this architecture, we also assume that you have another S3 bucket deployed to the secondary region to host DAGs, plugins, and the requirements file with bucket versioning enabled. As a part of the CDK deployment of this project, we enable cross-region replication from primary to the secondary region buckets. Additionally, a StepFunctions workflow is triggered during the primary stack deployment to perform a one time replication of existing objects from primary DAGs bucket to the secondary DAGs bucket. Any new changes to the primary DAGs bucket are automatically replicated in the secondary region.
[1.c] The CDK deployment of the primary stack deploys the mwaa_dr framework package to the primary DAGs S3 bucket. This framework includes the backup_metadata DAG, which periodically takes a backup of the metadata store. The backup interval is configurable and should be based on the recovery point objective (RPO) -- the amount of data loss during a failure that the business can sustain.
[1.d] The CDK deployment of the primary stack also creates a backup S3 bucket to store backups of the metadata tables. The secondary stack creates another backup S3 bucket in the secondary region. Similar to the DAGs S3 bucket, the backup bucket is replicated to the corresponding secondary region bucket using S3 cross-region replication, ensuring that the backup of the Amazon MWAA metadata is available in the secondary region.
As discussed in the previous sections, WS Flow 1 helps back up the metadata of the primary MWAA environment. Flow 2, on the other hand, detects a failure in the primary environment and triggers the recovery of the MWAA environment in the secondary region. The recovery involves rehydrating the standby secondary environment with the metadata backed up from the primary environment.
[2.a] The secondary CDK stack deploys a StepFunctions workflow in the secondary region, which is executed periodically based on the supplied EventBridge schedule. The schedule interval should be based on your target RTO, i.e., RTO >= EventBridgeScheduleInterval + MetadataRestoreTime.
[2.b] The workflow, using an AWS Lambda function, retrieves the SchedulerHeartBeat CloudWatch metrics from the primary MWAA environment.
[2.c.1] When the heartbeat signals from the primary MWAA environment are detected in the metrics, the workflow ends as no further actions are needed.
[2.c.2] When the heartbeat signals from the primary MWAA environment are not detected in the metrics, the workflow executes a recovery subflow as follows:
[2.d] As a first step of the recovery subflow, the EventBridge schedule is disabled to prevent subsequent duplicate recovery flow.
[2.e] The workflow then cleans up the metadata database by triggering the cleanup_metadata DAG. It uses task token integration, in which the workflow waits for the cleanup DAG to respond with a success or failure notification before proceeding. The cleanup before restore is needed to avoid primary key constraint violations in the database.
[2.f] The workflow then restores the metadata from the S3 backup bucket into the standby secondary environment by triggering the restore_metadata DAG. It uses task token integration, in which the workflow waits for the restore DAG to respond with a success or failure notification before completing.
[2.g] The DAG for restoring metadata hydrates the standby MWAA environment with the metadata stored in the backup bucket and finally, returns a success token back to the StepFunctions workflow, which ends the flow successfully.
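To make the task token integration in steps 2.e through 2.g concrete, here is a minimal sketch of how a cleanup or restore DAG task might report back to the waiting Step Functions execution. How the token reaches the DAG (assumed here to arrive via the DAG run conf) and the function name are assumptions for illustration, not the framework's actual code.

```python
import json
import boto3


def notify_workflow(task_token: str, region: str, succeeded: bool, detail: str) -> None:
    """Report the outcome of a cleanup/restore DAG run to Step Functions (illustrative)."""
    sfn = boto3.client("stepfunctions", region_name=region)
    if succeeded:
        sfn.send_task_success(taskToken=task_token, output=json.dumps({"status": detail}))
    else:
        sfn.send_task_failure(taskToken=task_token, error="RestoreFailed", cause=detail)


# Example: the token is assumed to be supplied in the DAG run conf by the workflow.
# notify_workflow(dag_run.conf["task_token"], "us-east-2", True, "metadata restore complete")
```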
The lib folder hosts the deployment code for the project. The project performs multi-region deployment of two stacks:
An AWS account with an MWAA environment deployed to the primary region. If you don't have an environment deployed, you can create one using the quickstart guide.
For Warm Standby, another identical MWAA environment deployed to the secondary region. Note that Backup and Restore does not require a running MWAA environment in the secondary region.
DAGs S3 buckets with versioning enabled in both the primary and secondary regions. Copy the packages in assets/requirements.txt into the requirements files already in the DAGs S3 buckets, or upload the provided requirements file to the buckets and configure the MWAA environments to use it.
The security groups used by the MWAA environments. In the secondary region, this can be the default security group of the VPC if one is not already defined, i.e., in the case of backup and restore. You can find the VPC, security groups, and subnet information of an existing MWAA environment on your AWS console.
An MWAA execution role in each of the two regions with the permissions you need for your DAGs. At a minimum, please include the following permission policies (replace `<region>`, `<account>`, `<mwaa-env-name>`, and `<dags-s3-bucket-name>` with appropriate values):
```json
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "airflow:PublishMetrics",
            "Resource": "arn:aws:airflow:<region>:<account>:environment/<mwaa-env-name>"
        },
        {
            "Effect": "Allow",
            "Action": [
                "s3:GetObject*",
                "s3:GetBucket*",
                "s3:List*"
            ],
            "Resource": [
                "arn:aws:s3:::<dags-s3-bucket-name>",
                "arn:aws:s3:::<dags-s3-bucket-name>/*"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "logs:CreateLogStream",
                "logs:CreateLogGroup",
                "logs:PutLogEvents",
                "logs:GetLogEvents",
                "logs:GetLogRecord",
                "logs:GetLogGroupFields",
                "logs:GetQueryResults"
            ],
            "Resource": [
                "arn:aws:logs:<region>:<account>:log-group:airflow-*"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "logs:DescribeLogGroups",
                "cloudwatch:PutMetricData"
            ],
            "Resource": [
                "*"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "sqs:ChangeMessageVisibility",
                "sqs:DeleteMessage",
                "sqs:GetQueueAttributes",
                "sqs:GetQueueUrl",
                "sqs:ReceiveMessage",
                "sqs:SendMessage"
            ],
            "Resource": "arn:aws:sqs:<region>:*:airflow-celery-*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "kms:Decrypt",
                "kms:DescribeKey",
                "kms:GenerateDataKey*",
                "kms:Encrypt"
            ],
            "NotResource": "arn:aws:kms:*:<account>:key/*",
            "Condition": {
                "StringLike": {
                    "kms:ViaService": [
                        "sqs.<region>.amazonaws.com"
                    ]
                }
            }
        }
    ]
}
```
Also add the following trust policy to the role:
```json
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Service": [
                    "airflow.amazonaws.com",
                    "airflow-env.amazonaws.com"
                ]
            },
            "Action": "sts:AssumeRole"
        }
    ]
}
```
The parameters for the solution are externalized as environment variables. You can specify these parameters as environment variables in your CICD pipeline or create a `.env` file with appropriate keys and values at the root of this project for a deployment from your machine. You can find more details in the implementation sections BR-3: Setup Environment Variables and WS-3: Setup Environment Variables. Let's review the required parameters first, followed by the optional ones.
Here are the required parameters that apply to both the primary and secondary region stacks:
| Variable Name | Example Values | Description |
|---|---|---|
| `AWS_ACCOUNT_ID` | `111222333444` | Your AWS account id. |
| `DR_TYPE` | `BACKUP_RESTORE`, `WARM_STANDBY` | The disaster recovery strategy to be deployed. |
| `MWAA_UPDATE_EXECUTION_ROLE` | `YES` or `NO` | Flag to denote whether to update the existing MWAA execution role with new policies for allowing task token return calls from the StepFunctions workflow in the secondary stack. See the Automated Updates to the Execution Role for details. |
| `MWAA_VERSION` | `2.4.3`, `2.5.1`, `2.6.3`, `2.7.2`, `2.8.1` | The deployed version of MWAA. |
| `PRIMARY_DAGS_BUCKET_NAME` | `mwaa-2-5-1-primary-bucket` | The name of the DAGs S3 bucket used by the environment in the primary region. |
| `PRIMARY_MWAA_ENVIRONMENT_NAME` | `mwaa-2-5-1-primary` | The name of the MWAA environment in the primary region. |
| `PRIMARY_MWAA_ROLE_ARN` | `arn:aws:...:role/service-role/primary-role` | The ARN of the execution role used by the primary MWAA environment. |
| `PRIMARY_REGION` | `us-east-1`, `us-east-2`, ... | The primary AWS region. |
| `PRIMARY_SECURITY_GROUP_IDS` | `'["sg-0123456789"]'` | The IDs of the security groups used by the primary MWAA environment. Note that the brackets, `[]`, are necessary to denote a list even for a single-element list. |
| `PRIMARY_SUBNET_IDS` | `'["subnet-1234567", "subnet-987654321"]'` | The IDs of the VPC subnets where the primary MWAA environment is deployed. Note that the brackets, `[]`, are necessary to denote a list even for a single-element list. |
| `PRIMARY_VPC_ID` | `vpc-012ab34c56d789101` | The ID of the VPC where the primary MWAA environment is deployed. |
| `SECONDARY_CREATE_SFN_VPCE` | `YES` or `NO` | Flag to denote whether to create a VPC endpoint for Step Functions. The VPCE is particularly important for MWAA running in private mode, where workers may not have internet access to send the task token response to the Step Functions workflow orchestrating the restore. If `NO` is chosen, then you will need to manually create the VPC endpoint. Enabling this flag may modify your VPC's security group. See the Automated Update to the VPC Security Group for details. |
| `SECONDARY_DAGS_BUCKET_NAME` | `mwaa-2-5-1-secondary-bucket` | The name of the DAGs S3 bucket used by the environment in the secondary region. |
| `SECONDARY_MWAA_ENVIRONMENT_NAME` | `mwaa-2-5-1-secondary` | The name of the MWAA environment in the secondary region. |
| `SECONDARY_MWAA_ROLE_ARN` | `arn:aws:...:role/service-role/secondary-role` | The ARN of the execution role used by the secondary MWAA environment. |
| `SECONDARY_REGION` | `us-west-1`, `us-west-2`, ... | The secondary AWS region for disaster recovery. |
| `SECONDARY_SECURITY_GROUP_IDS` | `'["sg-0123456789"]'` | The IDs of the security groups used by the secondary MWAA environment. Note that the brackets, `[]`, are necessary to denote a list even for a single-element list. |
| `SECONDARY_SUBNET_IDS` | `'["subnet-1234567", "subnet-987654321"]'` | The IDs of the VPC subnets in the secondary region where the MWAA environment is deployed. Note that the brackets, `[]`, are necessary to denote a list even for a single-element list. |
| `SECONDARY_VPC_ID` | `vpc-012ab34c56d789101` | The ID of the VPC where the secondary MWAA environment is deployed. |
| `STACK_NAME_PREFIX` | `mwaa-2-5-1-data-team` | A name prefix for the deployment stacks. This prefix will be used for primary and secondary stacks as well as their resources. |
Here are the optional parameters that apply to both the primary and secondary region stacks:
| Variable Name | Default Value | Example Values | Description |
|---|---|---|---|
| `DR_CONNECTION_RESTORE_STRATEGY` | `APPEND` | `DO_NOTHING`, `APPEND`, or `REPLACE` | The strategy to use to restore the connection table during the recovery workflow. Review Special Handling of Variable and Connection Tables for details. |
| `DR_VARIABLE_RESTORE_STRATEGY` | `APPEND` | `DO_NOTHING`, `APPEND`, or `REPLACE` | The strategy to use to restore the variable table during the recovery workflow. Review Special Handling of Variable and Connection Tables for details. |
| `HEALTH_CHECK_ENABLED` | `YES` | `YES` or `NO` | Whether to enable periodic health checks of the primary MWAA environment from the secondary region. If set to `NO`, a primary region failure will go undetected and the onus is on admins to manually trigger the recovery workflow. |
| `HEALTH_CHECK_INTERVAL_MINS` | `5` | time interval in minutes | Health check frequency of the primary MWAA environment in minutes. |
| `HEALTH_CHECK_MAX_RETRY` | `2` | number | The maximum number of retries after the health check of the primary region MWAA fails before moving on to the disaster recovery flow. |
| `HEALTH_CHECK_RETRY_BACKOFF_RATE` | `2` | number | Health check retry exponential backoff rate (exponential backoff common ratio). |
| `HEALTH_CHECK_RETRY_INTERVAL_SECS` | `5` | time interval in seconds | Health check retry interval (exponential backoff coefficient) on failure. |
| `METADATA_CLEANUP_DAG_NAME` | `cleanup_metadata` | a DAG name | Name of the DAG that cleans up the metadata store. |
| `METADATA_EXPORT_DAG_NAME` | `backup_metadata` | a DAG name | Name of the DAG that exports metadata. |
| `METADATA_IMPORT_DAG_NAME` | `restore_metadata` | a DAG name | Name of the DAG that imports metadata. |
| `MWAA_BACKUP_FILE_NAME` | `environment.json` | a JSON file name | Name of the JSON file used for storing environment details in the backup S3 bucket. |
| `MWAA_CREATE_ENV_POLLING_INTERVAL_SECS` | `60` | interval in seconds | Wait time before checking the status of the MWAA environment in the polling loop during creation. |
| `MWAA_DAGS_S3_PATH` | `dags` | `path/to/dags` | Path to the folder in the DAGs S3 bucket where DAGs are deployed. |
| `MWAA_NOTIFICATION_EMAILS` | `[]` | `'["ad@eg.com"]'`, `'["ad@eg.com", "ops@eg.com"]'` | Comma-separated list of emails. Note that the brackets, `[]`, are necessary to denote a list even for a single-element list. |
| `MWAA_SIMULATE_DR` | `NO` | `YES` or `NO` | Whether to simulate a DR by artificially forcing a health check failure for the MWAA environment in the primary region. Only use for testing. |
| `PRIMARY_BACKUP_SCHEDULE` | `'0 * * * *'` | `@hourly`, `@daily`, or any cron expression | Cron schedule for taking a backup of the metadata store. |
| `PRIMARY_REPLICATION_POLLING_INTERVAL_SECS` | `30` | wait time in seconds | The polling interval in seconds for checking the status of the one-time replication job during primary stack deployment. |
| `SECONDARY_CLEANUP_COOL_OFF_SECS` | `30` | wait time in seconds | The cool-off time in seconds between the metadata store cleanup operation and the restore operation in the recovery workflow. |
| `STATE_MACHINE_TIMEOUT_MINS` | `60` | timeout in minutes | The restore Step Functions workflow timeout in minutes. |
Note that the secondary region stack will add an additional policy statement to the MWAA execution role for the secondary region if the configuration parameter `MWAA_UPDATE_EXECUTION_ROLE` is set to `YES`. If you intend to set this parameter to `NO`, then please add the following policy entry to the secondary MWAA execution role:
```json
{
    "Effect": "Allow",
    "Action": [
        "states:SendTaskFailure",
        "states:SendTaskHeartbeat",
        "states:SendTaskSuccess"
    ],
    "Resource": ["arn:aws:states:*:<account>:stateMachine:*"]
}
```
Note that if you supplied a VPC security group for your MWAA environment and the security group does not allow inbound HTTPS traffic (port 443) originating from within the VPC CIDR range, then the stack will add a new rule to the security group to allow it. The HTTPS traffic is required for the Step Functions interface endpoint, which makes Step Functions accessible to your private network through AWS PrivateLink.
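For reference, the rule the stack adds is equivalent to the following sketch; if you prefer to manage the security group yourself, you can add it manually. The security group ID, region, and CIDR below are placeholders, not values from the solution.

```python
import boto3

# Illustrative only -- replace the group ID and CIDR with your secondary VPC values.
ec2 = boto3.client("ec2", region_name="us-east-2")
ec2.authorize_security_group_ingress(
    GroupId="sg-0123456789",  # MWAA security group (placeholder)
    IpPermissions=[{
        "IpProtocol": "tcp",
        "FromPort": 443,
        "ToPort": 443,
        "IpRanges": [{
            "CidrIp": "10.0.0.0/16",  # VPC CIDR (placeholder)
            "Description": "HTTPS to the Step Functions interface endpoint",
        }],
    }],
)
```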
The project uses the Cloud Development Kit (CDK) and is set up like a standard Python project. Assuming that you have AWS credentials for deploying the project set up for your command shell, follow these steps to build and deploy the solution to your AWS account.
If your account has not been set up to use CDK yet, you will need to perform a one-time CDK bootstrapping for both the primary and secondary regions using the following command: `cdk bootstrap aws://<account>/<primary-region> aws://<account>/<secondary-region>`. Here's an example:
cdk bootstrap aws://123456789999/us-east-1 aws://123456789999/us-east-2
Let's clone the project in your local machine as follows:
git clone https://github.com/aws-samples/mwaa-disaster-recovery.git
cd mwaa-disaster-recovery
This deployment guide walks through first deploying the stack in backup and restore mode followed by warm standby.
If you don't already have an MWAA environment, use the quickstart guide or follow these steps to create a new MWAA environment:

- Create an S3 bucket with versioning enabled in the primary region to host your DAGs, let's call it `mwaa-2-5-1-primary-source` (you will probably need to specify a different name as S3 bucket names must be globally unique).
- For the primary MWAA environment, let's call it `mwaa-2-5-1-primary`, create an IAM role as documented in the AWS Resources pre-requisites section.

You will need a virtualenv created within the project, which is stored under the `.venv` directory. To create the virtualenv, it is assumed that there is a `python3` executable in your path with access to the `venv` package. Create your virtualenv as follows:
python3 -m venv .venv
Next, you will need to activate your virtual environment.
MacOS / Linux:
source .venv/bin/activate
Windows:
.venv\Scripts\activate.bat
Once the virtualenv is activated, you will need to install the required dependencies:
pip install -r requirements.txt
pip install -r requirements-dev.txt
Create a `.env` file at the root of the project by copying the following contents and making appropriate changes. The configuration parameters are explained in the stack parameters section.
STACK_NAME_PREFIX=mwaa-2-5-1
AWS_ACCOUNT_ID=123456789101
DR_TYPE=BACKUP_RESTORE
MWAA_VERSION=2.5.1
MWAA_UPDATE_EXECUTION_ROLE=YES
PRIMARY_REGION=us-east-1
PRIMARY_MWAA_ENVIRONMENT_NAME=mwaa-2-5-1-primary
PRIMARY_MWAA_ROLE_ARN=arn:aws:iam::123456789101:role/service-role/mwaa-2-5-1-primary-role
PRIMARY_DAGS_BUCKET_NAME=mwaa-2-5-1-primary-source
PRIMARY_VPC_ID=vpc-012ab34c56d789101
PRIMARY_SUBNET_IDS='["subnet-1234567", "subnet-987654321"]'
PRIMARY_SECURITY_GROUP_IDS='["sg-0123456789"]'
SECONDARY_REGION=us-east-2
SECONDARY_MWAA_ENVIRONMENT_NAME=mwaa-2-5-1-secondary
SECONDARY_MWAA_ROLE_ARN=arn:aws:iam::123456789101:role/service-role/mwaa-2-5-1-secondary-role
SECONDARY_DAGS_BUCKET_NAME=mwaa-2-5-1-secondary-source
SECONDARY_VPC_ID=vpc-1111222233334444
SECONDARY_SUBNET_IDS='["subnet-2222222", "subnet-3333333"]'
SECONDARY_SECURITY_GROUP_IDS='["sg-111222333444"]'
SECONDARY_CREATE_SFN_VPCE=YES
At this point, you can synthesize the CloudFormation template for this code:
cdk synth
You can also see what stacks and resources get created by typing:
cdk diff
Now you are ready to deploy the stacks. The following command deploys both the primary and the secondary region stacks:
cdk deploy --all
From the MWAA console, explore the Airflow UI; it should have the following DAGs available:

Feel free to upload additional DAGs and play around to generate some metadata for the backup and restore process. Here is a sample DAG that you can upload to the `dags` folder of your DAGs S3 bucket.
While the backups are taken automatically based on the supplied schedule, you can manually trigger the `backup_metadata` DAG to force-generate a backup in the corresponding backup (not DAGs) S3 bucket. Explore the `data` folder in the backup S3 bucket to review the CSV dumps generated by the backup DAG. The stack automatically replicates both the DAGs and backup S3 buckets to the secondary region.
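To verify that backups are landing where expected, you can list the objects under the `data/` prefix of the backup bucket; the bucket name below is a placeholder, so substitute the backup bucket created by your stack.

```python
import boto3

# Placeholder bucket name -- use the backup bucket created by the primary stack.
s3 = boto3.client("s3")
response = s3.list_objects_v2(Bucket="mwaa-2-5-1-primary-backup-bucket", Prefix="data/")
for obj in response.get("Contents", []):
    print(f"{obj['Key']}  ({obj['Size']} bytes, last modified {obj['LastModified']})")
```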
You can simulate a DR situation by enabling the `MWAA_SIMULATE_DR` parameter in your `.env` file as follows:
MWAA_SIMULATE_DR=YES
STACK_NAME_PREFIX=mwaa-2-5-1
AWS_ACCOUNT_ID=123456789101
...
Now re-deploy the project:
cdk deploy --all
On the AWS console, monitor the Step Functions workflow deployed as a part of the secondary region stack, which will orchestrate creating a new environment in the secondary region and eventually restoring the backup data to the newly created environment.
The Airflow UI should show past DAG runs as well as logs, variables, and connections restored from the primary MWAA environment. All the active DAGs in the primary region should also be active in the secondary region.
The Warm Standby approach needs an MWAA environment in each of the two AWS regions. If you don't already have MWAA environments, use the quickstart guide or follow these steps to create a new MWAA environment in each of the two AWS regions:
- Create an S3 bucket with versioning enabled in the primary region to host your DAGs, let's call it `mwaa-2-5-1-primary-source` (you will probably need to specify a different name as S3 bucket names must be globally unique). Similarly, create another bucket with versioning enabled in the secondary region, let's call it the `mwaa-2-5-1-secondary-source` bucket.
- For the primary MWAA environment, let's call it `mwaa-2-5-1-primary`, create an IAM role as documented in the AWS Resources pre-requisites section. You will need two roles, one for each MWAA environment in the two regions.

You will need a virtualenv created within the project, which is stored under the `.venv` directory. To create the virtualenv, it is assumed that there is a `python3` executable in your path with access to the `venv` package. Create your virtualenv as follows:
python3 -m venv .venv
Next, you will need to activate your virtual environment.
MacOS / Linux:
source .venv/bin/activate
Windows:
.venv\Scripts\activate.bat
Once the virtualenv is activated, you can install the required dependencies:
pip install -r requirements.txt
pip install -r requirements-dev.txt
Create a `.env` file at the root of the project by copying the following contents and making appropriate changes. The configuration parameters are explained in the stack parameters section.
STACK_NAME_PREFIX=mwaa-2-5-1
AWS_ACCOUNT_ID=123456789101
DR_TYPE=WARM_STANDBY
MWAA_VERSION=2.5.1
MWAA_UPDATE_EXECUTION_ROLE=YES
PRIMARY_REGION=us-east-1
PRIMARY_MWAA_ENVIRONMENT_NAME=mwaa-2-5-1-primary
PRIMARY_MWAA_ROLE_ARN=arn:aws:iam::123456789101:role/service-role/mwaa-2-5-1-primary-role
PRIMARY_DAGS_BUCKET_NAME=mwaa-2-5-1-primary-source
PRIMARY_VPC_ID=vpc-012ab34c56d789101
PRIMARY_SUBNET_IDS='["subnet-1234567", "subnet-987654321"]'
PRIMARY_SECURITY_GROUP_IDS='["sg-0123456789"]'
SECONDARY_REGION=us-east-2
SECONDARY_MWAA_ENVIRONMENT_NAME=mwaa-2-5-1-secondary
SECONDARY_MWAA_ROLE_ARN=arn:aws:iam::123456789101:role/service-role/mwaa-2-5-1-secondary-role
SECONDARY_DAGS_BUCKET_NAME=mwaa-2-5-1-secondary-source
SECONDARY_VPC_ID=vpc-1111222233334444
SECONDARY_SUBNET_IDS='["subnet-2222222", "subnet-3333333"]'
SECONDARY_SECURITY_GROUP_IDS='["sg-111222333444"]'
SECONDARY_CREATE_SFN_VPCE=YES
At this point, you can synthesize the CloudFormation template for this code:
cdk synth
You can also see what stacks and resources get created by typing:
cdk diff
Now you are ready to deploy the stacks. The following command deploys both the primary and the secondary region stacks:
cdk deploy --all
From the MWAA console, explore the Airflow UI; it should have the following DAGs available:

Feel free to upload additional DAGs and play around to generate some metadata for the backup and restore process. Here is a sample DAG that you can upload to the `dags` folder of your DAGs S3 bucket.
While the backups are taken automatically based on the supplied schedule, you can manually trigger the `backup_metadata` DAG to force-generate a backup in the corresponding backup (not DAGs) S3 bucket. Explore the `data` folder in the backup S3 bucket to review the CSV dumps generated by the backup DAG. The stack automatically replicates both the DAGs and backup S3 buckets to the secondary region.
You can simulate a DR situation by setting the `MWAA_SIMULATE_DR` parameter in your `.env` file as follows:
MWAA_SIMULATE_DR=YES
STACK_NAME_PREFIX=mwaa-2-5-1
AWS_ACCOUNT_ID=123456789101
...
Now re-deploy the project:
cdk deploy --all
On the AWS console, monitor the Step Functions workflow deployed as a part of the secondary region stack, which will orchestrate restoring the backup data to the existing secondary MWAA environment.
The Airflow UI should show past DAG runs as well as logs, variables, and connections restored from the primary MWAA environment. All the active DAGs in the primary region should also be active in the secondary region.
You can clean up the resources deployed through this solution by simply deleting the stacks as follows:
cdk destroy --all
[!CAUTION] Destroying the stacks will also delete the backup S3 buckets in both primary and secondary regions. The DAGs S3 buckets in both regions will remain intact, and the dags/mwaa_dr folder in both buckets will need to be manually deleted. For the backup and restore strategy, the environment created as a result of the restore workflow in the secondary region will also need to be manually deleted, either on the AWS Console or through the AWS CLI.
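If you want to script the leftover cleanup, a sketch along these lines removes the dags/mwaa_dr prefix from a DAGs bucket; the bucket name is a placeholder and the prefix assumes the default MWAA_DAGS_S3_PATH of dags.

```python
import boto3

# Placeholder bucket name; repeat for both the primary and secondary DAGs buckets.
s3 = boto3.resource("s3")
bucket = s3.Bucket("mwaa-2-5-1-primary-source")
bucket.objects.filter(Prefix="dags/mwaa_dr/").delete()
```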
The project offers a custom solution to address the disaster recovery needs of Amazon MWAA. Since it is a non-native solution, there are some important limitations to be aware of:
The project only takes a metadata backup of the tasks that are not actively running in the primary environment, i.e., it excludes task instances in any of the `running`, `restarting`, `queued`, `scheduled`, `up_for_retry`, and `up_for_reschedule` states. Hence, the solution cannot restore an actively running DAG in the secondary environment. If the primary environment fails while actively running some DAGs, then those DAGs will restart at their next specified schedules after cut-over to the secondary environment. If those DAGs do not have schedules specified, then the admins will need to manually trigger them in the secondary location.
[!CAUTION] As a side effect of the aforementioned strategy, the metadata of the most recent backup_metadata DAG run will be excluded from the backup, as that DAG is in an active state while it is taking the backup of the metadata.
[!IMPORTANT] Note that, by default, the solution backs up only the `variable`, `connection`, `slot_pool`, `log`, `job`, `dag_run`, `trigger`, `task_instance`, `task_fail`, and `xcom` tables. The majority of other tables are auto-generated by the scheduler or the webserver and are thus excluded from the list of tables to be backed up. You can add or remove the tables to be backed up by simply returning a custom list in the `dr_factory.setup_tables()` method corresponding to your MWAA version in the codebase. By default, all DR factories are chained by class inheritance with the base class, DRFactory_2_5.
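As a rough illustration only (the import path and `setup_tables()` signature below are assumptions, so check the factory classes in the codebase for the real API), a customized factory might subclass the version-specific factory and filter the table list:

```python
# Illustrative sketch: the module path, constructor, and setup_tables signature are assumptions --
# refer to the mwaa_dr factory classes in the codebase for the actual API.
from mwaa_dr.v_2_5.dr_factory import DRFactory_2_5  # hypothetical import path


class CustomDRFactory(DRFactory_2_5):
    def setup_tables(self, model):
        # Start from the default list of tables and drop the ones you do not need.
        tables = super().setup_tables(model)
        return [table for table in tables if table.name != "xcom"]  # attribute name assumed
```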
The most recent backup of the primary environment will always override the metadata of the secondary environment except for the `variable` and `connection` tables. These tables may need to be handled specially, and the solution supports three different restore strategies for them as follows:
DO_NOTHING: As the name suggests, this strategy will not restore the variable and connection tables from the backup. This strategy is particularly useful if your MWAA environments have been configured to use AWS Secrets Manager for storing variables and connections, and is especially applicable to the warm standby deployment.
APPEND: In many cases, the secondary Amazon MWAA environment will likely need to interact with different data sources and web services running in the secondary region. Hence, with this strategy, the restore workflow will not overwrite existing entries of the variable and connection tables in the secondary MWAA environment from the backup. This is the default strategy for the warm standby deployment.
REPLACE: This strategy can be used to overwrite existing variables and connections from the backup. This is the default strategy for the backup and restore deployment.
The solution automatically reads these configurations from your `.env` file or environment variables during deployment. To change the default restore behavior for the `variable` and `connection` tables, you will need to supply an appropriate value for `DR_VARIABLE_RESTORE_STRATEGY` and `DR_CONNECTION_RESTORE_STRATEGY`, respectively. Here is an example `.env` file for a warm standby deployment:
DR_VARIABLE_RESTORE_STRATEGY=DO_NOTHING
DR_CONNECTION_RESTORE_STRATEGY=DO_NOTHING
STACK_NAME_PREFIX=mwaa-2-5-1
AWS_ACCOUNT_ID=123456789101
...
[!NOTE] Please note that the backup and restore deployment only supports the `DO_NOTHING` and `REPLACE` strategies, whereas the warm standby deployment supports all three.

[!IMPORTANT] For using the mwaa_dr framework independently of the DR solution, you will need to similarly set the `DR_VARIABLE_RESTORE_STRATEGY` and `DR_CONNECTION_RESTORE_STRATEGY` Airflow variables. Note that these two Airflow variables are treated specially and are unaffected by the restore process. In their absence, the default value of `APPEND` is used during the restore workflow in the independent use.
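When running the DAGs independently, one way to set these two Airflow variables is from a small one-off script or DAG task using the standard Airflow API (you can equally create them in the Airflow UI):

```python
from airflow.models import Variable

# Pin the restore strategies for standalone use of the mwaa_dr DAGs.
Variable.set("DR_VARIABLE_RESTORE_STRATEGY", "DO_NOTHING")
Variable.set("DR_CONNECTION_RESTORE_STRATEGY", "APPEND")
```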
The solution backs up the `variable`, `connection`, `slot_pool`, `log`, `job`, `dag_run`, `trigger`, `task_instance`, `task_fail`, and `xcom` tables by default during the backup workflow in the primary region. If any of these tables are non-empty during a recovery workflow in the secondary region, then you will encounter database key constraint violations in the metadata store. To avoid this issue, the Warm Standby workflow automatically cleans up the secondary region MWAA metadata using the cleanup_metadata DAG during execution.
There might be an organizational need to manually trigger the recovery workflow rather than relying on the Amazon EventBridge schedule that runs the health check (and the recovery workflow when the health check fails) every 5 minutes by default. To disable this periodic health check and automated recovery flow, set the `HEALTH_CHECK_ENABLED` environment variable to `NO` in the `.env` file locally or in the environment variable configuration of your CI/CD pipeline. Here is a sample snippet of the expected `.env` file:
HEALTH_CHECK_ENABLED=NO
STACK_NAME_PREFIX=mwaa-2-5-1
AWS_ACCOUNT_ID=123456789101
# ... elided for brevity
To manually trigger the recovery workflow, find the Step Functions workflow in the secondary region stack and start a new execution by supplying the following input:
```json
{
  "simulate_dr": "YES"
}
```
This is also a great way to manually test your disaster recovery setup!
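If you prefer to script the manual trigger instead of using the console, a boto3 call along these lines starts the execution; the state machine ARN is a placeholder to be looked up from the secondary region stack outputs.

```python
import json
import boto3

# Placeholder ARN -- find the recovery state machine in the secondary region stack outputs.
sfn = boto3.client("stepfunctions", region_name="us-east-2")
sfn.start_execution(
    stateMachineArn="arn:aws:states:us-east-2:123456789101:stateMachine:<recovery-workflow-name>",
    input=json.dumps({"simulate_dr": "YES"}),
)
```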
You might need to perform only the backup and restore operations without the full DR solution. You can run the backup and restore DAGs independently in two modes:
For production use in a public web server mode, we recommend using the published mwaa_dr library to create the necessary DAG for backup and restore in your MWAA environment.
For a private webserver mode, you can copy the assets/dags/mwaa_dr folder to your S3 bucket's `dags` folder. Also, copy the contents of requirements.txt to the MWAA requirements file.
For both modes, please make sure of the following:

- Create an Airflow variable with the key `DR_BACKUP_BUCKET` and the value containing the name (not ARN) of the S3 bucket to use for storing backups.
- Note that backups will be stored at `<backup S3 bucket>/<path_prefix>`.
For testing the `mwaa_dr` library itself, you can run the aws-mwaa-local-runner container locally by simply copying the assets/dags/mwaa_dr folder into the `dags` folder of the local runner codebase. Also, copy the contents of requirements.txt to the local runner's requirements file. Finally, export an Airflow variable in the `startup_script/startup.sh` file of the local runner as follows:
export AIRFLOW_VAR_DR_STORAGE_TYPE=LOCAL_FS
After the setup, you are all set to run the backup_metadata and restore_metadata DAGs. The metadata will be stored to and restored from the `dags/data/` folder of the aws-mwaa-local-runner codebase.
[!IMPORTANT] Note that this is a great way to test support for a new version of MWAA.
If you have plugins that rely on variables and connections, particularly for the Backup and Restore approach, you may need to manually restart the MWAA environment after the restore is complete for the solution to work. The plugins get loaded in the secondary MWAA environment immediately after it is created, before the variables and connections can be restored, thus breaking your plugins' dependencies. Restarting the environment will help mitigate this issue.
This section documents some of the frequently asked questions about the solution:
Question:

I am trying to test the Backup and Restore DR solution. I have set `MWAA_SIMULATE_DR=YES`, but I am getting `S3.S3Exception` with status code `403 - Access Denied` in the `Read Environment Backup` state as follows:
Answer:
For the restore workflow to work, you must have one successful run of the workflow that follows the alternative path after the `Check Heartbeat` state, where it gets the primary environment details (`Get Environment Details`) and stores the configuration in S3 (`Store Environment Details`). In the absence of this configuration file, the `Read Environment Backup` state will fail with an error.
To resolve this issue, redeploy your stack with `MWAA_SIMULATE_DR=NO` and wait for the workflow to finish successfully. This run will store the primary environment configuration in the secondary backup S3 bucket. Now redeploy your stack with `MWAA_SIMULATE_DR=YES`.
Question:
I am trying to test the Backup and Restore DR solution. I have set `MWAA_SIMULATE_DR=YES`, but I am getting the following `ValidationException` in the `Create New Environment` state:
An error occurred (ValidationException) when calling the CreateEnvironment operation: Unable to access version <version-string-secondary> of <secondary-region-dags-bucket>/requirements.txt
Answer:

This issue occurs when the version of the `requirements.txt` file in the secondary region DAGs bucket does not match that of the primary region DAGs bucket.
To resolve this issue, please follow these steps:

1. Modify the `create_replication_job_custom_resource` function in mwaa_primary_stack to replace `on_create` with `on_update`.
2. Redeploy the stacks with `MWAA_SIMULATE_DR=NO` and wait for the StepFunctions workflow in the secondary region stack to finish successfully at least once. This will ensure that the latest primary environment configuration is stored in the secondary region backup bucket for future use.
3. Redeploy the stacks with `MWAA_SIMULATE_DR=YES`, which should now pick up the right version of the requirements file from the secondary DAGs bucket.
4. Finally, revert the change by replacing `on_update` with `on_create` in the `create_replication_job_custom_resource` function.

The stack deployment triggers a StepFunctions workflow that replicates existing objects from the primary S3 DAGs bucket to the secondary bucket:
The contributing guide explains the process of forking the project before creating a pull request. After you have cloned your forked repository locally and made some code changes, please ensure that you have run the following commands from the build.sh script:
python3 -m venv venv # Create venv
source ./venv/bin/activate # Activate venv
pip install -r requirements.txt # Install requirements.txt
pip install -r requirements-dev.txt # Install requirements-dev.txt
./build.sh lint # To run linting
./build.sh unit # To run unit tests
Please ensure that the code coverage has not decreased after your changes before creating a pull request.
Please clone the `aws-mwaa-local-runner` repository with:
git clone https://github.com/aws/aws-mwaa-local-runner.git
./build.sh setup <VERSION>
[!IMPORTANT]
- You will need to have Docker running for both unit and integration tests to work.
- You may need to enable Windows Subsystem for Linux to run the build scripts in a Windows OS.
- Make sure port 8080 is not used by another process.
- View the container startup with the docker logs command.
- Monitor the containers with docker ps.
- After the aws-mwaa-local-runner container is up and healthy, you can access Airflow by navigating to http://localhost:8080.
- Username/password: admin/test
Please also review the Using the Metadata Backup and Restore DAGs Independently section on how to run the backup and restore locally for testing the `mwaa_dr` framework.