[SUPPORT] Disaster Recovery (DR) Setup? Questions.

WTa-hash commented 2 years ago

Are there any tips or support on setting up a Disaster Recovery (DR) environment with Apache Hudi?

We are creating our Datalake, stored on AWS S3, by running a Spark structured streaming application on AWS EMR. The Spark application is processing incoming data from a AWS Kinesis stream and saving them into Hudi tables on S3 and syncing with the AWS Glue catalog. All of this happens in a single AWS region (us-east-1).

In the event where we need to failover to a different region or our main region (us-east-1) goes down, what is the suggested approach to get start up again in another AWS region with our existing Datalake data? We can set up S3 replication to replicate the parquet files (and .hoodie files) to another S3 bucket residing in a different AWS region, but S3 replication happens asynchronously which means files may get replicated out of order and cause issues when querying (due to possible missing files). We will need to look at how to replicate Glue databases/tables from 1 AWS region into another, so that other AWS services and/or query engines can query.

Would love to hear some ideas/thoughts :) Trying to workaround an issue like this: https://www.datacenterdynamics.com/en/news/aws-us-east-1-outage-brings-down-services-around-the-world/

Environment Description

Hudi version : 0.7.0-amzn-1
Spark version : 2.4.7
Hive version : 2.3.7
Hadoop version : 2.10.1
Storage (HDFS/S3/GCS..) : S3
Running on Docker? (yes/no) : no

WTa-hash commented 2 years ago

I ran a test where I deleted some .parquet files from a Hudi table (stored on S3) to simulate S3 replication lagging behind in the copy process and therefore causing the target S3 bucket to have missing data files. Then used AWS Athena to query the Hudi table with missing data files.

Result:

Query ran successfully with missing data files (.parquet) - AWS Athena did not error out.
The returned results were incorrect (count and data returned did not match with original Datalake). This makes sense because .parquet data files are missing.

Next, I wonder what happens if S3 replication is slow and it's not able to replicate the files in .hoodie folder fast enough - what happens when I run SQL queries on a Hudi table with missing .hoodie files?

kazdy commented 2 years ago

There is a preview available for AWS Backup for S3, this might be interesting to use in the future. AWS claims that you can do point in time recovery with it: Continuous backups create point-in-time backups, and allow you to restore S3 resources to any point-in-time within the last 35 days.

WTa-hash commented 2 years ago

Thanks @kazdy for your input. I read the documentation for AWS Backup and S3 support.

Based on the documentation, I am assuming we can do the following (using the us-east-1 issue that occurred on 2021-DEC-7, we want to use latest Hudi tables from us-east-1 in failover region):

FAILOVER SCENARIO:

Create a backup of Hudi Datalake S3 bucket on us-east-1 using AWS Backup and create a copy of that in region X.
When us-east-1 failure/issue occurs, then restore Hudi Datalake S3 bucket backup (created from us-east-1) into a different S3 bucket in region X.

FALLBACK SCENARIO (when us-east-1 is stable again, we want to bring over any Hudi table updates from region X back to primary region us-east-1):

Create a backup of Hudi Datalake S3 bucket on region X using AWS Backup and create a copy of that in us-east-1.
When ready to fallback to us-east-1, then restore Hudi Datalake S3 bucket backup (created from region X) into a S3 bucket in us-east-1.

Sounds like above scenario should work - won't be able to test this until AWS Backup for S3 is generally available (IN PREVIEW is limited to only us-west-2).

We can probably do something similar if we need to failover/fallback between different AWS accounts since AWS Backup supports cross region and cross account.

kazdy commented 2 years ago

@WTa-hash have you tried using savepoint feature? Maybe it'll allow you to deal with these issues when using s3 replication and missing .hoodie files? There will be some missing data, but maybe tables will not be corrupted?