apache / hudi

Upserts, Deletes And Incremental Processing on Big Data.
https://hudi.apache.org/
Apache License 2.0

[SUPPORT] Disaster Recovery (DR) Setup? Questions. #4241

Closed ghost closed 2 years ago

ghost commented 2 years ago

Are there any tips or support on setting up a Disaster Recovery (DR) environment with Apache Hudi?

We are creating our Datalake, stored on AWS S3, by running a Spark Structured Streaming application on AWS EMR. The Spark application processes incoming data from an AWS Kinesis stream, saves it into Hudi tables on S3, and syncs them with the AWS Glue catalog. All of this happens in a single AWS region (us-east-1).
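
For reference, a minimal sketch of such a pipeline in PySpark might look like the following. This assumes the EMR Kinesis connector and the Hudi bundle are on the classpath; the stream name, bucket, database, table, and field names are placeholders, and the exact Kinesis source options depend on the connector version.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kinesis-to-hudi").getOrCreate()

# Read from Kinesis (option names vary between Kinesis connectors).
events = (spark.readStream
    .format("kinesis")
    .option("streamName", "my-events-stream")   # placeholder stream
    .option("region", "us-east-1")
    .option("startingPosition", "LATEST")
    .load())

# ... parse the Kinesis payload into typed columns here ...

hudi_options = {
    "hoodie.table.name": "events",                                # placeholder
    "hoodie.datasource.write.recordkey.field": "event_id",        # placeholder
    "hoodie.datasource.write.precombine.field": "event_ts",       # placeholder
    "hoodie.datasource.write.partitionpath.field": "event_date",  # placeholder
    "hoodie.datasource.write.table.type": "COPY_ON_WRITE",
    "hoodie.datasource.hive_sync.enable": "true",   # sync to the Glue catalog via HMS on EMR
    "hoodie.datasource.hive_sync.mode": "hms",
    "hoodie.datasource.hive_sync.database": "datalake",           # placeholder
    "hoodie.datasource.hive_sync.table": "events",                # placeholder
}

(events.writeStream
    .format("hudi")
    .options(**hudi_options)
    .outputMode("append")
    .option("checkpointLocation", "s3://my-bucket/checkpoints/events/")  # placeholder bucket
    .start("s3://my-bucket/datalake/events/"))                           # placeholder bucket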

In the event that we need to fail over to a different region, or our main region (us-east-1) goes down, what is the suggested approach to start up again in another AWS region with our existing Datalake data? We can set up S3 replication to copy the parquet files (and .hoodie files) to an S3 bucket in a different AWS region, but S3 replication happens asynchronously, which means files may get replicated out of order and cause issues when querying (due to possibly missing files). We will also need to look at how to replicate Glue databases/tables from one AWS region to another, so that other AWS services and/or query engines can query.
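
As a rough illustration of the replication setup mentioned above, cross-region replication can be configured with boto3 roughly like this. Bucket names, prefix, and role ARN are placeholders; both buckets need versioning enabled, and replication remains asynchronous, so the ordering concern above still applies.

import boto3

s3 = boto3.client("s3", region_name="us-east-1")

s3.put_bucket_replication(
    Bucket="datalake-us-east-1",  # placeholder source bucket
    ReplicationConfiguration={
        "Role": "arn:aws:iam::123456789012:role/s3-replication-role",  # placeholder role
        "Rules": [{
            "ID": "replicate-hudi-tables",
            "Priority": 1,
            "Status": "Enabled",
            "Filter": {"Prefix": "datalake/"},  # covers both data files and .hoodie metadata
            "DeleteMarkerReplication": {"Status": "Enabled"},
            "Destination": {
                "Bucket": "arn:aws:s3:::datalake-us-west-2",  # placeholder DR bucket
                # Optional Replication Time Control: a 15-minute replication SLA,
                # which narrows (but does not remove) the out-of-order window.
                "ReplicationTime": {"Status": "Enabled", "Time": {"Minutes": 15}},
                "Metrics": {"Status": "Enabled", "EventThreshold": {"Minutes": 15}},
            },
        }],
    },
)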

Would love to hear some ideas/thoughts :) We are trying to work around an issue like this: https://www.datacenterdynamics.com/en/news/aws-us-east-1-outage-brings-down-services-around-the-world/

Environment Description

ghost commented 2 years ago

I ran a test where I deleted some .parquet files from a Hudi table (stored on S3) to simulate S3 replication lagging behind in the copy process, which would leave the target S3 bucket with missing data files. I then used AWS Athena to query the Hudi table with the missing data files.

Result:

  1. The query ran successfully despite the missing data files (.parquet) - AWS Athena did not error out.
  2. The returned results were incorrect (the count and data returned did not match the original Datalake). This makes sense because .parquet data files are missing.

Next, I wonder what happens if S3 replication is slow and cannot replicate the files in the .hoodie folder fast enough - what do SQL queries return on a Hudi table with missing .hoodie files?
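
For anyone who wants to reproduce this kind of test, a rough boto3 sketch is below. It deletes a couple of data files from a throwaway copy of the table and then queries it through Athena; the bucket, prefix, database, and table names are placeholders.

import time
import boto3

s3 = boto3.client("s3")
athena = boto3.client("athena", region_name="us-east-1")

BUCKET = "datalake-dr-test"    # placeholder: a disposable copy of the table, never the real one
PREFIX = "datalake/events/"    # placeholder table prefix

# Simulate replication lag by removing a few .parquet data files.
objects = s3.list_objects_v2(Bucket=BUCKET, Prefix=PREFIX).get("Contents", [])
parquet_keys = [o["Key"] for o in objects if o["Key"].endswith(".parquet")]
for key in parquet_keys[:2]:
    s3.delete_object(Bucket=BUCKET, Key=key)

# Query the damaged table via Athena and compare the count with the original table.
qid = athena.start_query_execution(
    QueryString="SELECT COUNT(*) FROM events",        # placeholder table
    QueryExecutionContext={"Database": "datalake"},   # placeholder database
    ResultConfiguration={"OutputLocation": "s3://datalake-dr-test/athena-results/"},
)["QueryExecutionId"]

while athena.get_query_execution(QueryExecutionId=qid)["QueryExecution"]["Status"]["State"] in ("QUEUED", "RUNNING"):
    time.sleep(2)

print(athena.get_query_results(QueryExecutionId=qid)["ResultSet"]["Rows"])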

kazdy commented 2 years ago

There is a preview available for AWS Backup for S3; this might be interesting to use in the future. AWS claims that you can do point-in-time recovery with it: "Continuous backups create point-in-time backups, and allow you to restore S3 resources to any point-in-time within the last 35 days."

ghost commented 2 years ago

Thanks @kazdy for your input. I read the documentation for AWS Backup and S3 support.

Based on the documentation, I am assuming we can do the following (using the us-east-1 issue that occurred on 2021-DEC-7 as the example: we want to use the latest Hudi tables from us-east-1 in the failover region):

FAILOVER SCENARIO:

  1. Create a backup of Hudi Datalake S3 bucket on us-east-1 using AWS Backup and create a copy of that in region X.
  2. When us-east-1 failure/issue occurs, then restore Hudi Datalake S3 bucket backup (created from us-east-1) into a different S3 bucket in region X.

FALLBACK SCENARIO (when us-east-1 is stable again, we want to bring over any Hudi table updates from region X back to primary region us-east-1):

  1. Create a backup of Hudi Datalake S3 bucket on region X using AWS Backup and create a copy of that in us-east-1.
  2. When ready to fallback to us-east-1, then restore Hudi Datalake S3 bucket backup (created from region X) into a S3 bucket in us-east-1.

Sounds like the above scenario should work (a rough sketch of it follows below) - we won't be able to test this until AWS Backup for S3 is generally available (the preview is limited to us-west-2 only).

We can probably do something similar if we need to failover/fallback between different AWS accounts, since AWS Backup supports cross-region and cross-account copies.
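
A hedged boto3 sketch of the failover flow described above (back up in us-east-1, copy the recovery point to region X, restore there). Vault names, ARNs, the account ID, and the choice of us-west-2 as "region X" are placeholders; since AWS Backup for S3 was still in preview when this was written, the restore metadata is taken from the recovery point itself rather than hard-coded.

import boto3

backup_east = boto3.client("backup", region_name="us-east-1")
backup_west = boto3.client("backup", region_name="us-west-2")   # "region X" placeholder

# 1. Back up the Datalake bucket in us-east-1 and copy the recovery point to region X.
job = backup_east.start_backup_job(
    BackupVaultName="datalake-vault",                             # placeholder vault
    ResourceArn="arn:aws:s3:::datalake-us-east-1",                # placeholder bucket
    IamRoleArn="arn:aws:iam::123456789012:role/aws-backup-role",  # placeholder role
)

# Once the backup job completes, its recovery point can be copied cross-region.
recovery_point_arn = backup_east.describe_backup_job(
    BackupJobId=job["BackupJobId"])["RecoveryPointArn"]

backup_east.start_copy_job(
    RecoveryPointArn=recovery_point_arn,
    SourceBackupVaultName="datalake-vault",
    DestinationBackupVaultArn="arn:aws:backup:us-west-2:123456789012:backup-vault:dr-vault",  # placeholder
    IamRoleArn="arn:aws:iam::123456789012:role/aws-backup-role",
)

# 2. On failover, restore the copied recovery point into a bucket in region X.
#    Reuse the recovery point's own restore metadata (adjusting the target bucket
#    as described in the AWS Backup documentation for S3 restores).
copied_rp_arn = "arn:aws:backup:us-west-2:123456789012:recovery-point:EXAMPLE"  # placeholder
metadata = backup_west.get_recovery_point_restore_metadata(
    BackupVaultName="dr-vault",
    RecoveryPointArn=copied_rp_arn,
)["RestoreMetadata"]

backup_west.start_restore_job(
    RecoveryPointArn=copied_rp_arn,
    Metadata=metadata,
    IamRoleArn="arn:aws:iam::123456789012:role/aws-backup-role",
)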

kazdy commented 2 years ago

@WTa-hash have you tried using savepoint feature? Maybe it'll allow you to deal with these issues when using s3 replication and missing .hoodie files? There will be some missing data, but maybe tables will not be corrupted?

nsivabalan commented 2 years ago

@xushiyan @bhasudha @bvaradar @yanghua: Do you folks have any pointers in this regard?

ghost commented 2 years ago

> @WTa-hash have you tried using savepoint feature? Maybe it'll allow you to deal with these issues when using s3 replication and missing .hoodie files? There will be some missing data, but maybe tables will not be corrupted?

No, I haven't. Can you link me to documentation about the savepoint feature?

nsivabalan commented 2 years ago

We don't have any documentation for it as such. You need to either use the writeClient directly or go via hudi-cli; hudi-cli is the recommended way.

But here is how you can do a savepoint and restore using hudi-cli:

// create a savepoint at a given commit
connect --path /tmp/hudi_trips_cow
commits show
set --conf SPARK_HOME=[SPARK_HOME_DIR]
savepoint create --commit 20220105222853592 --sparkMaster local[2]

// restore (roll the table back to a savepoint)
refresh
savepoint rollback --savepoint 20220106085108487 --sparkMaster local[2]

nsivabalan commented 2 years ago

btw, savepoint and restore for MOR was added just a few weeks back, so it is available only from 0.11. But it should work for COW on older releases too.

nsivabalan commented 2 years ago

I will try to add more documentation around savepoint/restore to our website.

nsivabalan commented 2 years ago

We also recently updated the instructions on how to use hudi-cli with an S3 dataset: https://hudi.apache.org/docs/next/cli/ - just in case you are interested.

nsivabalan commented 2 years ago

Closing this, as Hudi has savepoint and restore for both table types. Feel free to reopen or create a new GitHub issue if you need further assistance. Thanks!

kazdy commented 2 years ago

For anyone looking at this question now, I see there's documentation available for "current" version (0.11): https://hudi.apache.org/docs/next/disaster_recovery