I ran a test where I deleted some .parquet files from a Hudi table (stored on S3) to simulate S3 replication lagging behind in the copy process, leaving the target S3 bucket with missing data files. I then used AWS Athena to query the Hudi table with the missing data files.
Result:
Next, I wonder what happens if S3 replication is slow and cannot replicate the files in the .hoodie folder fast enough: what happens when I run SQL queries on a Hudi table with missing .hoodie files?
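For reference, the deletion step in a test like this can be scripted roughly as follows; a minimal sketch using the AWS SDK for Java v1, where the bucket and prefix names are placeholders (pointing the prefix at the table's .hoodie/ folder instead would simulate the second scenario):

```scala
import com.amazonaws.services.s3.AmazonS3ClientBuilder
import scala.collection.JavaConverters._

object SimulateMissingFiles {
  def main(args: Array[String]): Unit = {
    // Placeholder names: point these at the *replicated* copy of the Hudi table.
    val bucket = "my-replicated-datalake-bucket"
    val prefix = "hudi/trips/"

    val s3 = AmazonS3ClientBuilder.standard().withRegion("us-east-2").build()

    // List data files under the table path (first page only; enough for a test)
    // and delete a couple of .parquet files to mimic objects that S3 replication
    // has not copied over yet.
    val parquetKeys = s3.listObjectsV2(bucket, prefix)
      .getObjectSummaries.asScala
      .map(_.getKey)
      .filter(_.endsWith(".parquet"))

    parquetKeys.take(2).foreach { key =>
      println(s"Deleting s3://$bucket/$key")
      s3.deleteObject(bucket, key)
    }
  }
}
```

After deleting the files, the table can be queried from Athena again to observe the behavior.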
There is a preview available for AWS Backup for S3; this might be interesting to use in the future. AWS claims that you can do point-in-time recovery with it: "Continuous backups create point-in-time backups, and allow you to restore S3 resources to any point-in-time within the last 35 days."
Thanks @kazdy for your input. I read the documentation for AWS Backup and its S3 support.
Based on the documentation, I am assuming we can do the following (using the us-east-1 outage that occurred on 2021-DEC-07 as an example, where we want to use the latest Hudi tables from us-east-1 in a failover region):
FAILOVER SCENARIO:
FALLBACK SCENARIO (when us-east-1 is stable again, we want to bring over any Hudi table updates from region X back to primary region us-east-1):
Sounds like the above scenario should work - we won't be able to test this until AWS Backup for S3 is generally available (the preview is limited to us-west-2 only).
We can probably do something similar if we need to fail over / fall back between different AWS accounts, since AWS Backup supports cross-region and cross-account backups.
@WTa-hash have you tried using the savepoint feature? Maybe it will allow you to deal with these issues when using S3 replication and missing .hoodie files? There will be some missing data, but maybe the tables will not be corrupted?
@xushiyan @bhasudha @bvaradar @yanghua: Do you folks have any pointers in this regard?
No, I haven't. Can you link me to documentation about the savepoint feature?
We don't have any documentation for it as such. You need to either use the writeClient directly or go via hudi-cli; hudi-cli is the recommended way.
But here is how you can do savepoint and restore using hudi-cli:

// connect to the table and inspect its commit timeline
connect --path /tmp/hudi_trips_cow
commits show
set --conf SPARK_HOME=[SPARK_HOME_DIR]

// create a savepoint at a chosen commit instant
savepoint create --commit 20220105222853592 --sparkMaster local[2]

// restore: refresh the timeline, then roll the table back to the savepoint
refresh
savepoint rollback --savepoint 20220106085108487 --sparkMaster local[2]
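If you prefer the writeClient route mentioned above, an untested sketch along these lines should be close; the table path, payload class, and instant times below are placeholders, and the exact classes may differ between Hudi releases:

```scala
import org.apache.hudi.client.SparkRDDWriteClient
import org.apache.hudi.client.common.HoodieSparkEngineContext
import org.apache.hudi.common.model.OverwriteWithLatestAvroPayload
import org.apache.hudi.config.HoodieWriteConfig
import org.apache.spark.api.java.JavaSparkContext
import org.apache.spark.sql.SparkSession

object SavepointViaWriteClient {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("hudi-savepoint")
      .master("local[2]")
      .getOrCreate()

    // Placeholder table path and name.
    val basePath = "/tmp/hudi_trips_cow"
    val writeConfig = HoodieWriteConfig.newBuilder()
      .withPath(basePath)
      .forTable("hudi_trips_cow")
      .build()

    val engineContext = new HoodieSparkEngineContext(new JavaSparkContext(spark.sparkContext))
    val client = new SparkRDDWriteClient[OverwriteWithLatestAvroPayload](engineContext, writeConfig)

    // Create a savepoint at a specific commit instant (placeholder instant time).
    client.savepoint("20220105222853592", "ops-user", "savepoint before failover test")

    // ...and later, if needed, restore the table back to that savepoint.
    client.restoreToSavepoint("20220105222853592")

    client.close()
    spark.stop()
  }
}
```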
By the way, savepoint and restore for MOR tables was added just a few weeks back, so it is available only from 0.11 onwards. For COW tables it should work with older releases too.
I will try to add more documentation around savepoint/restore to our website.
We also recently updated the instructions on how to use hudi-cli with an S3 dataset: https://hudi.apache.org/docs/next/cli/, in case you are interested.
Closing this, as Hudi has savepoint and restore for both table types. Feel free to reopen or create a new GitHub issue if you need further assistance. Thanks!
For anyone looking at this question now, there is documentation available for the "current" version (0.11): https://hudi.apache.org/docs/next/disaster_recovery
Are there any tips or support on setting up a Disaster Recovery (DR) environment with Apache Hudi?
We are building our data lake, stored on AWS S3, by running a Spark Structured Streaming application on AWS EMR. The Spark application processes incoming data from an AWS Kinesis stream, saves it into Hudi tables on S3, and syncs them with the AWS Glue catalog. All of this happens in a single AWS region (us-east-1).
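For context, the write path is roughly like the sketch below (simplified, not our actual job; it assumes a Kinesis Structured Streaming source is available on the EMR cluster, and the stream, field, database, and bucket names are placeholders):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

object KinesisToHudi {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("kinesis-to-hudi").getOrCreate()

    // Read from Kinesis (option names depend on the Kinesis connector bundled on EMR).
    val events = spark.readStream
      .format("kinesis")
      .option("streamName", "my-events-stream")
      .option("endpointUrl", "https://kinesis.us-east-1.amazonaws.com")
      .option("startingPosition", "LATEST")
      .load()

    // In practice the binary Kinesis 'data' column is parsed into event columns
    // here; that transformation is omitted for brevity.
    val parsed = events

    // Write the stream into a Hudi table and sync it to the Glue catalog
    // (on EMR, Hive sync targets Glue when Glue is configured as the Hive metastore).
    parsed.writeStream
      .format("hudi")
      .option("hoodie.table.name", "events")
      .option("hoodie.datasource.write.recordkey.field", "event_id")
      .option("hoodie.datasource.write.precombine.field", "event_ts")
      .option("hoodie.datasource.write.partitionpath.field", "event_date")
      .option("hoodie.datasource.hive_sync.enable", "true")
      .option("hoodie.datasource.hive_sync.database", "datalake")
      .option("hoodie.datasource.hive_sync.table", "events")
      .option("checkpointLocation", "s3://my-datalake-bucket/checkpoints/events/")
      .trigger(Trigger.ProcessingTime("60 seconds"))
      .outputMode("append")
      .start("s3://my-datalake-bucket/hudi/events/")
      .awaitTermination()
  }
}
```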
In the event that we need to fail over to a different region, or our main region (us-east-1) goes down, what is the suggested approach for starting up again in another AWS region with our existing data lake? We can set up S3 replication to replicate the parquet files (and .hoodie files) to another S3 bucket in a different AWS region, but S3 replication happens asynchronously, which means files may be replicated out of order and cause issues when querying (due to possibly missing files). We will also need to look at how to replicate Glue databases/tables from one AWS region to another, so that other AWS services and/or query engines can query.
Would love to hear some ideas/thoughts :) We are trying to work around an issue like this: https://www.datacenterdynamics.com/en/news/aws-us-east-1-outage-brings-down-services-around-the-world/
Environment Description
Hudi version : 0.7.0-amzn-1
Spark version : 2.4.7
Hive version : 2.3.7
Hadoop version : 2.10.1
Storage (HDFS/S3/GCS..) : S3
Running on Docker? (yes/no) : no