apache / hudi

Upserts, Deletes And Incremental Processing on Big Data.
https://hudi.apache.org/
Apache License 2.0

[SUPPORT] hudi is incurring emrfs eTag inconsistency issue with s3 and emrfs consistent view #1867

Closed tsolanki95 closed 4 years ago

tsolanki95 commented 4 years ago

I received the following error using the default installation of Hudi on EMR 5.29.0 (Hudi version 0.5.0):

RetryInvocationHandler: Exception while invoking ConsistencyCheckerS3FileSystem.open over null. Retrying after sleeping for 35000ms.
com.amazon.ws.emr.hadoop.fs.consistency.exception.ConsistencyException: eTag in metadata for File '<s3 path>/.hoodie_partition_metadata' does not match eTag from S3!

This appears to come from eTag verification in EMRFS consistent view, which checks that a file read from S3 is the latest version by comparing its eTag with the one stored in the DynamoDB table. We asked about this on Stack Overflow, and a commenter said it happens when files are written with the standard AWS SDK rather than through EMRFS. Does the current Hudi implementation work with EMRFS consistent view (which we adopted earlier to work around S3 eventual consistency issues in Spark)? If so, do we need to disable fs.s3.consistent.metadata.etag.verification.enabled?
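To make the suspected failure mode concrete, here is a toy model of the check described above. It is only a sketch: the class, method names, and path are hypothetical, a SHA-256 content hash stands in for the S3 eTag, and plain dictionaries stand in for S3 and the DynamoDB metadata table. It does not reflect how EMRFS is actually implemented.

```python
import hashlib


class ConsistencyError(Exception):
    """Raised when the tracked eTag and the stored object's eTag differ."""


def etag(data: bytes) -> str:
    # Assumption: a content hash stands in for the real S3 eTag.
    return hashlib.sha256(data).hexdigest()


class ToyConsistentView:
    """Hypothetical stand-in for a file system with eTag verification."""

    def __init__(self):
        self.objects = {}   # path -> bytes; plays the role of S3
        self.metadata = {}  # path -> eTag; plays the role of the DynamoDB table

    def write_tracked(self, path: str, data: bytes) -> None:
        # A write through the consistent view updates both stores.
        self.objects[path] = data
        self.metadata[path] = etag(data)

    def write_untracked(self, path: str, data: bytes) -> None:
        # A write that bypasses the view (e.g. a direct SDK call)
        # updates only the object store, so the metadata goes stale.
        self.objects[path] = data

    def open(self, path: str) -> bytes:
        data = self.objects[path]
        expected = self.metadata.get(path)
        if expected is not None and expected != etag(data):
            raise ConsistencyError(
                f"eTag in metadata for file '{path}' does not match stored object"
            )
        return data


view = ToyConsistentView()
path = "table/.hoodie_partition_metadata"  # hypothetical path
view.write_tracked(path, b"version 1")
view.write_untracked(path, b"version 2")   # overwrite outside the view
try:
    view.open(path)
except ConsistencyError as err:
    print(err)
```

In this model, any overwrite that skips the metadata update makes the next read fail, which would match the Stack Overflow explanation if some Hudi files are written or rewritten outside EMRFS.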

bvaradar commented 4 years ago

@umehrot2: Can you help answer this question? Thanks. Balaji.V

luffyd commented 4 years ago

@tsolanki95 Does this happen at read time? In my tests, I noticed that eTags were not in sync for the .hoodie folder. Also, what are your reasons for enabling consistent view when using Hudi?

tsolanki95 commented 4 years ago

@luffyd We adopted consistent view earlier, on the advice of AWS support, to fix duplicates in our data caused by running Spark against S3's eventual consistency model. We are now looking to move some of our datasets to Hudi, but our compute resources still use EMRFS consistent view. During the transition, when some of our datasets use Hudi and some do not, it would be good to be able to run Spark with Hudi on EMRFS consistent view.

tsolanki95 commented 4 years ago

This is also a field where data quality, precision, and accuracy are important. EMRFS consistent view helps us avoid issues with S3 consistency, and some of the features Hudi provides, such as rollback and the ability to audit and track changes made to our tables, are incredibly powerful for finding and isolating data quality errors, then rolling back and rerunning with fixed input data or code.

umehrot2 commented 4 years ago

@tsolanki95 Have you tried using hoodie.consistency.check.enabled instead? It is Hudi's built-in mechanism for avoiding eventual consistency issues.
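As a rough illustration of that suggestion, the option could be collected with the other Hudi write options and passed to the Spark writer. The helper name and table name below are hypothetical, and the commented-out writer call is an assumption about a typical Spark setup rather than something confirmed in this thread.

```python
def hudi_write_options(table_name: str, consistency_check: bool = True) -> dict:
    """Build Hudi write options as strings (hypothetical helper)."""
    return {
        "hoodie.table.name": table_name,
        # Hudi's built-in consistency check mentioned above.
        "hoodie.consistency.check.enabled": str(consistency_check).lower(),
    }


opts = hudi_write_options("example_table")  # hypothetical table name
print(opts)

# Assumption: with a Spark DataFrame `df` and a target `base_path`, the
# options would be applied roughly like this (other required Hudi options,
# such as the record key and precombine field, are omitted here):
#   df.write.format("org.apache.hudi").options(**opts).mode("append").save(base_path)
```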

As for this particular issue with EMRFS consistent view: are these temporary errors that resolve on retry, or are they causing the job to fail? Yes, disabling fs.s3.consistent.metadata.etag.verification.enabled could be a way forward if this is blocking you, while the EMR team investigates the issue.
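For reference, one way to express that workaround is through an emrfs-site configuration classification supplied when the cluster is created. This is a sketch that assumes the cluster is configured through EMR's JSON configuration list; the helper name is hypothetical, and the property names should be checked against the EMRFS documentation for the release in use.

```python
import json


def emrfs_site_config(etag_verification: bool) -> list:
    """Build an emrfs-site classification (hypothetical helper)."""
    return [
        {
            "Classification": "emrfs-site",
            "Properties": {
                # Keep consistent view itself enabled.
                "fs.s3.consistent": "true",
                # Toggle only the eTag verification step.
                "fs.s3.consistent.metadata.etag.verification.enabled": str(
                    etag_verification
                ).lower(),
            },
        }
    ]


print(json.dumps(emrfs_site_config(etag_verification=False), indent=2))
```

Turning the check off removes the protection it provides against reading a stale object version, so it is better treated as a temporary unblock than a fix.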

cc @bschell, who worked on the eTag feature in EMRFS. Do you see any obvious cause for this? Otherwise, we can have them open a ticket with AWS EMR support and investigate from there.

umehrot2 commented 4 years ago

Also, as a side note, we always recommend using the latest EMR releases, as they include the latest fixes and application versions. So you may want to use emr-5.30.1 instead.

bschell commented 4 years ago

@tsolanki95 As mentioned, it would be good to know the steps that you take to encounter this issue. Is this consistently reproducible? Does it resolve on retry? Otherwise it might be best to open a ticket with AWS EMR Support.

bvaradar commented 4 years ago

@tsolanki95 : This would be best addressed by opening a ticket with EMR support. Closing this ticket. Please reopen if this is specific to hudi.

absognety commented 3 years ago

@tsolanki95 What resolved this issue? I am facing the same problem when reading data written in Hudi format from S3.