Open danielfordfc opened 1 year ago
cc: @rahil-c
Would this not have been a problem if we'd have configured a lock provider? https://hudi.apache.org/docs/concurrency_control
Regardless, YARN seemingly continuing to run the deltastreamer after the EMR Step has been cancelled is definitely problematic.
Describe the problem you faced
Apache Hudi's deltastreamer utility on EMR, with YARN as the scheduler. I've noticed a rather peculiar behaviour that has started causing sporadic errors in my jobs.
When an EMR job is cancelled either manually by the user or programmatically (in my case, through Airflow), it seems that occasionally YARN doesn't receive the termination message and continues running the job in the background. This behavior is quite misleading, as the EMR Console indicates the job has been stopped, while in reality, it is still running on the cluster.
The issue is further complicated when a cancelled job is rerun. It appears that in these scenarios, two instances of the deltastreamer job run concurrently, without awareness of each other. Considering the nature of deltastreamer, which is designed to run as a single process for a given Hudi table, this parallel execution leads to several inconsistencies and problems.
Symptoms
The manifestation of this issue is quite unpredictable, but some symptoms include:
These symptoms have led to sporadic failures of the jobs and have made our process quite unstable.
For instance, please see a Single deltastreamer ingestion commit timeline, vs., when we have this issue. Jobs spuriously fail out between minutes and hours from starting up, when running the deltastreamer in
--continuous
modeReproduction Steps
Expected behavior
The deltastreamer is somehow aware of the duplicate runs going on? I'm not even 100% certain that that IS whats happening..
Environment Description
emr-6.8.0 S3 Storage Syncing Hive to Glue Catalog through Hudi Deltastreamer.
Additional Context
original slack thread about the issue: https://apache-hudi.slack.com/archives/C4D716NPQ/p1690212730633249
AWS Support ticket response after a couple calls with them and back and forth communications:
Sample Deltastreamer spark-submit command (formatted for readability)
Stacktraces
Some of the stack traces we've seen highly likely due to this issue.
It seems to want to clean up an already deleted file.
Failure to rollback error
Concurrent Glue Sync Error
23/07/07 21:03:23 ERROR Client: Application diagnostics message: User class threw exception: org.apache.hudi.exception.HoodieException: org.apache.hudi.exception.HoodieException: Could not sync using the meta sync class org.apache.hudi.hive.HiveSyncTool at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.lambda$sync$1(HoodieDeltaStreamer.java:192) at org.apache.hudi.common.util.Option.ifPresent(Option.java:97) at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.sync(HoodieDeltaStreamer.java:187) at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.main(HoodieDeltaStreamer.java:555) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:742) Caused by: java.util.concurrent.ExecutionException: org.apache.hudi.exception.HoodieException: Could not sync using the meta sync class org.apache.hudi.hive.HiveSyncTool at java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357) at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1908) at org.apache.hudi.async.HoodieAsyncService.waitForShutdown(HoodieAsyncService.java:103) at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.lambda$sync$1(HoodieDeltaStreamer.java:190) ... 8 more Caused by: org.apache.hudi.exception.HoodieException: Could not sync using the meta sync class org.apache.hudi.hive.HiveSyncTool at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer$DeltaSyncService.lambda$startService$0(HoodieDeltaStreamer.java:715) at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1604) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:750) Caused by: org.apache.hudi.exception.HoodieException: Could not sync using the meta sync class org.apache.hudi.hive.HiveSyncTool at org.apache.hudi.sync.common.util.SyncUtilHelpers.runHoodieMetaSync(SyncUtilHelpers.java:61) at org.apache.hudi.utilities.deltastreamer.DeltaSync.syncMeta(DeltaSync.java:715) at org.apache.hudi.utilities.deltastreamer.DeltaSync.writeToSink(DeltaSync.java:634) at org.apache.hudi.utilities.deltastreamer.DeltaSync.syncOnce(DeltaSync.java:333) at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer$DeltaSyncService.lambda$startService$0(HoodieDeltaStreamer.java:681) ... 4 more Caused by: org.apache.hudi.exception.HoodieException: Got runtime exception when hive syncing listing_created_v2 at org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:149) at org.apache.hudi.sync.common.util.SyncUtilHelpers.runHoodieMetaSync(SyncUtilHelpers.java:59) ... 8 more Caused by: org.apache.hudi.hive.HoodieHiveSyncException: Failed to get update last commit time synced to Option{val=20230707210236484} at org.apache.hudi.hive.HoodieHiveClient.updateLastCommitTimeSynced(HoodieHiveClient.java:300) at org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:241) at org.apache.hudi.hive.HiveSyncTool.doSync(HiveSyncTool.java:158) at org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:146) ... 9 more Caused by: MetaException(message:Update table failed due to concurrent modifications. (Service: AWSGlue; Status Code: 400; Error Code: ConcurrentModificationException; Request ID: 01e53510-78d3-4d76-b091-eeb39c99625e; Proxy: null)) at com.amazonaws.glue.catalog.converters.BaseCatalogToHiveConverter.getHiveException(BaseCatalogToHiveConverter.java:112) at com.amazonaws.glue.catalog.converters.BaseCatalogToHiveConverter.wrapInHiveException(BaseCatalogToHiveConverter.java:100) at com.amazonaws.glue.catalog.metastore.GlueMetastoreClientDelegate.alterTable(GlueMetastoreClientDelegate.java:569) at com.amazonaws.glue.catalog.metastore.AWSCatalogMetastoreClient.alter_table(AWSCatalogMetastoreClient.java:407) at com.amazonaws.glue.catalog.metastore.AWSCatalogMetastoreClient.alter_table(AWSCatalogMetastoreClient.java:392) at sun.reflect.GeneratedMethodAccessor312.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.apache.hadoop.hive.metastore.HiveMetaStoreClient$SynchronizedHandler.invoke(HiveMetaStoreClient.java:2350) at com.sun.proxy.$Proxy83.alter_table(Unknown Source) at org.apache.hudi.hive.HoodieHiveClient.updateLastCommitTimeSynced(HoodieHiveClient.java:298) ... 12 more