[Open] Sheth-G opened this issue 9 months ago
torvalds.dev analysis of the ticket:
It seems like you're facing an issue with Apache Hudi where concurrent writes to the same partition are not being handled as expected, even after adjusting the lock retry/wait-time parameters. You're using DynamoDB as the lock provider and running these loads on AWS EMR, and you're seeing ConcurrentModificationException errors after about 30 minutes for roughly 40% of the jobs.
Here's a plan to help you debug and potentially resolve this issue:
Check the DynamoDB Locking Mechanism: Ensure that the DynamoDB locking mechanism is working as expected. You can do this by inspecting the DynamoDB table DatastoreWriteLockTable to confirm that locks are being acquired and released as expected; a sketch of such a check follows.
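For example, a minimal sketch of such a check, assuming the AWS SDK v2 for Java is available on the classpath (it ships with recent EMR releases); the table name and region come from the lock configuration in this issue:

```scala
import software.amazon.awssdk.regions.Region
import software.amazon.awssdk.services.dynamodb.DynamoDbClient
import software.amazon.awssdk.services.dynamodb.model.ScanRequest

// Minimal sketch: scan the Hudi lock table and print every lock row.
// A row that lingers long after its writer has finished can indicate a
// lock that was never released.
object LockTableInspector {
  def main(args: Array[String]): Unit = {
    val client = DynamoDbClient.builder().region(Region.US_EAST_1).build()
    val scan = ScanRequest.builder().tableName("DatastoreWriteLockTable").build()
    client.scan(scan).items().forEach(item => println(item))
    client.close()
  }
}
```

An equivalent spot check from a shell is `aws dynamodb scan --table-name DatastoreWriteLockTable`.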
Check the Retry Parameters: Verify that the retry parameters are being set and used correctly. You can do this by adding logging statements in your code to print out the values of these parameters at runtime.
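As a minimal sketch, assuming a Scala Spark job, you could log the lock-related options right before passing them to the writer (the values below are the ones from this issue):

```scala
// Minimal sketch: print the lock/retry options in effect so the driver log
// confirms what the writer actually received at runtime.
val lockOpts = Map(
  "hoodie.write.lock.num_retries" -> "60",
  "hoodie.write.lock.wait_time_ms" -> "360000",
  "hoodie.write.lock.wait_time_ms_between_retry" -> "360000",
  "hoodie.write.lock.client.wait_time_ms_between_retry" -> "60000",
  "hoodie.write.lock.max_wait_time_ms_between_retry" -> "600000"
)
lockOpts.foreach { case (k, v) => println(s"hudi lock option: $k = $v") }
```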
Check the Conflict Resolution Strategy: The stack trace indicates that the conflict resolution strategy is failing to resolve conflicts for overlapping writes. You can debug this by adding logging statements in the SimpleConcurrentFileWritesConflictResolutionStrategy class to understand why it's failing.
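A lighter-weight alternative to patching the class is to raise the log level for Hudi's transaction code path from the driver. A minimal sketch, assuming Spark 3.3's Log4j 2 logging backend:

```scala
import org.apache.logging.log4j.Level
import org.apache.logging.log4j.core.config.Configurator

// Minimal sketch, assuming Log4j 2 (the default backend in Spark 3.3):
// enable DEBUG logging for the package containing the conflict resolution
// strategy and for the transaction utilities seen in the stack trace.
Configurator.setLevel("org.apache.hudi.client.transaction", Level.DEBUG)
Configurator.setLevel("org.apache.hudi.client.utils.TransactionUtils", Level.DEBUG)
```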
Check the Order of Operations: From the code snippet you provided, it seems like you're creating commits and then checking for conflicts. Make sure that the order of operations is correct and that you're not checking for conflicts before all the necessary commits have been created.
Check the Apache Hudi Version: You're using Apache Hudi version 0.13.0. Check if there are any known issues related to concurrency control in this version. If there are, consider upgrading to a newer version.
Reach Out to the Apache Hudi Community: If you're still facing issues, consider reaching out to the Apache Hudi community. You can do this by creating an issue on the Apache Hudi GitHub page or by posting on their mailing list. Provide them with all the details of your issue, including the stack trace, the retry parameters you're using, and the version of Apache Hudi you're using.
Remember, the goal is to identify why the ConcurrentModificationException is being thrown and how to prevent it. This might involve adjusting the retry parameters, changing the conflict resolution strategy, or even modifying the way you're using Apache Hudi.
Describe the problem you faced

We have a use case where we might have a lot of concurrent writes to the same partition under special scenarios. We are testing whether Hudi supports this natively by changing some of the lock retry/wait-time parameters. We are trying to allow all these writers to eventually go through with optimistic retries, by setting very high num_retries and wait_time_ms parameters.

To Reproduce

We are using DynamoDB as the lock provider and running these loads on AWS EMR, with the following options related to concurrency control (a sketch of how these are wired into the writer appears at the end of this report):

```
hoodie.write.concurrency.mode -> optimistic_concurrency_control
hoodie.write.lock.client.wait_time_ms_between_retry -> 60000
hoodie.write.lock.max_wait_time_ms_between_retry -> 600000
hoodie.write.lock.num_retries -> 60
hoodie.write.lock.wait_time_ms -> 360000
hoodie.write.lock.wait_time_ms_between_retry -> 360000
hoodie.cleaner.policy.failed.writes -> LAZY
hoodie.write.lock.dynamodb.endpoint_url -> dynamodb.us-east-1.amazonaws.com
hoodie.write.lock.dynamodb.partition_key -> PrepareMergeJob-capitalcasetable-NA
hoodie.write.lock.dynamodb.table -> DatastoreWriteLockTable
hoodie.write.lock.provider -> org.apache.hudi.aws.transaction.lock.DynamoDBBasedLockProvider
hoodie.write.lock.dynamodb.region -> us-east-1
```

We spun up 7 jobs that write to the same table. Each job should take around ~20 minutes to finish on its own.

Expected behavior

These 7 jobs will have conflicting writes, will retry, and will eventually succeed. Based on the retry parameters I have set, I'd expect them to keep retrying for at least 4 hours.

Environment Description

Hudi version : 0.13.0
Spark version : 3.3
Hive version : 3.1.3
Hadoop version : 3.3.3
Storage (HDFS/S3/GCS..) : S3
Running on Docker? (yes/no) : no

Additional context

Running these workloads on EMR. This is a follow-up to this issue: https://github.com/apache/hudi/issues/9512

Stacktrace

Seeing these errors after about 30 minutes for 40% of the jobs:

```
java.util.ConcurrentModificationException: Cannot resolve conflicts for overlapping writes
	at org.apache.hudi.client.transaction.SimpleConcurrentFileWritesConflictResolutionStrategy.resolveConflict(SimpleConcurrentFileWritesConflictResolutionStrategy.java:108)
	at org.apache.hudi.client.utils.TransactionUtils.lambda$resolveWriteConflictIfAny$0(TransactionUtils.java:85)
	at java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1384)
	at java.util.stream.Streams$ConcatSpliterator.forEachRemaining(Streams.java:742)
	at java.util.stream.Streams$ConcatSpliterator.forEachRemaining(Streams.java:742)
	at java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:647)
```
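For reference, a minimal sketch of how the options above could be wired into a Spark DataFrame write (Scala; the table name and base path are placeholders, and record-key/precombine options are omitted):

```scala
import org.apache.spark.sql.{DataFrame, SaveMode}

// Minimal sketch: pass the concurrency-control options from this issue to a
// Hudi write. "my_table" and the S3 path are placeholders, not from the issue.
def writeWithOcc(df: DataFrame): Unit = {
  val hudiOpts = Map(
    "hoodie.table.name" -> "my_table", // placeholder
    "hoodie.write.concurrency.mode" -> "optimistic_concurrency_control",
    "hoodie.cleaner.policy.failed.writes" -> "LAZY",
    "hoodie.write.lock.provider" -> "org.apache.hudi.aws.transaction.lock.DynamoDBBasedLockProvider",
    "hoodie.write.lock.dynamodb.table" -> "DatastoreWriteLockTable",
    "hoodie.write.lock.dynamodb.partition_key" -> "PrepareMergeJob-capitalcasetable-NA",
    "hoodie.write.lock.dynamodb.region" -> "us-east-1",
    "hoodie.write.lock.dynamodb.endpoint_url" -> "dynamodb.us-east-1.amazonaws.com",
    "hoodie.write.lock.num_retries" -> "60",
    "hoodie.write.lock.wait_time_ms" -> "360000",
    "hoodie.write.lock.wait_time_ms_between_retry" -> "360000",
    "hoodie.write.lock.client.wait_time_ms_between_retry" -> "60000",
    "hoodie.write.lock.max_wait_time_ms_between_retry" -> "600000"
  )
  df.write.format("hudi").options(hudiOpts).mode(SaveMode.Append).save("s3://my-bucket/my_table") // placeholder path
}
```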