@rishabhreply Sorry, but I am a bit confused. Do you really want to use insert_overwrite in this case? If you submit two parallel jobs with insert_overwrite, one is going to overwrite the other's data in any case. Even if you run them sequentially, you will still lose the data ingested by the first one. So you can only use insert_overwrite if you want to process all 10 files in one batch.
Let me know in case I am not thinking in the right direction.
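For illustration, this is roughly what each of the parallel jobs would be doing with insert_overwrite (a minimal PySpark sketch; the bucket, table name, and field names below are hypothetical placeholders, not taken from the original setup):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Each parallel Glue job run executes something like this on its own subset
# of files. With insert_overwrite, every run replaces the whole matched
# partition, so the run that commits last wins and earlier data is lost.
df = spark.read.json("s3://my-bucket/incoming/batch-a/")  # hypothetical path

hudi_options = {
    "hoodie.table.name": "my_table",                          # hypothetical
    "hoodie.datasource.write.recordkey.field": "id",          # hypothetical
    "hoodie.datasource.write.partitionpath.field": "ingest_date",
    "hoodie.datasource.write.operation": "insert_overwrite",
}

df.write.format("hudi").options(**hudi_options).mode("append") \
    .save("s3://my-bucket/hudi/my_table/")                    # hypothetical path
```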
Hi @ad1happy2go , thank you for the response. You understood it correctly. Now my question is: which write operation should I use instead of insert_overwrite for such cases?
@rishabhreply Two options:
1. Run one job sequentially first with operation type "insert_overwrite" and then run the parallel jobs with "insert". The problem with this is that, since all jobs write to the same partition, you can't use OCC either.
2. Simply run with all files together. The Glue job will parallelise these files and process them in parallel anyway (see the sketch below).
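A minimal sketch of the second option, assuming all 10 files sit under one S3 prefix (the paths and Hudi options are hypothetical placeholders):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# One Glue/Spark job reads all 10 files at once. Spark distributes the read
# and the write across executors, so the files are still processed in
# parallel, but there is only a single Hudi commit to the partition.
df = spark.read.json("s3://my-bucket/incoming/")  # hypothetical prefix holding all 10 files

hudi_options = {
    "hoodie.table.name": "my_table",                          # hypothetical
    "hoodie.datasource.write.recordkey.field": "id",          # hypothetical
    "hoodie.datasource.write.partitionpath.field": "ingest_date",
    "hoodie.datasource.write.operation": "insert_overwrite",
}

df.write.format("hudi").options(**hudi_options).mode("append") \
    .save("s3://my-bucket/hudi/my_table/")                    # hypothetical path
```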
@ad1happy2go thank you for the response. My aim is the second option you mentioned, but my concern is how Hudi will behave if multiple Glue jobs get triggered in parallel because of the number of files (10). Will Hudi/Glue be able to write all the data from all 10 files, or will there be discrepancies in the written data?
@rishabhreply It will handle and process all 10 files. Processing files in parallel is a standard Spark/distributed-computing concept. Let me know in case I am missing anything.
@ad1happy2go Okay, so if I ingest 10 files altogether and my step function triggers multiple Glue job instances to process them, then there will be no data discrepancy in the data written by the jobs. Thank you for the effort!
Here is the confusion. There should be one Glue job instance triggered to load all 10 files. The Glue job will run in parallel and can process all the files in parallel.
Sorry about that. Let me try to rephrase it. In S3 I have 10 files, and I have a state machine consisting of one Glue job with Hudi parameters set; in particular, the partition_key will be the same for all the files. The state machine has the batch value set to 2 and max concurrency set to 5. In case you are not aware, this means the state machine will create 5 batches of size 2 and distribute them to 5 Glue job instances. So 5 instances of the Glue job are reading and will write to the same destination under the same partition. My question is: will there be a discrepancy in the data written to the target because of this parallelization in this setting?
@rishabhreply This won't work. Running 5 separate jobs concurrently with insert_overwrite does not make sense, as I commented earlier.
But why can't you set concurrency to 1 and batch size to 10? Let Glue process all the files/data in parallel.
@ad1happy2go I see, but with your suggested approach the single Glue run will process all 10 files, meaning it will take more time since one job has 10 files to process. To avoid such situations, I wanted to leverage Step Functions, which can run concurrent instances of the same Glue job to distribute and process the 10 files in batches. :)
@rishabhreply Sorry for the delayed response, but why do you think it will be slow? Those 10 files will not be processed sequentially; they will be processed in parallel by Spark, depending on the available resources on the Glue cluster.
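If the concern is whether a single job really spreads the 10 files across tasks, one way to sanity-check the parallelism (a hedged sketch, with a hypothetical input prefix):

```python
# Spark splits the input into partitions; each partition is handled by a
# separate task, so the 10 files are read and transformed concurrently as
# long as the Glue cluster has enough executor slots.
df = spark.read.json("s3://my-bucket/incoming/")  # hypothetical prefix
print(df.rdd.getNumPartitions())  # number of parallel tasks the read produced

# If small files end up in too few partitions, the work can be spread
# explicitly before writing:
df = df.repartition(10)
```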
@rishabhreply Let me know in case of any confusion on this. Feel free to close this issue if you feel all good.
Closing this. Thanks.
Describe the problem you faced
It is not a problem but rather a question that I could not find in the FAQs. Please let me know if it is unacceptable to ask here.
I have data coming in multiple files (let's say 10 files) for one table, and all of them will have the same value in the partition column. My setup is a state machine with Glue parallelization enabled. Let's say I have set batch size = 2 and concurrency = 5 in the state machine; this means the state machine will trigger 5 parallel Glue job instances and give each instance 2 files to process. I am using the insert_overwrite Hudi operation.
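For illustration, the batching described above effectively does the following grouping before launching the Glue job instances (a minimal sketch with a hypothetical file list):

```python
# 10 incoming files (hypothetical names) grouped the way the Step Functions
# Map state would batch them: 5 batches of 2, all started concurrently,
# each Glue run writing insert_overwrite to the SAME partition.
files = [f"s3://my-bucket/incoming/file_{i}.json" for i in range(10)]
batch_size = 2
batches = [files[i:i + batch_size] for i in range(0, len(files), batch_size)]
print(batches)  # 5 batches of 2 files each
```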
Q1. In this setting, how will Hudi work given that not all Glue job instances might finish at the same time? Will I see any Hudi errors? Or will it "overwrite" the data written by the Glue job instances that finished earlier?
Environment Description
Hudi version :
Spark version :
Hive version :
Hadoop version :
Storage (HDFS/S3/GCS..) : S3
Running on Docker? (yes/no) :