@yihua Could you please guide me here
Consider using the timeline server. It's designed for faster marker management.
@parisni Do you mean using this ? https://hudi.apache.org/docs/next/configurations#hoodiewritemarkerstype
Default is hoodie.write.markers.type: TIMELINE_SERVER_BASED
Or should I specify it explicitly?
Using Hudi 0.12.1
You are right, this is likely the default. You can make sure by looking into the marker directory while the write is in progress: .hoodie/.temp/<commit instant>
When the timeline server is used, a few large marker files are appended to. Otherwise one marker file is created per new parquet file, so for a very large commit with many written files there is an overhead creating/dropping them.
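For example, something like this (an untested PySpark sketch; the table name, fields, and S3 path are placeholders) would pin the marker type explicitly on the write:

```python
# Minimal sketch: set timeline-server-based markers explicitly instead of
# relying on the default. Table name, fields, and S3 path are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hudi-marker-type-example").getOrCreate()
df = spark.createDataFrame(
    [("id-1", "2023-06-09 00:00:00", 42)],
    ["uuid", "lastmodifieddate", "value"],
)

(df.write.format("hudi")
    .option("hoodie.table.name", "example_table")
    .option("hoodie.datasource.write.recordkey.field", "uuid")
    .option("hoodie.datasource.write.precombine.field", "lastmodifieddate")
    .option("hoodie.datasource.write.operation", "upsert")
    .option("hoodie.write.markers.type", "TIMELINE_SERVER_BASED")  # timeline-server-based markers
    .mode("append")
    .save("s3://example-bucket/example_table"))
```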
Could you share a screenshot of your Spark UI after job completion?
And the page after that
inside /.hoodie/.temp
20230515131506776 20230602131939547
s3:/
Then I confirm you use the timeline server. Also, from your stats I am not surprised that writing ~30k partitions and updating so many files takes 8 min (the "doing partition and writing files" job). The tagging and building-profile stages also look correct.
What is weird is the first web UI view. I wonder why you have those listings happening before the upsert starts. Since you have the MDT enabled, this cannot be partition listing to get the table files. Could you also share the first 40 jobs, to understand what's going on with the parallel file listing?
On June 9, 2023 9:12:58 PM UTC, Samarth Raval wrote:
Also, I have a large number of partitions; the last few commits look like this. The approximate number of partitions could be ~24,000 - 30,000 [assuming this, as I don't know how to count the number of partitions].
The sum of time from stage 0 to 57 is 40 min, so 2 hours are spent after stage 57 has finished; can you confirm? I'm not familiar with MOR tables, so I'm not sure what's going on after committing. Likely not cleaning or compaction, since those would show up as dedicated stages.
Have you looked at the executor logs to see if something happens there ?
On June 10, 2023 4:42:03 PM UTC, Samarth Raval wrote:
I have disabled the metadata [I was suspecting it was making my EMR job super slow, but I was wrong, as it still takes the same amount of time].
This is the first screenshot; 1st page:
2nd page screen shot: https://github.com/apache/hudi/issues/8925#issuecomment-1585103310
3rd page screen shot: https://github.com/apache/hudi/issues/8925#issuecomment-1585102673
Entire execution took around 2.5 hours:
After stage 40 is over, the EMR job just goes into an idle state doing nothing for more than 1 hour, which is very weird, and I cannot figure out what it is doing during that time.
Stages 0 - 40: good performance.
After that: just idle for > 1 hr -> this is the problem.
Stages 41 - 57: finishes off after that.
Compaction I am running async; cleaning I have never run [I disabled it; could not running cleaning for a long time create a problem?]
I can attach some EMR stats here showing that for some time the job just went idle.
Core nodes stats:
Task nodes stats:
In both of the above images you can see the drop where the job just sits idle: the core and task nodes both go to zero and sit there; after some time they come back, stages 41 - 57 execute, and the job finishes.
You can also see the I/O stats of the EMR cluster:
Something happens during the 1h37 gap between stages 43 and 44. It's not stage 40 as you previously said.
I disabled it; could not running cleaning for a long time create a problem?
Yeah, definitely a good thing to investigate. Turn on auto cleaning, triggered every 1 commit, without the MDT, and see if it improves after a few upsert batches.
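Something along these lines (an options sketch; I am assuming hoodie.clean.max.commits is the knob for the trigger frequency, so double-check the config reference for your Hudi version):

```python
# Sketch of the suggested experiment: automatic cleaning after every commit,
# with the metadata table disabled. Keys/values assumed, verify per version.
cleaning_options = {
    "hoodie.clean.automatic": "true",         # re-enable automatic cleaning
    "hoodie.clean.max.commits": "1",          # attempt a clean after each commit (assumed key)
    "hoodie.cleaner.commits.retained": "10",  # how many commits of older file versions to keep
    "hoodie.metadata.enable": "false",        # "w/o MDT" for this test
}
```

These would be passed on the Hudi write, e.g. df.write.format("hudi").options(**cleaning_options).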
Something happens during the 1h37 gap between stages 43 and 44. It's not stage 40 as you previously said.
Yes, you are correct. That is so weird, and it makes the job slow; do we know why it happens? I have put the EMR stats above as well, and I see the core/task nodes go down while it does nothing. Weird!
Do we know how to fix this idle state?
@parisni
I tried the cleaning stuff, but in between stages it was idle for more than an hour, where the EMR job was doing nothing. Have you ever seen that before, or has someone had something like this before?
Between stages 45 & 46 it was idle for almost ~1.2 hours
Did the cleaning eventually finish? I have already had such an issue with slow cleaning: in appearance it does nothing, but Spark is actually dealing with the S3 filesystem, which mainly uses networking.
I think the cleaning did actually finish, but the gap between stages 45 & 46 has a significant delay (as mentioned above). If we knew what the problem is, maybe the EMR job could finish in < 1.5 hrs [which would be best, and could help for other tables as well].
How many files/log files do you have in the partitions ?
Sorry, I am not really sure how I can give you exact numbers of files/log files/partitions.
Do you know how I can calculate those?
@parisni could you please guide me here ?
@SamarthRaval You can use aws s3 ls api to get the number of files and partitions.
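For example, a rough Python/boto3 sketch (bucket and table prefix below are placeholders) that counts data files and distinct partition paths under the table base path:

```python
# Count objects and distinct partition prefixes under a Hudi table path on S3.
# Bucket/prefix are hypothetical; the .hoodie metadata folder is skipped.
import boto3

bucket = "example-bucket"
prefix = "warehouse/transactions_all/"  # table base path, ending with "/"

s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")

files = 0
partitions = set()
for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
    for obj in page.get("Contents", []):
        key = obj["Key"]
        if "/.hoodie/" in key or key.endswith("/"):
            continue  # skip Hudi metadata and folder placeholder objects
        files += 1
        rel = key[len(prefix):]
        if "/" in rel:
            partitions.add(rel.rsplit("/", 1)[0])  # partition path = everything before the file name

print(f"data/log files: {files}, partitions: {len(partitions)}")
```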
@ad1happy2go @parisni
Number of partitions: 149,541
Number of files: 1,498,353
Size of the table: ~ 10-15 TB
Thanks. What's the distribution of files per partition? For example, how many files are in the largest partition?
Hey @SamarthRaval, based on the stacktrace and Spark UI screenshots you provided, it looks like the time-taking part is the meta sync / table refresh in Spark, which does not use the metadata table for file listing, even if the metadata table is present in the Hudi table. Could you try adding this config and see if it improves the latency of the stage: hoodie.meta.sync.metadata_file_listing = true?
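For example (a small options sketch; values assumed):

```python
# Sketch: let meta sync use the metadata table for file listing as well,
# instead of falling back to direct S3 listing.
meta_sync_options = {
    "hoodie.metadata.enable": "true",                  # metadata table on the write path
    "hoodie.meta.sync.metadata_file_listing": "true",  # the config suggested above
}
```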
@parisni
The number of files varies a lot across partitions;
the largest partition has ~30,000 files.
I see many partitions having ~5,000 to ~15,000 files.
Update - Had a discussion with @SamarthRaval. After running the cleaner the number of files decreased, and he will try the upsert again along with the fix Ethan suggested.
@SamarthRaval Gentle ping on this. Feel free to close if you are able to resolve it.
@ad1happy2go Still didn't get a chance to test this, will update soon here.
Hello guys,
I got the chance to experiment with the latest Hudi 0.13.1 and enabled all metadata-related configs to improve the performance:
"hoodie.metadata.enable", "hoodie.meta.sync.metadata_file_listing"
but I am still seeing the slowdown, and the Spark server goes into an idle state for more than an hour.
You can see the idle time in between stages, which is weird and is causing the performance bottleneck.
@ad1happy2go @yihua @parisni
Slowdown, with detailed Spark UI:
Hey @ad1happy2go @yihua, any chance you guys can let us know what may be happening during this time (stages 56 and 57 in Sam's screenshot above)? We see 1h+ being lost here after the deltacommit file has been written. Very confused as to what may be happening here; our best assumption is marker file deletion in S3, but those are only a few thousand objects and maybe MBs in size... we don't think this should take 1hr+ in the pipeline. tysm for the help
@noahtaite @SamarthRaval Can you please get the driver logs from when it gets stuck?
@ad1happy2go
Driver logs during the delay are attached. One interesting thing to note is that the driver logs are unavailable/restarted shortly before the delay. We are also using EMR managed scaling and notice that the cluster goes down to 1 master, 5 core, 0 task nodes during this time (from a maximum of 40 task nodes during the job).
https://gist.github.com/noahtaite/e0309969c05ea3a825ed41a3f2065e21
@ad1happy2go
We ran a job with the same input data but disabled the Hive sync to AWS Glue functionality and this performance bottleneck / missing SHS stage was not observed. The job completed successfully in just 50mins (acceptable performance).
Please advise if there is a way to optimize the AWS Glue sync. We noticed one flag was missing in our pipeline, "hoodie.datasource.hive_sync.use_jdbc" = "true", even though "hoodie.datasource.hive_sync.mode" = "hms". We are attempting another test with the use_jdbc flag set to false.
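For reference, the sync-related options under test look roughly like this (a sketch; the database and table names are placeholders, the rest follows the settings mentioned in this thread):

```python
# Hive/Glue sync options being tested: hms mode with the JDBC path disabled.
hive_sync_options = {
    "hoodie.datasource.hive_sync.enable": "true",
    "hoodie.datasource.hive_sync.mode": "hms",        # sync through the metastore (Glue) client
    "hoodie.datasource.hive_sync.use_jdbc": "false",  # the flag being flipped for this test
    "hoodie.datasource.hive_sync.database": "example_db",
    "hoodie.datasource.hive_sync.table": "example_table",
    "hoodie.datasource.hive_sync.partition_fields": "warehouse,year,month",
}
```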
@ad1happy2go @parisni @bhasudha @yihua @nsivabalan Could you guys please help here?
If hive sync is slow maybe try hoodie.datasource.hive_sync.filter_pushdown_enabled
Hello @parisni, as you suggested I tried the above config, but I started getting the below error, which I have never seen before:
23/08/11 21:46:38 ERROR Client: Application diagnostics message: User class threw exception: org.apache.hudi.exception.HoodieException: Could not sync using the meta sync class org.apache.hudi.aws.sync.AwsGlueCatalogSyncTool
  at org.apache.hudi.sync.common.util.SyncUtilHelpers.runHoodieMetaSync(SyncUtilHelpers.java:61)
  at org.apache.hudi.HoodieSparkSqlWriter$.$anonfun$metaSync$2(HoodieSparkSqlWriter.scala:888)
  at scala.collection.mutable.HashSet.foreach(HashSet.scala:79)
  at org.apache.hudi.HoodieSparkSqlWriter$.metaSync(HoodieSparkSqlWriter.scala:886)
  at org.apache.hudi.HoodieSparkSqlWriter$.commitAndPerformPostOperations(HoodieSparkSqlWriter.scala:984)
  at org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:381)
  at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:150)
  at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:47)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:75)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:73)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:84)
  at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:104)
  at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:107)
  at org.apache.spark.sql.execution.SQLExecution$.withTracker(SQLExecution.scala:250)
  at org.apache.spark.sql.execution.SQLExecution$.executeQuery$1(SQLExecution.scala:123)
  at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$9(SQLExecution.scala:160)
  at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:107)
  at org.apache.spark.sql.execution.SQLExecution$.withTracker(SQLExecution.scala:250)
  at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$8(SQLExecution.scala:160)
  at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:271)
  at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:159)
  at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:827)
  at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:69)
  at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:101)
  at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:97)
  at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:554)
  at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:107)
  at org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:554)
  at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDownWithPruning(LogicalPlan.scala:32)
  at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning(AnalysisHelper.scala:267)
  at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning$(AnalysisHelper.scala:263)
  at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:32)
  at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:32)
  at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:530)
  at org.apache.spark.sql.execution.QueryExecution.eagerlyExecuteCommands(QueryExecution.scala:97)
  at org.apache.spark.sql.execution.QueryExecution.commandExecuted$lzycompute(QueryExecution.scala:84)
  at org.apache.spark.sql.execution.QueryExecution.commandExecuted(QueryExecution.scala:82)
  at org.apache.spark.sql.execution.QueryExecution.assertCommandExecuted(QueryExecution.scala:142)
  at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:856)
  at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:387)
  at org.apache.spark.sql.DataFrameWriter.saveInternal(DataFrameWriter.scala:360)
  at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:239)
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:498)
  at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:760)
Caused by: org.apache.hudi.exception.HoodieException: Got runtime exception when hive syncing
  at org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:165)
  at org.apache.hudi.sync.common.util.SyncUtilHelpers.runHoodieMetaSync(SyncUtilHelpers.java:59)
  ... 56 more
Caused by: org.apache.hudi.hive.HoodieHiveSyncException: Failed to sync partitions for table
  at org.apache.hudi.hive.HiveSyncTool.syncPartitions(HiveSyncTool.java:429)
  at org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:280)
  at org.apache.hudi.hive.HiveSyncTool.doSync(HiveSyncTool.java:188)
  at org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:162)
  ... 57 more
Caused by: org.apache.hudi.hive.HoodieHiveSyncException: Partition fields and values should be same length, but got partitionFields: [] with values: [partition1, year1, month1]
  at org.apache.hudi.hive.util.PartitionFilterGenerator.lambda$generatePushDownFilter$5(PartitionFilterGenerator.java:187)
  at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
  at java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1384)
  at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482)
  at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472)
  at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)
  at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
  at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:566)
  at org.apache.hudi.hive.util.PartitionFilterGenerator.generatePushDownFilter(PartitionFilterGenerator.java:192)
  at org.apache.hudi.hive.HiveSyncTool.getTablePartitions(HiveSyncTool.java:381)
  at org.apache.hudi.hive.HiveSyncTool.syncPartitions(HiveSyncTool.java:423)
  ... 60 more
Likely this one only works with glue sync, not hive sync. BTW you could try using the new glue sync instead. It's more optimized for Glue than the HMS/JDBC API.
Hello guys, all my deltacommits are being written in < 1 hr, but so much time is being wasted deleting the marker directory [shown in the screenshot], and I never got a proper understanding of why exactly it is happening.
My configurations are as below:
hoodie.datasource.hive_sync.database -> prod_hudi_tier2
hoodie.datasource.hive_sync.mode -> hms
hoodie.datasource.hive_sync.support_timestamp -> true
path -> s3://transactions.all_hudi
hoodie.datasource.write.precombine.field -> lastmodifieddate
hoodie.datasource.hive_sync.partition_fields -> warehouse,year,month
hoodie.datasource.write.payload.class -> com.NullSafeDefaultHoodieRecordPayload
hoodie.datasource.hive_sync.skip_ro_suffix -> true
hoodie.metadata.enable -> true
hoodie.datasource.hive_sync.table -> transactions_all
hoodie.datasource.meta_sync.condition.sync -> true
hoodie.clean.automatic -> false
hoodie.datasource.write.operation -> upsert
hoodie.datasource.hive_sync.enable -> true
hoodie.datasource.write.recordkey.field -> uuid
hoodie.table.name -> transactions_all
hoodie.datasource.write.table.type -> MERGE_ON_READ
hoodie.datasource.write.hive_style_partitioning -> true
hoodie.datasource.write.reconcile.schema -> true
hoodie.datasource.write.keygenerator.class -> org.apache.hudi.keygen.ComplexKeyGenerator
hoodie.upsert.shuffle.parallelism -> 5760
hoodie.meta.sync.client.tool.class -> org.apache.hudi.aws.sync.AwsGlueCatalogSyncTool
hoodie.datasource.write.partitionpath.field -> warehouse,year,month
hoodie.compact.inline.max.delta.commits -> 25
I am also syncing to AWS Glue, if that is creating the problem, no idea? Or maybe the metadata is taking so much time? This is slowing down the entire pipeline.
I have put all the detailed screenshots and information in the Slack message.
Please let me know if you still need information.
Slack Message