apache / hudi

Upserts, Deletes And Incremental Processing on Big Data.
https://hudi.apache.org/
Apache License 2.0

[SUPPORT] - Performance Variation in Hudi 0.14 #11481

Open RuyRoaV opened 2 weeks ago

RuyRoaV commented 2 weeks ago

Describe the problem you faced

We have a Glue 4.0 job that performs an upsert on a Hudi-managed COW table. On some occasions the Glue job runs in under 5 minutes, whereas on others it runs for up to 20 minutes. Moreover, we have noticed that, in the slow runs, the job spends over 17 minutes on the action count at HoodieSparkSqlWriter.scala:1072; in other job runs this takes only around 1 minute.

Regarding some specifications for the table:

We have 3 partition fields:

A precombine field:

and 3 recordkey fields:

You can see more about the table description here:

Screenshot 2024-06-21 at 13 31 19

We are also using a BLOOM type index and these are some other configurations that we are setting.

Screenshot 2024-06-21 at 13 12 16

Could you please advise us on which actions we should take to bring down the execution time?

Expected behavior

We would like to understand why we are seeing this variation in execution times, and to get advice on the actions needed to prevent this behaviour.

Environment Description

ad1happy2go commented 2 weeks ago

@RuyRoaV Can you provide the event logs or Spark UI screenshots?

On configurations, I recommend not archiving beyond the savepoint. You can also try the SIMPLE index once, since for use cases where most of the file groups are updated, the SIMPLE index performs much better.
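As a rough illustration of the two suggestions above, the writer options could be assembled like this. This is only a sketch: the table name and field names are hypothetical placeholders (the real ones are not shown in this thread), key generator settings are omitted, and the option keys are the standard Hudi writer configs as I understand them for 0.14.

```python
from typing import Dict, List


def build_upsert_options(
    table_name: str,
    record_keys: List[str],
    partition_fields: List[str],
    precombine_field: str,
) -> Dict[str, str]:
    """Return Hudi writer options for a COW upsert that uses the SIMPLE index."""
    return {
        "hoodie.table.name": table_name,
        "hoodie.datasource.write.table.type": "COPY_ON_WRITE",
        "hoodie.datasource.write.operation": "upsert",
        # Multiple key or partition fields are passed as a comma-separated list.
        "hoodie.datasource.write.recordkey.field": ",".join(record_keys),
        "hoodie.datasource.write.partitionpath.field": ",".join(partition_fields),
        "hoodie.datasource.write.precombine.field": precombine_field,
        # Suggestion 1: try SIMPLE instead of BLOOM.
        "hoodie.index.type": "SIMPLE",
        # Suggestion 2: do not archive beyond the savepoint.
        "hoodie.archive.beyond.savepoint": "false",
    }


# Hypothetical table and field names, for illustration only.
options = build_upsert_options(
    table_name="example_table",
    record_keys=["key_a", "key_b", "key_c"],
    partition_fields=["part_a", "part_b", "part_c"],
    precombine_field="updated_at",
)

# In the Glue job the dict would then be passed to the Spark writer, e.g.:
# df.write.format("hudi").options(**options).mode("append").save(base_path)
```

Only the last two entries differ from a BLOOM-based setup, so switching back for comparison is a one-line change.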

RuyRoaV commented 2 weeks ago

Hello @ad1happy2go

I have attached some screenshots of the Spark UI. Is there any specific screen that you'd like to see?

Screenshot 2024-06-24 at 13 23 33

Screenshot 2024-06-24 at 13 23 53

Thanks for the input, I will take that into account. In some other GitHub issues I have also seen a change to an RLI index being recommended. Would that work for a COW table, or would the SIMPLE index still be a better approach?

Best regards,

ad1happy2go commented 2 weeks ago

@RuyRoaV RLI will work if you need a global index. It works for COW tables as well.
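For reference, switching an existing set of writer options to the record level index might look like the sketch below. It assumes the Hudi 0.14 config keys `hoodie.metadata.record.index.enable` and `hoodie.index.type=RECORD_INDEX`; the helper name and the table name are made up for the example.

```python
from typing import Dict


def with_record_level_index(options: Dict[str, str]) -> Dict[str, str]:
    """Return a copy of the writer options with the record level index enabled."""
    updated = dict(options)  # leave the caller's dict untouched
    updated.update(
        {
            # RLI is stored in the metadata table, so that must be enabled too.
            "hoodie.metadata.enable": "true",
            "hoodie.metadata.record.index.enable": "true",
            "hoodie.index.type": "RECORD_INDEX",
        }
    )
    return updated


# Hypothetical starting point: a table currently written with the BLOOM index.
base_options = {"hoodie.table.name": "example_table", "hoodie.index.type": "BLOOM"}
rli_options = with_record_level_index(base_options)
```

The helper returns a new dict, so the same job can keep both variants around while comparing runtimes.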

RuyRoaV commented 5 days ago

Hi Aditya

I have tried out your recommendation and found the following:

Using SIMPLE INDEX

The average execution time was reduced from 20 min to around 11 min, which is great. In the Spark UI screenshot, you can see that a large share of the execution time is taken by a countByKey at JavaPairRDD action in the SparkCommitUpsert executor, especially during the Shuffle Write part.

Screenshot 2024-07-03 at 16 44 57 Screenshot 2024-07-03 at 16 52 19 Screenshot 2024-07-03 at 16 52 46

We need to reduce the job runtime even further. Is there any other recommendation regarding the configurations that we can set?

We may try deactivating archival beyond the savepoint a bit later, but I am curious: why would that help improve performance?

Using RECORD LEVEL

I replaced the index for a table whose upsert Glue job was already running in under 5 minutes. Overall, the job runtime has remained the same, with the count at HoodieSparkSqlWriter.scala:1072 action during SparkCommitUpsert still standing out in the execution. This is similar to the case presented when submitting this ticket.

I'll try with one of our long running jobs and will let you know the outcome.

By the way, is there a way to check the index type of a table?

Thanks

Best regards