apache / hudi

Upserts, Deletes And Incremental Processing on Big Data.
https://hudi.apache.org/
Apache License 2.0
5.32k stars 2.41k forks source link

[SUPPORT] Performance of Snapshot Exporter #6970

Closed ft-bazookanu closed 1 year ago

ft-bazookanu commented 1 year ago

Increasing spark.executor.memory or spark.executor.cores worsens performance of HUDI Exporter

To Reproduce

Steps to reproduce the behavior:

  1. Run the HUDI exporter varying spark.executor.instances, spark.executor.memory and spark.executor.cores image

Expected behavior

  1. Performance should not worsen if we increase spark.executor.memory and spark.executor.cores while keeping spark.executor.instances constant.

We also hoped to have better performance in general, on par with s3 cp. What can we do to improve Exporter's performance?

Environment Description

Additional context

nsivabalan commented 1 year ago

can you try setting partitioner --output-partitioner this should help improve performance.

also, are you setting ay value for --output-partition-field ?

ft-bazookanu commented 1 year ago

@nsivabalan thanks for the response. --output-format is set to hudi so --output-partitioner is not an option. Is there anything else we can do?

nsivabalan commented 1 year ago

can you try setting --output-partition-field. that should also repartition your data based on this field and will increase your parallelism.

ft-bazookanu commented 1 year ago

Please see https://hudi.apache.org/docs/snapshot_exporter/ partitioner configs are ignored when the output format is hudi. Moreover we're using this as a backup and do not want to repartition. I feel my issue is orthogonal to partitioning:

Most of the time is spent exporting the contents of .hoodie/, which appears to be happening serially (not parallel).

xushiyan commented 1 year ago

Please see https://hudi.apache.org/docs/snapshot_exporter/ partitioner configs are ignored when the output format is hudi. Moreover we're using this as a backup and do not want to repartition. I feel my issue is orthogonal to partitioning:

  • why does performance decrease on increasing memory/cores per executor?
  • why does performance saturate at 16 executors, although the table has far more than 16 partitions?

Most of the time is spent exporting the contents of .hoodie/, which appears to be happening serially (not parallel).

@ft-bazookanu thanks for raising the problems. there are a few points for perf improvements in exporter, for example, the parallelism was not set properly for list partitions and base files to copy. also when copy commit files under .hoodie/ , it is currently made as serial in a for loop. This jira HUDI-712 was filed long time back and deprioritized. Now we can pick it up since it benefits real use cases now.

Gatsby-Lee commented 2 months ago

@xushiyan

Q1. As of version 0.13.0, the default shuffle parallelism is set to 0 (as per the documentation). Could your PR and the updated default shuffle parallelism negatively affect the exporter's performance?

Q2. what is the recommended value for this config? Q3. when should this config be set by user?

So far, I have noticed that the number of output partitions is significantly lower than before.