apache / hudi

Upserts, Deletes And Incremental Processing on Big Data.
https://hudi.apache.org/
Apache License 2.0

[SUPPORT] Low performance with upserts on S3 storage #6188

Open floriandaniel opened 2 years ago

floriandaniel commented 2 years ago

Problem

I'm testing the ability of Apache Hudi to make upserts faster than the functions we currently use on Spark. Each record contains 40 fields. The partitioning key is country_iso (a string field) with 200 distinct values. The partitions are quite unbalanced (US and China have many more records than the others). The problem is that I'm getting very slow performance even with small datasets (~1 Gb). I'm updating a string field that is neither the partitioning key nor the record key. The ratio of updates in my upsert dataset is 100%.

This could come from the way my Parquet files are partitioned, from the unbalanced partitions, or from the choice of partitioning key...
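To make the setup concrete, here is a minimal sketch (an assumption, not the actual benchmark code) of how such a 100%-update batch could be built from the source data: existing records are re-sent with only a non-key string field changed. The paths and the field name some_string_field are made up for illustration.

```scala
import org.apache.spark.sql.functions.{col, concat, lit}

// Hypothetical location of the source sample (e.g. sample_10).
val src = spark.read.parquet("s3a://my-bucket/src/sample_10/")

// Take a subset of existing records and modify only a non-key, non-partition
// string field, so every record in the batch is an update (100% update ratio).
val updates = src
  .sample(0.04)   // ~4% of the source, matching the table below (3.5 M updates out of 87 M records)
  .withColumn("some_string_field", concat(col("some_string_field"), lit("_v2")))
```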

Environment Description

Additional context


Hudi Config

hoodie.index.type = BLOOM/SIMPLE
hoodie.bloom.index.prune.by.ranges = false
hoodie.metadata.enable = true
hoodie.enable.data.skipping = true
hoodie.metadata.index.column.stats.enable = true
hoodie.bloom.index.use.metadata = true
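
Continuing the sketch above, here is how an upsert with these options could be issued through the Spark DataSource writer from a spark-shell. The table name, S3 base path, and the record-key/precombine field names (record_id, ts) are assumptions for illustration; country_iso is the partition field described earlier.

```scala
import org.apache.spark.sql.SaveMode

// `updates` is the hypothetical update batch from the earlier sketch.
updates.write.format("hudi").
  option("hoodie.table.name", "hudi_sample_10").
  option("hoodie.datasource.write.operation", "upsert").
  option("hoodie.datasource.write.recordkey.field", "record_id").       // assumed record key
  option("hoodie.datasource.write.partitionpath.field", "country_iso").
  option("hoodie.datasource.write.precombine.field", "ts").             // assumed precombine field
  // Configs listed above.
  option("hoodie.index.type", "BLOOM").                                 // or "SIMPLE"
  option("hoodie.bloom.index.prune.by.ranges", "false").
  option("hoodie.metadata.enable", "true").
  option("hoodie.enable.data.skipping", "true").
  option("hoodie.metadata.index.column.stats.enable", "true").
  option("hoodie.bloom.index.use.metadata", "true").
  mode(SaveMode.Append).
  save("s3a://my-bucket/hudi/hudi_sample_10")                           // hypothetical table base path
```

The results below compare the SIMPLE and BLOOM index types under these settings.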
| sample | src parquet (nb of records / size) | Updates (nb of records / size) | Upsert S3 - Simple index (time in mins) | Upsert S3 - Bloom index (time in mins) |
|---|---|---|---|---|
| 1 | 8.7 M records (0.9 Gb) | 0.35 M records (0.05 Gb) | 1.80 | 1.88 |
| 10 | 87 M records (7.9 Gb) | 3.5 M records (0.55 Gb) | 10.5 | 21.5 |
| 25 | 217 M records (18.7 Gb) | 8.7 M records (1.1 Gb) | 27.05 | 110.5 |
For example, for sample_10 I got the following results (the two most costly tasks per index type):
SIMPLE
  • Building workload profile: SIMPLE_hudi_sample_10 (countByKey at HoodieJavaPairRDD.java:104) -- 1.5 min
  • Doing partition and writing data: SIMPLE_hudi_sample_10 (count at HoodieSparkSqlWriter.scala:643) -- 8.1 min
BLOOM
  • Building workload profile: BLOOM_hudi_sample_10 (countByKey at HoodieJavaPairRDD.java:104) -- 13 min -- IMAGE 1
  • Doing partition and writing data: BLOOM_hudi_sample_10 (count at HoodieSparkSqlWriter.scala:643) -- 8.0 min -- IMAGE 2

The image below shows the partition /BN, with very small Parquet files. [image: partition_bn]

Here is the Spark trace of an upsert with the Bloom index (sample_10): [image: trace bloom sample 10]

IMAGE 1. Building workload profile: BLOOM_hudi_sample_10 (duration: 13 min)

IMAGE 2. Doing partition and writing data: BLOOM_hudi_sample_10 (duration: ~8 min)

nsivabalan commented 2 years ago

@alexeykudinkin: can you take a look at this?

alexeykudinkin commented 2 years ago

Hey, @floriandaniel! Thanks for taking the time to file such a detailed description.

First of all, I believe the crux of the problem likely lies in the use of the Bloom Index with the metadata table: we've recently identified a performance gap there, and @yihua is currently working on addressing it (there's already https://github.com/apache/hudi/pull/6432 in progress).

Second, I'd recommend you do the following in your evaluation:

  1. Try Hudi 0.12, which was recently released (we've done a lot of performance benchmarking and optimization during the last release cycle specifically to make sure Hudi's performance is top of the line).
  2. Disable hoodie.bloom.index.use.metadata for now (until the above fix lands).
  3. Is there any particular reason you're switching off hoodie.bloom.index.prune.by.ranges? It's a crucial aspect of using the Bloom Index: for update-heavy workloads it prunes the search space considerably, checking only the files that could contain the target records (and eliminating the ones that couldn't). A sketch of these settings as write options follows below.
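
For reference, here is a minimal sketch of what points 2 and 3 could look like as write options, reusing the assumed table name, key fields, and paths from the sketches earlier in the issue (this is only an illustration of the settings discussed above, not code from the thread).

```scala
import org.apache.spark.sql.SaveMode

// Same hypothetical upsert as before, with the index-related options adjusted
// per the suggestions above (run against Hudi 0.12).
updates.write.format("hudi").
  option("hoodie.table.name", "hudi_sample_10").
  option("hoodie.datasource.write.operation", "upsert").
  option("hoodie.datasource.write.recordkey.field", "record_id").       // assumed record key
  option("hoodie.datasource.write.partitionpath.field", "country_iso").
  option("hoodie.datasource.write.precombine.field", "ts").             // assumed precombine field
  option("hoodie.index.type", "BLOOM").
  option("hoodie.bloom.index.prune.by.ranges", "true").                 // point 3: keep range pruning enabled
  option("hoodie.bloom.index.use.metadata", "false").                   // point 2: disabled until the fix lands
  option("hoodie.metadata.enable", "true").
  mode(SaveMode.Append).
  save("s3a://my-bucket/hudi/hudi_sample_10")                           // hypothetical table base path
```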