ChiehFu opened 1 year ago
@ChiehFu Is your incremental upsert touching all the partitions/file groups? If it touches a lot of file groups, that means it has to rewrite a lot of parquet files again.
Can you open this stage (Doing Partition and Writing data) and attach a screenshot of it as well?
@ad1happy2go
Yes, this particular upsert touched all 10 partitions of the table. Where should I check which file groups were touched?
These are the stage details for the "Doing Partition and Writing data" stage, which had 56 tasks writing 56 parquet files.
@ChiehFu You can check the .commit file in the timeline to see how many file groups it has touched.
To explain more, a COW table will rewrite every file group for which it has any incoming update. So even if your incremental data is small, if its updates overlap a lot of file groups the write may be slow.
What is your record key? Is it a random id?
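If it helps, here is a rough way to count that from the commit metadata. This is a sketch only: the table path and instant time are placeholders, and it assumes the usual partitionToWriteStats layout of the commit JSON.

// Sketch: count partitions and file groups touched by one commit by parsing its
// .commit JSON from the timeline. The bucket/table path and the instant time are
// placeholders; field names assume the usual HoodieCommitMetadata layout.
import com.fasterxml.jackson.databind.ObjectMapper
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import scala.collection.JavaConverters._
import scala.io.Source

val commitFile = new Path("s3://my-bucket/my_table/.hoodie/20231115103000000.commit") // placeholder
val in = commitFile.getFileSystem(new Configuration()).open(commitFile)
val root = try new ObjectMapper().readTree(Source.fromInputStream(in).mkString) finally in.close()

val partitionStats = root.get("partitionToWriteStats").fields().asScala.toSeq
val fileIds = partitionStats.flatMap(_.getValue.elements().asScala.map(_.get("fileId").asText()))
println(s"partitions touched: ${partitionStats.size}, file groups rewritten: ${fileIds.distinct.size}")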
@ad1happy2go I was able to find this table of commit details via the Hudi CLI, which I think displays information similar to what is in the Hudi commit file.
The record key is a composite key that consists of 9 columns of string and numeric types, so it would be pretty random rather than having any kind of ordering.
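For reference, a composite key like that is wired up through ComplexKeyGenerator roughly as in the sketch below (the column names are placeholders, not our actual schema):

// Placeholder sketch of how a composite record key is configured with ComplexKeyGenerator;
// the real key spans 9 string/numeric columns.
val keyGenOpts = Map(
  "hoodie.datasource.write.keygenerator.class" -> "org.apache.hudi.keygen.ComplexKeyGenerator",
  "hoodie.datasource.write.recordkey.field" -> "col_a,col_b,col_c,col_d", // comma-separated key columns (placeholders)
  "hoodie.datasource.write.partitionpath.field" -> "event_date"           // placeholder partition column
)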
The explanation makes sense, and I think the tables where this kind of slowness is observed are either tables with a single partition or tables where incremental updates touch the majority of partitions.
The part that I couldn't figure out is that those tables were initially created as Hudi 0.8 tables and had been running incremental upserts for quite a while without any sign of slowness in the writing stage. It wasn't until we recently upgraded them to Hudi 0.12.3 that we started seeing slowness in the writing stage, even though there has been no significant change in the nature of the incremental updates.
So the two questions below came to my mind:
@ChiehFu I don't think there is a significant fundamental difference in processing between 0.8 and 0.12 apart from the metadata table. How much difference are you seeing between these two versions?
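One way to narrow that down could be to re-run the same upsert with the metadata table turned off and compare the stage timings. A sketch, assuming the job's existing writer options are available as a map; "incrementalDf", "existingHudiOpts", and the path are placeholders:

// Sketch: override only the metadata flags for one test run to isolate their cost.
incrementalDf.write.format("hudi")
  .options(existingHudiOpts)                   // placeholder for the options the job already uses
  .option("hoodie.metadata.enable", "false")   // toggle under test
  .option("hoodie.metadata.validate", "false") // validation re-checks metadata against file listings
  .mode("append")
  .save("s3://my-bucket/my_table")             // placeholder path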
@ad1happy2go I think overall we observed an increase of up to 100% in the duration of upsert jobs after upgrading to Hudi 0.12.3, while there is no significant change in the data size of the upsert jobs.
The chart below shows the duration of the 10-minute incremental upsert jobs, measured in minutes; the upgrade was done on 11/11, from which point we started seeing increases in job duration.
Also, generally speaking, is it normal for a task in the writing stage "Doing partition and writing data (count at HoodieSparkSqlWriter.scala:721)" to take up to 10 minutes to write a parquet file of 100MB - 300MB to S3 storage? I wonder what else could contribute to the time taken in each task of that particular stage while writing parquet files.
Hello,
Recently we migrated our datasets from Hudi 0.8 to Hudi 0.12.3 and started experiencing slowness in the writing stage where parquet files are written to S3.
The numbers below were observed on a COW table that is 12 GB in size and has 10 partitions, with parquet file sizes roughly between 30MB and 300MB.
In an upsert job of 27,679 records with a total size of 26.8MB, we observed that each task in the writing stage was taking up to 10 minutes to write a parquet file ranging in size from 30MB to 300MB. Individual task duration seems directly correlated with the size of the parquet file the task wrote, which makes sense; however, spending 10 minutes writing a 300MB parquet file to S3 seems extremely long.
Can you please help us understand what might be causing such slowness in the writing stage and whether there is a way to improve the performance here?
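For comparison, one rough baseline would be to time a plain, non-Hudi parquet write of a similarly sized slice of the table from the same cluster, to separate raw S3/parquet throughput from Hudi-side work such as index lookup and merging. A sketch only; the paths, row count, and coalesce factor are placeholders:

// Sketch: baseline a plain parquet write to S3 so Hudi-side overhead can be compared against it.
val sample = spark.read.format("hudi").load("s3://my-bucket/my_table") // placeholder path
  .limit(500000) // roughly one file group's worth of rows (placeholder)
  .coalesce(1)   // one task writes one parquet file, mirroring the write stage

val start = System.nanoTime()
sample.write.mode("overwrite").parquet("s3://my-bucket/tmp/parquet_baseline") // placeholder path
println(f"plain parquet write took ${(System.nanoTime() - start) / 1e9}%.1f s")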
Complete spark job:
Writing stage:
Hudi commit metadata for the upsert job:
Environment Description
Hudi version : 0.12.3
Spark version : 3.1.3
Hive version : 3.1.3
Hadoop version : 3.3.3
Storage (HDFS/S3/GCS..) : S3
Running on Docker? (yes/no) : no
EMR: 6.10.0/6.10.1
Additional context
Hudi configs
hoodie.metadata.enable: true
hoodie.metadata.validate: true
hoodie.cleaner.commits.retained: 72
hoodie.keep.min.commits: 100
hoodie.keep.max.commits: 150
hoodie.datasource.write.payload.class: org.apache.hudi.common.model.DefaultHoodieRecordPayload
hoodie.index.type: BLOOM
hoodie.bloom.index.parallelism: 2000
hoodie.datasource.write.table.type: COPY_ON_WRITE
hoodie.insert.shuffle.parallelism: 500
hoodie.datasource.write.operation: upsert
hoodie.datasource.hive_sync.partition_extractor_class: org.apache.hudi.hive.MultiPartKeysValueExtractor
hoodie.datasource.write.keygenerator.class: org.apache.hudi.keygen.ComplexKeyGenerator