[SUPPORT] Benchmarking Hudi Simple Index and Record level indexing for COW

bibhu107 commented 3 months ago

We would like to ask the community to benchmark Record Level Indexing (RLI) against Simple Indexing. The blog at https://hudi.apache.org/blog/2023/11/01/record-level-index/ provides a great comparison between RLI and Global Simple Indexing. However, we also need to understand how RLI compares with (non-global) Simple Indexing, since RLI can stand in for simple indexing in certain use cases, even though it is primarily designed for scenarios where record keys are unique across all partitions.

Our current approach is to hash the ContractId (hoodie_record_key), use the first three characters of the hash as the partition, and apply simple indexing. However, this approach doesn't scale well because of data skew. We want to evaluate whether RLI is suitable for our use case. If it isn't, we would appreciate suggestions for a better indexing strategy.
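For reference, the partitioning scheme described above can be sketched roughly as follows. This is only an illustration: the hash function (MD5, hex-encoded) and the helper names `partition_for` and `partition_skew` are assumptions, not details of the actual pipeline. The second helper gives one simple way to quantify skew, as the ratio of the largest partition to the mean partition size.

```python
import hashlib
from collections import Counter


def partition_for(contract_id: str, prefix_len: int = 3) -> str:
    """Hash the record key and keep the first `prefix_len` hex characters.

    MD5 is an assumption here; the real job may use a different hash.
    """
    digest = hashlib.md5(contract_id.encode("utf-8")).hexdigest()
    return digest[:prefix_len]


def partition_skew(contract_ids):
    """Return (max partition size / mean partition size, per-partition counts).

    `contract_ids` has one entry per record, so a key that occurs many
    times contributes many records to the same partition.
    """
    counts = Counter(partition_for(cid) for cid in contract_ids)
    if not counts:
        return 0.0, counts
    mean = sum(counts.values()) / len(counts)
    return max(counts.values()) / mean, counts
```

A ratio close to 1 means records are spread evenly across partitions; a much larger ratio points at a few hot partitions, which is the kind of skew that hurts the per-partition lookups done by simple indexing.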

Note: We currently use simple indexing instead of the costly global simple indexing. We can consider adopting RLI if it offers the same or reduced cost.

Environment Description

Hudi version : Currently using 0.7; might migrate to 0.14
Spark version : 3.3.1
Hive version : ApacheHive-3.1.3
Hadoop version : Hadoop-3.3.4
Storage (HDFS/S3/GCS..) : S3
Running on Docker? (yes/no) : No

Additional context : Running on EMR-EC2

cc - @codope

ad1happy2go commented 3 months ago

@bibhu107 RLI is mainly designed to be used as a GLOBAL index, so in your use case you may not need to create a custom partition column to improve performance. RLI should work well, because it identifies the file groups through the record index, so it doesn't matter whether the data is partitioned or not.
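As a starting point for comparing the two setups, here is a minimal sketch of the writer options for each index type. The `hoodie.*` keys are standard Hudi configs (the record index needs the metadata table and Hudi 0.14 or later); the table name `contracts_cow`, the precombine column `updated_at`, the partition column `contract_hash_prefix`, and the helper name `hudi_write_options` are hypothetical placeholders.

```python
def hudi_write_options(index_type, partition_field=None,
                       table_name="contracts_cow"):
    """Build Hudi writer options for a COW upsert with the given index type.

    Table and column names are placeholders; adjust them to the real schema.
    """
    opts = {
        "hoodie.table.name": table_name,
        "hoodie.datasource.write.table.type": "COPY_ON_WRITE",
        "hoodie.datasource.write.operation": "upsert",
        "hoodie.datasource.write.recordkey.field": "ContractId",
        "hoodie.datasource.write.precombine.field": "updated_at",
    }
    if partition_field is not None:
        opts["hoodie.datasource.write.partitionpath.field"] = partition_field

    if index_type == "SIMPLE":
        opts["hoodie.index.type"] = "SIMPLE"
    elif index_type == "RECORD_INDEX":
        # The record level index is stored in the metadata table.
        opts["hoodie.index.type"] = "RECORD_INDEX"
        opts["hoodie.metadata.enable"] = "true"
        opts["hoodie.metadata.record.index.enable"] = "true"
    else:
        raise ValueError(f"unsupported index type: {index_type}")
    return opts


# Current setup: simple index over the hash-prefix partition column.
simple_opts = hudi_write_options("SIMPLE", "contract_hash_prefix")
# Candidate setup: RLI; the partitioning choice is left to the caller.
rli_opts = hudi_write_options("RECORD_INDEX")
```

With PySpark, either dict could be passed as `df.write.format("hudi").options(**opts).mode("append").save(base_path)`, where `df` and `base_path` are your own DataFrame and table path. Running the same upsert batches against both tables and comparing the index lookup stage in the Spark UI would give a like-for-like comparison.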

bibhu107 commented 3 months ago

@ad1happy2go, thank you. I will run some experiments with Simple Indexing and RLI and attach the results here for reference.

Thanks.

apache/hudi #11194