[Feature] Introduce mod bucket generator to paimon

apache / paimon

Apache Paimon is a lake format that enables building a Realtime Lakehouse Architecture with Flink and Spark for both streaming and batch operations.

https://paimon.apache.org/

Apache License 2.0


[Feature] Introduce mod bucket generator to paimon #3538

Open Aitozi opened 3 months ago

Aitozi commented 3 months ago

Search before asking

[X] I searched in the issues and found nothing similar.

Motivation

Currently, when a bucket-key is defined on a Paimon table, the bucket is computed as hash(bucket-key) % bucket_number. However, there are cases where the target bucket has already been determined upstream, on the compute side.

For instance, consider a Paimon table that stores user ID bitmaps. We want to distribute users evenly into 32 buckets, so we calculate a bucket_no key as hash(user_id) % 32. Under the current mechanism, however, the bucket is recalculated as hash(bucket_no) % bucket_number, which can cause data skew across buckets, because different bucket_no values may hash to the same bucket.

So, how about allowing the bucket generator to be configured, e.g.:

HASH_MOD
MOD

With this, we can use MOD directly to distribute the data.
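A minimal sketch of the difference between the two proposed modes, for illustration only. The function names are hypothetical, and stand_in_hash is an arbitrary stand-in (SHA-256 based), not Paimon's actual hash function.

```python
import hashlib


def stand_in_hash(value: int) -> int:
    """Arbitrary deterministic hash used for illustration; NOT Paimon's real hash."""
    digest = hashlib.sha256(value.to_bytes(8, "little", signed=True)).digest()
    return int.from_bytes(digest[:4], "little")


def hash_mod_bucket(bucket_key: int, num_buckets: int) -> int:
    """HASH_MOD: hash the bucket key, then take the remainder (current behavior)."""
    return stand_in_hash(bucket_key) % num_buckets


def mod_bucket(bucket_key: int, num_buckets: int) -> int:
    """MOD: use the (already precomputed) bucket key directly."""
    return bucket_key % num_buckets


NUM_BUCKETS = 32
# bucket_no values that were already computed upstream as hash(user_id) % 32
bucket_nos = range(NUM_BUCKETS)

# Distinct target buckets actually used under each mode
hash_mod_used = {hash_mod_bucket(b, NUM_BUCKETS) for b in bucket_nos}
mod_used = {mod_bucket(b, NUM_BUCKETS) for b in bucket_nos}

print("HASH_MOD uses", len(hash_mod_used), "of", NUM_BUCKETS, "buckets")
print("MOD uses", len(mod_used), "of", NUM_BUCKETS, "buckets")
```

Under MOD, each of the 32 precomputed bucket_no values maps to its own bucket. Under HASH_MOD, nothing prevents two bucket_no values from landing in the same bucket, so some buckets may receive several values while others stay empty.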

Solution

No response

Anything else?

No response

Are you willing to submit a PR?

[X] I'm willing to submit a PR!

Aitozi commented 3 months ago

CC @JingsongLi WDYT?

JingsongLi commented 3 months ago

@Aitozi Sounds good to me.

zhongyujiang commented 3 months ago

@Aitozi @JingsongLi
IMO, while MOD mode can solve the issue here, its applicable scenarios are limited.

I think Iceberg's truncate partitioning might be a better solution. It can also solve the problem here and is applicable to a wider range of types.

It also has the advantage of maintaining data continuity (for numerical types like INT, LONG), which is very beneficial for primary key tables using auto-increment IDs. For STRING types, it can cluster records with the same prefix, which is beneficial for filtering with starts_with query predicates.
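A rough sketch of how a truncate transform behaves, based on my reading of the Iceberg spec (integers are truncated to a multiple of the width using a non-negative remainder, strings to a prefix). The function names truncate_int and truncate_str are made up for this example and are not an Iceberg or Paimon API.

```python
def truncate_int(value: int, width: int) -> int:
    """Round value down to the nearest multiple of width."""
    if width <= 0:
        raise ValueError("width must be positive")
    # Python's % yields a non-negative remainder for a positive width,
    # so negative values are also rounded down (toward negative infinity).
    return value - (value % width)


def truncate_str(value: str, length: int) -> str:
    """Keep at most the first `length` characters of value."""
    if length <= 0:
        raise ValueError("length must be positive")
    return value[:length]


# Consecutive IDs stay together: 100..109 all map to 100
print(sorted({truncate_int(i, 10) for i in range(100, 110)}))

# Strings sharing a prefix map to the same value
print(truncate_str("user_123", 4), truncate_str("user_456", 4))
```

Because neighboring integers share a truncated value, auto-increment IDs stay contiguous, and strings with a common prefix are grouped together, which is what helps prefix-style filters.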

Aitozi commented 3 months ago

@zhongyujiang Thank you for your input. It seems the term "bucket" refers to different concepts in Iceberg and Paimon. The "bucket" is one way to transform partitions in Iceberg. I agree that the truncate transform is more flexible. How about we also consider making the bucket generator function configurable? For example:

hash(x) % N
truncate(x) -- with different types having different implementations

CC @JingsongLi

zhongyujiang commented 3 months ago

The "bucket" is one way to transform partitions in Iceberg.

Yea, I think Iceberg refers to all methods of dividing datasets as partitioning.

+1 on introducing the truncate transform.

I wonder if it would be better not to bind concepts like truncate and hash bucketing to Paimon bucket, as this could lead to more flexible partitioning methods, such as truncate(col1, width), bucket(col2, numBuckets).