Open Aitozi opened 3 months ago
CC @JingsongLi WDYT?
@Aitozi Sounds good to me.
@Aitozi @JingsongLi
IMO, while MOD mode can solve the issue here, its applicable scenarios are limited.
I think Iceberg's truncate partitioning might be a better solution. It can also solve the problem here and is applicable to a wider range of types.
It also has the advantage of maintaining data continuity (for numerical types like INT, LONG), which is very beneficial for primary key tables using auto-increment IDs. For STRING types, it can cluster records with the same prefix, which is beneficial for filtering with starts_with query predicates.
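To make the truncate idea concrete, here is a minimal sketch of how I understand the transform for integers and strings. The function names `truncate_int` and `truncate_str` are made up for illustration and are not Iceberg or Paimon APIs. The integer rule is "value minus its non-negative remainder", and the string rule is "keep the first L characters".

```python
def truncate_int(value: int, width: int) -> int:
    """Round value down to the nearest multiple of width (hypothetical helper)."""
    if width <= 0:
        raise ValueError("width must be positive")
    # Python's % returns a non-negative remainder for a positive width,
    # so this also rounds negative values down (toward negative infinity).
    return value - (value % width)


def truncate_str(value: str, length: int) -> str:
    """Keep only the first `length` characters (hypothetical helper)."""
    if length <= 0:
        raise ValueError("length must be positive")
    return value[:length]


if __name__ == "__main__":
    # Consecutive auto-increment IDs land in the same group.
    print([truncate_int(i, 100) for i in (1200, 1201, 1299, 1300)])
    # Strings sharing a prefix land in the same group.
    print([truncate_str(s, 3) for s in ("user_a", "user_b", "order_1")])
```

Since neighbouring IDs and shared prefixes map to the same truncated value, range filters and starts_with filters only need to touch a few groups.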
@zhongyujiang Thank you for your input. It seems the term "bucket" refers to different concepts in Iceberg and Paimon. The "bucket" is one way to transform partitions in Iceberg.
I agree that the truncate transform is more flexible. How about we also consider supporting a configurable generator function for the bucket? For example:
hash(x) % N
truncate(x) -- with different types having different implementations
CC @JingsongLi
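A rough sketch of what a pluggable bucket function could look like, assuming the key is an integer. All names here (`hash_mod`, `mod`, `bucket_sizes`) are hypothetical, and CRC32 is only a stand-in hash, not the hash Paimon actually uses.

```python
import zlib
from typing import Callable, Iterable, List

# A bucket function maps (key, number of buckets) to a bucket index.
BucketFunction = Callable[[int, int], int]


def hash_mod(key: int, num_buckets: int) -> int:
    """hash(x) % N, using CRC32 as a stand-in hash (an assumption)."""
    return zlib.crc32(str(key).encode("utf-8")) % num_buckets


def mod(key: int, num_buckets: int) -> int:
    """x % N: use the key directly, with no extra hashing."""
    return key % num_buckets


def bucket_sizes(keys: Iterable[int], fn: BucketFunction, num_buckets: int) -> List[int]:
    """Count how many keys land in each bucket under the given function."""
    counts = [0] * num_buckets
    for key in keys:
        counts[fn(key, num_buckets)] += 1
    return counts


if __name__ == "__main__":
    precomputed = range(32)  # bucket_no values already computed upstream
    print("mod:     ", bucket_sizes(precomputed, mod, 32))
    print("hash_mod:", bucket_sizes(precomputed, hash_mod, 32))
```

With `mod`, each precomputed bucket_no in 0..31 keeps its own bucket. With `hash_mod`, the 32 values are hashed again, so several of them can land in the same bucket while other buckets stay empty.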
The "bucket" is one way to transform partitions in Iceberg.
Yea, I think Iceberg refers to all methods of dividing datasets as partitioning.
+1 on introducing the truncate transform.
I wonder if it would be better not to bind concepts like truncate and hash bucketing to Paimon bucket, as this could lead to more flexible partitioning methods, such as truncate(col1, width), bucket(col2, numBuckets).
Motivation
Currently, when we define the bucket-key in the Paimon table, the bucket is generated using hash(bucket-key) % bucket_number. However, there are cases where we have already determined the target bucket at the computing edge.
For instance, let's consider a Paimon table that stores user ID bitmaps. We aim to distribute users into 32 buckets evenly. Therefore, we calculate a bucket_no key using hash(user_id) % 32. Yet, under the current mechanism, it will be recalculated as hash(bucket_no) % bucket_number, potentially causing data skew within buckets due to hash collisions.
So, what about allowing the definition of the bucket generator, e.g.:
With this, we can directly use MOD to distribute the data.
Solution
No response
Anything else?
No response
Are you willing to submit a PR?