apache / iceberg

Apache Iceberg
https://iceberg.apache.org/
Apache License 2.0
Table has more than one bucket key, but "show create table xxx" only displays one #11090

Open madeirak opened 3 weeks ago

madeirak commented 3 weeks ago

Apache Iceberg version

1.4.3

Query engine

Spark

Please describe the bug 🐞

[image] The output of "select * from xx.xx.partitions" above shows that this table has two bucket keys, but "show create table xx.xx", shown below, displays only one bucket key. [image]

manuzhang commented 1 week ago

The table has two partition keys from two partition transforms, one of which is bucket.

madeirak commented 1 week ago

[image] Are these two partition transforms, name_bucket_10 and id_bucket_10, equivalent?

Are both of them based on hashing?
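For reference, a minimal sketch of how a bucket value can be computed, assuming the behaviour described in the Iceberg table spec: a 32-bit Murmur3 hash (x86 variant, seed 0) followed by (hash & Integer.MAX_VALUE) % N, with int values hashed as 8-byte little-endian longs and strings as UTF-8 bytes. The function names here are hypothetical and this is not Iceberg's own code; it only illustrates that bucketing an int column and a string column would use the same hashing principle.

```python
import struct

MASK32 = 0xFFFFFFFF


def _mix(k: int) -> int:
    """Scramble one 32-bit block before it is folded into the hash state."""
    k = (k * 0xCC9E2D51) & MASK32
    k = ((k << 15) | (k >> 17)) & MASK32  # rotate left by 15
    return (k * 0x1B873593) & MASK32


def murmur3_x86_32(data: bytes, seed: int = 0) -> int:
    """Unsigned 32-bit MurmurHash3 (x86 variant) of data."""
    h = seed & MASK32
    n_blocks = len(data) // 4
    for i in range(n_blocks):
        k = struct.unpack_from("<I", data, i * 4)[0]
        h ^= _mix(k)
        h = ((h << 13) | (h >> 19)) & MASK32  # rotate left by 13
        h = (h * 5 + 0xE6546B64) & MASK32
    # Fold in the 1 to 3 trailing bytes, if any.
    tail = data[n_blocks * 4:]
    if tail:
        k = 0
        for i, byte in enumerate(tail):
            k |= byte << (8 * i)
        h ^= _mix(k)
    # Final avalanche.
    h ^= len(data)
    h ^= h >> 16
    h = (h * 0x85EBCA6B) & MASK32
    h ^= h >> 13
    h = (h * 0xC2B2AE35) & MASK32
    h ^= h >> 16
    return h


def bucket_int(value: int, num_buckets: int) -> int:
    """Hypothetical helper: bucket of an int, hashed as an 8-byte LE long."""
    hashed = murmur3_x86_32(struct.pack("<q", value))
    return (hashed & 0x7FFFFFFF) % num_buckets


def bucket_str(value: str, num_buckets: int) -> int:
    """Hypothetical helper: bucket of a string, hashed as UTF-8 bytes."""
    hashed = murmur3_x86_32(value.encode("utf-8"))
    return (hashed & 0x7FFFFFFF) % num_buckets


# Values from the example row (id = 1, name = '1').
print("id_bucket_10:", bucket_int(1, 10))
print("name_bucket_10:", bucket_str("1", 10))
```

Under these assumptions the two transforms differ only in how the source value is serialized to bytes before hashing, so they are the same kind of transform applied to different columns.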

manuzhang commented 1 week ago

Sorry, I missed name_bucket_10 part. How did you create your table? With which catalog?

madeirak commented 1 week ago

Similar to the following process:

create table dbxx.tbxx (id INT COMMENT '11', name STRING COMMENT '') USING iceberg PARTITIONED BY (name, bucket(10, name), bucket(10, id));
insert into tbxx values (1, '1');
show create table dbxx.tbxx;
select * from dbxx.tbxx.partitions;

madeirak commented 1 week ago

With HiveCatalog

lurnagao-dahua commented 1 week ago

I am quite puzzled why name is used both as a partition column and as a bucket column. In this case, all the data under a given name partition falls into the same bucket, so bucketing by name has no effect.
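A small sketch of why that is: a bucket is a deterministic function of the value, so every row with the same name gets the same name bucket. Note the assumptions: zlib.crc32 is only a stand-in for a deterministic hash (it is not the hash Iceberg uses), and the helper names and sample rows are made up for illustration.

```python
import zlib
from collections import defaultdict


def stand_in_bucket(value: str, num_buckets: int) -> int:
    """Stand-in bucket function; crc32 is NOT Iceberg's hash, just deterministic."""
    return zlib.crc32(value.encode("utf-8")) % num_buckets


def buckets_per_name(names, num_buckets=10):
    """Map each identity-partition value (name) to the set of name buckets seen."""
    seen = defaultdict(set)
    for name in names:
        seen[name].add(stand_in_bucket(name, num_buckets))
    return dict(seen)


# Hypothetical rows: only the name column matters here.
rows = ["alice", "bob", "alice", "carol", "bob"]
result = buckets_per_name(rows)
for name, buckets in result.items():
    print(name, "->", len(buckets), "distinct bucket(s)")
```

Each name maps to exactly one bucket, so the bucket(10, name) field cannot split a name partition any further. Bucketing on a different column, such as id, can still split it.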

madeirak commented 1 week ago

This is just an example, not a real table. The main issue is that when a table has multiple bucket fields, "show create table xxx" displays only one of them.

manuzhang commented 1 week ago

The show create table result follows Spark SQL syntax, which only supports one bucket field.

madeirak commented 1 week ago

OK, fine. It would be better if the output matched the form shown in the Iceberg documentation: [image] Ref: https://iceberg.apache.org/docs/latest/spark-ddl/#partitioned-by
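To make the request concrete, here is a rough sketch of the kind of rendering meant, where every partition transform is emitted in the PARTITIONED BY clause rather than a single bucket field. The render_partitioned_by function and the dictionary layout are hypothetical, invented for illustration; this is not how Spark or Iceberg actually generate the statement.

```python
def render_partitioned_by(transforms):
    """Render a list of transform descriptions as an Iceberg-style clause.

    Each item is a hypothetical dict such as
    {"transform": "bucket", "column": "id", "num_buckets": 10}.
    """
    parts = []
    for t in transforms:
        kind = t["transform"]
        column = t["column"]
        if kind == "identity":
            parts.append(column)
        elif kind == "bucket":
            parts.append(f"bucket({t['num_buckets']}, {column})")
        else:
            # Other transforms are left out of this sketch.
            raise ValueError(f"unsupported transform in this sketch: {kind}")
    return "PARTITIONED BY (" + ", ".join(parts) + ")"


# The partition spec of the example table dbxx.tbxx.
example_spec = [
    {"transform": "identity", "column": "name"},
    {"transform": "bucket", "column": "name", "num_buckets": 10},
    {"transform": "bucket", "column": "id", "num_buckets": 10},
]
print(render_partitioned_by(example_spec))
```

For the example table this would keep both bucket fields visible, matching the DDL that was used to create it.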