databricks / iceberg-kafka-connect

Apache License 2.0
220 stars 49 forks source link

Weird partitioning keys break compaction job. #246

Open themoah opened 6 months ago

themoah commented 6 months ago

Hi,

I'm running Iceberg DL on s3 (with Glue), iceberg-kafka-connect version 0.6.5 running on MSK (both Kafka and Kafka connect run there). Table is partitioned by

PARTITIONED BY (tenant_key, month(timestamp)) 

And this is what I get in s3:

$aws s3 ls s3://*****-datalake-prd/iceberg/****.db/DB_NAME/data/
                           PRE --ed3A/
                           PRE -02JMg/
                           PRE -1YYGw/
                           PRE -1ZA3g/
                           PRE -3zWmA/
                           PRE -4nA1A/
                           PRE -514YA/
                           PRE -57eFQ/
                           PRE -6vp7g/
                           PRE -7Nd6g/
                           PRE -8F8Ow/
                           PRE -8Slgw/
                           PRE -8wpIQ/
                           PRE -9VIJw/
                           PRE -9ZBQw/
                           PRE -9Z__w/
                           PRE -9fWaw/
                           PRE -Ateag/
                           PRE -BP8WQ/
                           PRE -BakQg/
                           PRE -CMHhw/

Thousands of those. Inside each and everyone there is correct partitioning by tenant_key and then month. Connector configuration:

connector.class=io.tabular.iceberg.connect.IcebergSinkConnector
iceberg.control.group-id=tf-prd-use1-iceberg-TABLE_NAME
iceberg.catalog.catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog
iceberg.catalog.io-impl=org.apache.iceberg.aws.s3.S3FileIO
tasks.max=1
topics=KAFKA_TOPIC
iceberg.catalog.client.region=us-east-1
key.converter.schemas.enable=false
iceberg.tables=prd_use1_TABLE_NAME
value.converter.schemas.enable=false
iceberg.catalog.warehouse=s3a://*****-datalake-prd/iceberg/prd_use1_TABLE_NAME
value.converter=org.apache.kafka.connect.json.JsonConverter
key.converter=org.apache.kafka.connect.json.JsonConverter

So not only it breaks the compaction, but also querying is very slow. Any ideas where misconfiguration comes from ? Table was initiated via Athena.