apache / hudi

Upserts, Deletes And Incremental Processing on Big Data.
https://hudi.apache.org/
Apache License 2.0

[SUPPORT] Unable to query Partitioned COW Hudi tables with metadata enabled using Trino-Hudi Connector #7583

Open codope opened 1 year ago

codope commented 1 year ago

Describe the problem you faced

Original issue: https://github.com/trinodb/trino/issues/15368

Our team is testing the same on COPY ON WRITE Hudi (0.10.1) tables with metadata enabled, using Trino 400, and we are facing the following error while reading from partitioned tables: Caused by: java.lang.NoClassDefFoundError: Could not initialize class org.apache.hudi.common.bootstrap.index.HFileBootstrapIndex.

The issue was resolved by placing some dependencies on the classpath. Interestingly, those dependencies are already included in the hudi-trino-bundle. This issue tracks the gap in packaging.
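
Since "Could not initialize class" typically indicates that the class was found but its static initialization failed (often because a transitive dependency is missing), a reasonable first diagnostic step is to check which jars on the plugin path actually provide the class. Below is a minimal Python sketch for that check; the plugin directory path is an assumption and should be adjusted to the actual install location.

```python
# Diagnostic sketch: list which jars under the Trino Hudi plugin directory
# contain the class reported in the NoClassDefFoundError.
# The plugin directory below is an assumption (<trino_install_dir>/plugin/hudi).
import zipfile
from pathlib import Path

PLUGIN_DIR = Path("/usr/lib/trino/plugin/hudi")
TARGET = "org/apache/hudi/common/bootstrap/index/HFileBootstrapIndex.class"

for jar in sorted(PLUGIN_DIR.glob("*.jar")):
    with zipfile.ZipFile(jar) as zf:
        if TARGET in zf.namelist():
            print(f"{jar.name} contains {TARGET}")
```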

To Reproduce

Steps to reproduce the behavior:

  1. Write a Hudi COW table with the below properties and metadata enabled.
  2. Query the same table using the trino-hudi connector (properties mentioned below) with hudi.metadata-enabled=true.

Trino Hudi Connector Properties:

connector.name=hudi
hive.metastore.uri={METASTORE_URI}
hive.s3.iam-role={S3_IAM_ROLE}
hive.metastore-refresh-interval=2m
hive.metastore-timeout=3m
hudi.max-outstanding-splits=1800
hive.s3.max-error-retries=50
hive.s3.connect-timeout=1m
hive.s3.socket-timeout=2m
hudi.parquet.use-column-names=true
hudi.metadata-enabled=true
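
For completeness, here is a minimal sketch of issuing the failing query through the trino Python client against a catalog configured with the properties above; the host, user, schema, and table name are placeholders, not values from the original report.

```python
# Minimal query sketch using the `trino` Python client (pip install trino).
# Host, port, user, schema, and table name are placeholders.
import trino

conn = trino.dbapi.connect(
    host="trino-coordinator.example.com",
    port=8080,
    user="hudi-test",
    catalog="hudi",      # catalog backed by the connector properties above
    schema="default",
)
cur = conn.cursor()
# With hudi.metadata-enabled=true, reading a partitioned COW table hits the
# NoClassDefFoundError described above.
cur.execute("SELECT count(*) FROM my_partitioned_cow_table")
print(cur.fetchone())
```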

Hudi Properties set while writing:

hoodie.datasource.write.partitionpath.field = "insert_ds_ist",
hoodie.datasource.write.recordkey.field = "id",
hoodie.datasource.write.precombine.field = "_hoodie_incremental_key" (a self-generated column),
hoodie.datasource.write.hive_style_partitioning = "true",
hoodie.datasource.hive_sync.auto_create_database = "true",
hoodie.parquet.compression.codec = "gzip",
hoodie.table.name = "<table_name>",
hoodie.datasource.write.keygenerator.class = "org.apache.hudi.keygen.SimpleKeyGenerator",
hoodie.datasource.write.table.type = "COPY_ON_WRITE",
hoodie.metadata.enable = "true",
hoodie.datasource.hive_sync.enable = "true",
hoodie.datasource.hive_sync.partition_fields = "insert_ds_ist",
hoodie.datasource.hive_sync.partition_extractor_class = "org.apache.hudi.hive.MultiPartKeysValueExtractor"
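
For reference, a minimal PySpark sketch that applies the options above when writing the table; the DataFrame `df`, the table name, the S3 path, and the write mode are placeholders rather than values from the original report.

```python
# Minimal PySpark write sketch for the table described above.
# `df` is assumed to be an existing Spark DataFrame; table name and path are placeholders.
hudi_options = {
    "hoodie.table.name": "my_cow_table",
    "hoodie.datasource.write.table.type": "COPY_ON_WRITE",
    "hoodie.datasource.write.recordkey.field": "id",
    "hoodie.datasource.write.partitionpath.field": "insert_ds_ist",
    "hoodie.datasource.write.precombine.field": "_hoodie_incremental_key",
    "hoodie.datasource.write.keygenerator.class": "org.apache.hudi.keygen.SimpleKeyGenerator",
    "hoodie.datasource.write.hive_style_partitioning": "true",
    "hoodie.parquet.compression.codec": "gzip",
    "hoodie.metadata.enable": "true",
    "hoodie.datasource.hive_sync.enable": "true",
    "hoodie.datasource.hive_sync.auto_create_database": "true",
    "hoodie.datasource.hive_sync.partition_fields": "insert_ds_ist",
    "hoodie.datasource.hive_sync.partition_extractor_class": "org.apache.hudi.hive.MultiPartKeysValueExtractor",
}

(df.write.format("hudi")
   .options(**hudi_options)
   .mode("append")          # write mode is an assumption
   .save("s3://my-bucket/my_cow_table"))
```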

General information of the table:

  Total rows = 1,213,959,199
  Total partitions = 2400+
  Total file objects = 120,000
  Total size on S3 = 12~13 GB
  The table was upgraded from 0.9.0 to 0.10.1

Relevant Coordinator Logs: see Partitioned_COW_Hudi_Coordinator_logs.log under Stacktrace below.

Expected behavior

The query should work out of the box without having to manually place jars on the classpath.

Environment Description

Additional context


Stacktrace

Full stacktrace in Partitioned_COW_Hudi_Coordinator_logs.log

codope commented 1 year ago

The trino-hudi module adds hudi-common, hudi-hadoop-mr, and hudi-client-common individually. Instead, we should consider replacing these three dependencies with the hudi-trino-bundle.
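
A rough sketch of what that swap could look like in the trino-hudi module's pom.xml; this is illustrative only, and the version element and surrounding dependency management are assumptions.

```xml
<!-- Illustrative sketch: replace the individual hudi-common, hudi-hadoop-mr and
     hudi-client-common dependencies with the single bundle artifact. -->
<dependency>
    <groupId>org.apache.hudi</groupId>
    <artifactId>hudi-trino-bundle</artifactId>
    <version><!-- Hudi release version --></version>
</dependency>
```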

codope commented 1 year ago

The current workaround is to add the hudi-trino-bundle jar to the plugin path (<trino_install_dir>/plugin/hudi).