apache / hudi

Upserts, Deletes And Incremental Processing on Big Data.
https://hudi.apache.org/
Apache License 2.0

[SUPPORT] Unable to query Partitioned COW Hudi tables with metadata enabled using Trino-Hudi Connector #7583

Open codope opened 1 year ago

codope commented 1 year ago

Describe the problem you faced

Original issue: https://github.com/trinodb/trino/issues/15368

Our team is testing the same on COPY ON WRITE Hudi (0.10.1) tables with metadata enabled, using Trino 400, and we are facing the following error while reading from partitioned tables: Caused by: java.lang.NoClassDefFoundError: Could not initialize class org.apache.hudi.common.bootstrap.index.HFileBootstrapIndex.

The issue was resolved by placing some dependencies on the classpath. Interestingly, those dependencies are already included in the hudi-trino-bundle. This issue tracks the gap in packaging.
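
Since "Could not initialize class" typically indicates that the class was found but its static initialization failed (often because a transitive dependency is missing), a reasonable first diagnostic step is to check which jars on the plugin path actually provide the class. Below is a minimal Python sketch for that check; the plugin directory path is an assumption and should be adjusted to the actual install location.

```python
# Diagnostic sketch: list which jars under the Trino Hudi plugin directory
# contain the class reported in the NoClassDefFoundError.
# The plugin directory below is an assumption (<trino_install_dir>/plugin/hudi).
import zipfile
from pathlib import Path

PLUGIN_DIR = Path("/usr/lib/trino/plugin/hudi")
TARGET = "org/apache/hudi/common/bootstrap/index/HFileBootstrapIndex.class"

for jar in sorted(PLUGIN_DIR.glob("*.jar")):
    with zipfile.ZipFile(jar) as zf:
        if TARGET in zf.namelist():
            print(f"{jar.name} contains {TARGET}")
```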

To Reproduce

Steps to reproduce the behavior:

  1. Write a Hudi COW table with the below properties and metadata enabled.
  2. Query the same table using the trino-hudi connector (properties mentioned below) with hudi.metadata-enabled=true.

Trino Hudi Connector Properties:

connector.name=hudi
hive.metastore.uri={METASTORE_URI}
hive.s3.iam-role={S3_IAM_ROLE}
hive.metastore-refresh-interval=2m
hive.metastore-timeout=3m
hudi.max-outstanding-splits=1800
hive.s3.max-error-retries=50
hive.s3.connect-timeout=1m
hive.s3.socket-timeout=2m
hudi.parquet.use-column-names=true
hudi.metadata-enabled=true
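
For completeness, here is a minimal sketch of issuing the failing query through the trino Python client against a catalog configured with the properties above; the host, user, schema, and table name are placeholders, not values from the original report.

```python
# Minimal query sketch using the `trino` Python client (pip install trino).
# Host, port, user, schema, and table name are placeholders.
import trino

conn = trino.dbapi.connect(
    host="trino-coordinator.example.com",
    port=8080,
    user="hudi-test",
    catalog="hudi",      # catalog backed by the connector properties above
    schema="default",
)
cur = conn.cursor()
# With hudi.metadata-enabled=true, reading a partitioned COW table hits the
# NoClassDefFoundError described above.
cur.execute("SELECT count(*) FROM my_partitioned_cow_table")
print(cur.fetchone())
```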

Hudi Properties set while writing:

hoodie.datasource.write.partitionpath.field = "insert_ds_ist",
hoodie.datasource.write.recordkey.field = "id",
hoodie.datasource.write.precombine.field = "_hoodie_incremental_key" (a self-generated column),
hoodie.datasource.write.hive_style_partitioning = "true",
hoodie.datasource.hive_sync.auto_create_database = "true",
hoodie.parquet.compression.codec = "gzip",
hoodie.table.name = "<table_name>",
hoodie.datasource.write.keygenerator.class = "org.apache.hudi.keygen.SimpleKeyGenerator",
hoodie.datasource.write.table.type = "COPY_ON_WRITE",
hoodie.metadata.enable = "true",
hoodie.datasource.hive_sync.enable = "true",
hoodie.datasource.hive_sync.partition_fields = "insert_ds_ist",
hoodie.datasource.hive_sync.partition_extractor_class = "org.apache.hudi.hive.MultiPartKeysValueExtractor"
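
For reference, a minimal PySpark sketch that applies the options above when writing the table; the DataFrame `df`, the table name, the S3 path, and the write mode are placeholders rather than values from the original report.

```python
# Minimal PySpark write sketch for the table described above.
# `df` is assumed to be an existing Spark DataFrame; table name and path are placeholders.
hudi_options = {
    "hoodie.table.name": "my_cow_table",
    "hoodie.datasource.write.table.type": "COPY_ON_WRITE",
    "hoodie.datasource.write.recordkey.field": "id",
    "hoodie.datasource.write.partitionpath.field": "insert_ds_ist",
    "hoodie.datasource.write.precombine.field": "_hoodie_incremental_key",
    "hoodie.datasource.write.keygenerator.class": "org.apache.hudi.keygen.SimpleKeyGenerator",
    "hoodie.datasource.write.hive_style_partitioning": "true",
    "hoodie.parquet.compression.codec": "gzip",
    "hoodie.metadata.enable": "true",
    "hoodie.datasource.hive_sync.enable": "true",
    "hoodie.datasource.hive_sync.auto_create_database": "true",
    "hoodie.datasource.hive_sync.partition_fields": "insert_ds_ist",
    "hoodie.datasource.hive_sync.partition_extractor_class": "org.apache.hudi.hive.MultiPartKeysValueExtractor",
}

(df.write.format("hudi")
   .options(**hudi_options)
   .mode("append")          # write mode is an assumption
   .save("s3://my-bucket/my_cow_table"))
```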

General information of the table:

  Total rows = 1,213,959,199
  Total partitions = 2400+
  Total file objects = 120,000
  Total size on S3 = 12~13 GB
  The table was upgraded from 0.9.0 to 0.10.1

Relevant Coordinator Logs: see Partitioned_COW_Hudi_Coordinator_logs.log under Stacktrace below.

Expected behavior

The query should work out of the box without having to manually place jars on the classpath.

Environment Description

Additional context


Stacktrace

Full stacktrace in Partitioned_COW_Hudi_Coordinator_logs.log

codope commented 1 year ago

The trino-hudi module adds hudi-common, hudi-hadoop-mr, and hudi-client-common individually. Instead, we should consider replacing these three dependencies with the hudi-trino-bundle.
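
A rough sketch of what that swap could look like in the trino-hudi module's pom.xml; this is illustrative only, and the version element and surrounding dependency management are assumptions.

```xml
<!-- Illustrative sketch: replace the individual hudi-common, hudi-hadoop-mr and
     hudi-client-common dependencies with the single bundle artifact. -->
<dependency>
    <groupId>org.apache.hudi</groupId>
    <artifactId>hudi-trino-bundle</artifactId>
    <version><!-- Hudi release version --></version>
</dependency>
```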

codope commented 1 year ago

The current workaround is to add the hudi-trino-bundle jar to the plugin path (<trino_install_dir>/plugin/hudi).