apache / iceberg

Apache Iceberg
https://iceberg.apache.org/
Apache License 2.0

Bump `HiveCatalog` hive-metastore dependency to Hive 4 #10429

Open ochanism opened 5 months ago

ochanism commented 5 months ago

Query engine

No response

Question

https://iceberg.apache.org/docs/1.5.2/configuration/#hadoop-configuration

[Screenshot of the linked documentation section]

I've been implementing a data ingester with the Apache Iceberg 1.5.2 Java API, and I ran into an issue with garbage Hive locks when using a hive-metastore catalog. I'm going to try disabling the Hive lock as described in the documentation shown in the screenshot above. To do that, I deployed a hive-metastore 4.0.0 server and tried to update the catalog configs and dependencies.

# dependencies
org.apache.iceberg:iceberg-hive-metastore:1.5.2
org.apache.hive:hive-metastore:3.1.3

But iceberg-hive-metastore:1.5.2 couldn't be compiled with hive-metastore:4.0.0 (it only worked with 3.1.3). I confirmed that the data ingester, built with the above dependencies (3.1.3), works against a hive-metastore 4.0.0 server. I wonder if this setup is OK, or whether it could cause issues.

Fokko commented 5 months ago

Hey @ochanism

Thanks for reaching out. Hive 4.x supports Iceberg out of the box. Previously, an external Iceberg dependency was needed, but Hive 4+ ships with Iceberg directly. So the following should work:

create external table tbl_ice stored by iceberg tblproperties ('format-version'='2') as
select * from source;

ochanism commented 5 months ago

@Fokko Sorry for my ambiguous question. I'm using Trino as the query engine with a hive-metastore catalog. For data ingestion (streaming), I developed a Java server with the Iceberg 1.5.2 API. To eliminate the Hive lock, I upgraded hive-metastore from 3.1.3 to 4.0.0 and set iceberg.engine.hive.lock-enabled=false as a Hive catalog property (HiveCatalog class). My Java server still has this dependency: org.apache.hive:hive-metastore:3.1.3. So I wonder if this setup is OK. Could there be any errors due to the hive-metastore version mismatch between the client library (3.1.3) and the actual server (4.0.0)?

Fokko commented 5 months ago

@ochanism Thanks for clearing that up, that helps. Can you share the compilation error that you're seeing?

ochanism commented 5 months ago

@Fokko This error occurred while initializing hive catalog.

var catalog = new HiveCatalog();
catalog.initialize(this.catalogName, this.properties);

Caused by: java.lang.NoSuchFieldError: Class org.apache.hadoop.hive.conf.HiveConf$ConfVars does not have member field 'org.apache.hadoop.hive.conf.HiveConf$ConfVars METASTOREURIS'
    at org.apache.iceberg.hive.HiveCatalog.initialize(HiveCatalog.java:95)

# dependencies
org.apache.iceberg:iceberg-hive-metastore:1.5.2
org.apache.hive:hive-metastore:4.0.0

https://github.com/apache/iceberg/blob/252168419f8ac1d251b39f0944a189184056e543/hive-metastore/src/main/java/org/apache/iceberg/hive/HiveCatalog.java#L95
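
The trace above is a binary-compatibility failure: the Iceberg jar references a `HiveConf.ConfVars` constant named `METASTOREURIS`, and the Hive jar on the classpath does not declare it. As a rough sketch (not part of Iceberg or Hive), a small reflection check run at startup can report which spelling the classpath provides before the catalog is initialized. The class name `ClasspathFieldCheck` and the method `describeField` are hypothetical names chosen for this illustration; only standard Java reflection is used.

```java
public class ClasspathFieldCheck {

    /**
     * Reports whether className can be loaded and declares a public field
     * (enum constants are public static fields) named fieldName.
     */
    static String describeField(String className, String fieldName) {
        try {
            // initialize=false: load the class without running its static initializers
            Class<?> cls = Class.forName(
                className, false, ClasspathFieldCheck.class.getClassLoader());
            cls.getField(fieldName);
            return "present";
        } catch (ClassNotFoundException e) {
            return "class-missing";
        } catch (NoSuchFieldException e) {
            return "field-missing";
        } catch (LinkageError e) {
            // the class exists but one of its own dependencies could not be linked
            return "class-unloadable";
        }
    }

    public static void main(String[] args) {
        // Binary name of the nested enum, taken from the stack trace above
        String confVars = "org.apache.hadoop.hive.conf.HiveConf$ConfVars";
        for (String name : new String[] {"METASTOREURIS", "METASTORE_URIS"}) {
            System.out.println(name + ": " + describeField(confVars, name));
        }
    }
}
```

Without any Hive jar on the classpath, both lines report `class-missing`; with a Hive jar present, the output shows which of the two names that jar actually declares.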

Fokko commented 5 months ago

I see, the property has been updated since Hive 4: https://github.com/apache/hive/commit/b33b3d3454cc9c65a1879c68679f33f207f21c0e#diff-b7bbe8545a21ec7d7e9cfe40ef66444789e332996aaa9e7f1430dbe4822a2c9cR270

They suggest using the shaded dependency: https://github.com/apache/hive/pull/4919#issuecomment-2085197509

ochanism commented 5 months ago

Thanks for the information. Do you mean that Hive 4.0 with Iceberg is managed by the Hive community? I want to use the latest Iceberg version, but the shaded jar uses Iceberg 1.4.3. Is there any plan to update the Iceberg library to support the hive-metastore 4.0 catalog without the shaded jar?

Fokko commented 5 months ago

@ochanism The problem is that Hive is both a query engine and a metastore (a catalog, in Iceberg terms). The maintenance of the query engine (the support for reading and writing Iceberg) is now covered by the Hive community as of Hive 4. The catalog is still in the Iceberg codebase, and will probably migrate to Hive 4 at some point as well. But I think that will take some time.

There is also another discussion going on in parallel. Since Iceberg has its own catalog (the REST catalog), it might be that the REST catalog becomes the preferred catalog and the others become deprecated at some point. You could easily support a Hive catalog behind a REST catalog interface. Or, even better, Hive itself could provide a native REST catalog interface (https://github.com/apache/hive/pull/5145).

pvary commented 5 months ago

@ochanism: If you are willing to take some risks, you might be able to create your own catalog implementation based on https://github.com/apache/hive/blob/master/iceberg/iceberg-catalog/src/main/java/org/apache/iceberg/hive/HiveCatalog.java and the current Iceberg HiveCatalog implementation. It will not be supported by either community, but the code changes could be simple, like changing

    if (properties.containsKey(CatalogProperties.URI)) {
      this.conf.set(HiveConf.ConfVars.METASTOREURIS.varname, properties.get(CatalogProperties.URI));
    }

to

    if (properties.containsKey(CatalogProperties.URI)) {
      this.conf.set(HiveConf.ConfVars.METASTORE_URIS.varname, properties.get(CatalogProperties.URI));
    }

Note the added `_`: Hive 4 no longer has `METASTOREURIS`, which is what the NoSuchFieldError above reports.
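
Because the constant's spelling differs between Hive versions (with or without the underscore), a copied catalog could also resolve it by name at runtime instead of binding to one spelling at compile time. The following is only a sketch under that assumption. `ConfVarLookup`, `firstPresent`, and the two stand-in enums are hypothetical and exist only so the example runs without Hive on the classpath; in a real catalog the enum class passed in would be `HiveConf.ConfVars`, and `.varname` would then be read from the result.

```java
import java.util.Optional;

public class ConfVarLookup {

    /** Stand-in for an enum that uses the old spelling (hypothetical, demo only). */
    enum OldStyleVars { METASTOREURIS }

    /** Stand-in for an enum that uses the new spelling (hypothetical, demo only). */
    enum NewStyleVars { METASTORE_URIS }

    /**
     * Returns the first constant of enumClass whose name equals one of the
     * candidates, trying the candidates in the order given.
     */
    static <E extends Enum<E>> Optional<E> firstPresent(Class<E> enumClass, String... candidates) {
        for (String name : candidates) {
            for (E constant : enumClass.getEnumConstants()) {
                if (constant.name().equals(name)) {
                    return Optional.of(constant);
                }
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Prefer the new spelling, fall back to the old one
        System.out.println(firstPresent(OldStyleVars.class, "METASTORE_URIS", "METASTOREURIS"));
        System.out.println(firstPresent(NewStyleVars.class, "METASTORE_URIS", "METASTOREURIS"));
    }
}
```

This trades compile-time checking for tolerance of either jar version, so an empty result should be treated as a configuration error rather than silently ignored.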

ochanism commented 5 months ago

@Fokko Thanks for your kind explanation. I understand the current situation now. And the plan for unifying catalogs behind the REST catalog looks amazing. I hope that it will be available soon.

@pvary Thanks for your suggestion. I will try it and leave the result here after verifying it.

ochanism commented 5 months ago

@pvary I tried it, but many classes were private or package-private (default scope), so I had to copy a lot of class files in order to modify them. I decided to move to REST with the JDBC catalog, following @Fokko's point that REST will be the preferred catalog in the future. Thanks for helping me, guys!

pan3793 commented 5 months ago

HIVE-26882 and HIVE-28121 have landed in Hive 2.3.10. Although Hive 2.3 is EOL, this version is widely adopted, e.g. by Spark and Flink.