cannot insert value in hive command shell #2442

Closed junsionzhang closed 6 months ago

junsionzhang commented 3 years ago

After creating an Iceberg table in the Hive command shell successfully, I get errors when I try to insert values in the Hive shell. It seems YARN is missing a jar. What is the jar's name, and where should it be put?


java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.iceberg.mr.hive.HiveIcebergOutputCommitter not found
java.lang.RuntimeException: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.iceberg.mr.hive.HiveIcebergOutputCommitter not found
    at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2427)
    at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$2.call(MRAppMaster.java:545)
    at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$2.call(MRAppMaster.java:522)
    at org.apache.hadoop.mapreduce.v2.app.MRAppMaster.callWithJobClassLoader(MRAppMaster.java:1764)
    at org.apache.hadoop.mapreduce.v2.app.MRAppMaster.createOutputCommitter(MRAppMaster.java:522)
    at org.apache.hadoop.mapreduce.v2.app.MRAppMaster.serviceInit(MRAppMaster.java:308)
    at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
    at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$5.run(MRAppMaster.java:1722)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1893)
    at org.apache.hadoop.mapreduce.v2.app.MRAppMaster.initAndStartAppMaster(MRAppMaster.java:1719)
    at org.apache.hadoop.mapreduce.v2.app.MRAppMaster.main(MRAppMaster.java:1650)
Caused by: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.iceberg.mr.hive.HiveIcebergOutputCommitter not found
    at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2395)
    at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2419)
    ... 12 more
Caused by: java.lang.ClassNotFoundException: Class org.apache.iceberg.mr.hive.HiveIcebergOutputCommitter not found
    at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2299)
    at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2393)
    ... 13 more
RussellSpitzer commented 3 years ago

From the docs

Add the Iceberg Hive Runtime jar file to the Hive classpath
Regardless of the table type, the HiveIcebergStorageHandler and supporting classes need to be made available on Hive’s classpath. These are provided by the iceberg-hive-runtime jar file. For example, if using the Hive shell, this can be achieved by issuing a statement like so:

add jar /path/to/iceberg-hive-runtime.jar;
There are many other ways to achieve this, including adding the jar file to Hive's auxiliary classpath (so it is available by default); please refer to Hive's documentation for more information.

https://iceberg.apache.org/hive/

Runtime jars can be pulled from Maven if you like, or built locally.
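
For reference, the runtime jar that appears later in this thread corresponds to the following Maven Central coordinates (the 0.11.0 version is taken from the classpath output further down; substitute your own release):

    org.apache.iceberg:iceberg-hive-runtime:0.11.0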

junsionzhang commented 3 years ago

I have already added the jar to $HIVE_HOME/lib, but the problem still exists. Is it possible to update Iceberg table data from the Hive command line?

RussellSpitzer commented 3 years ago

So you added the runtime jar to your Hive home lib dir? Are you sure you added it there correctly and that it has the same checksum as the one from the link I pasted? If so, you could always turn on debugging in your Hive bin script by adding "set -x" to the beginning of it and check what classpath it builds, to make sure.

RussellSpitzer commented 3 years ago

You may also want to add the jar to "auxlib". The Hive script says SerDes should be placed there, although it just appends those jars to the classpath, so I think the lib directory should also work unless there is runtime classloader magic.

junsionzhang commented 3 years ago

I am sure the jar is in the right location, thank you @RussellSpitzer. When I create table XXX in the Flink client and run "describe formatted XXX" in the Hive shell, I can see table XXX there, but the storage information is wrong:

Storage Information
SerDe Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat: org.apache.hadoop.mapred.FileInputFormat
OutputFormat: org.apache.hadoop.mapred.FileOutputFormat
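
For comparison (a sketch of what a correctly registered table typically reports, not output from this thread): when the Iceberg storage handler is set up, describe formatted usually shows Iceberg's own classes in this section, e.g.:

    SerDe Library: org.apache.iceberg.mr.hive.HiveIcebergSerDe
    InputFormat: org.apache.iceberg.mr.hive.HiveIcebergInputFormat
    OutputFormat: org.apache.iceberg.mr.hive.HiveIcebergOutputFormat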

junsionzhang commented 3 years ago

I can see it; with "set -x" the Hive script builds this command line:

/opt/jdk/bin/java -Xmx256m -Djava.net.preferIPv4Stack=true -Dhadoop.log.dir=/opt/hadoop-2.9.2/logs -Dhadoop.log.file=hadoop.log -Dhadoop.home.dir=/opt/hadoop-2.9.2 -Dhadoop.id.str=root -Dhadoop.root.logger=INFO,console -Djava.library.path=/opt/hadoop-2.9.2/lib/native -Dhadoop.policy.file=hadoop-policy.xml -Djava.net.preferIPv4Stack=true -Dproc_hivecli -Dlog4j.configurationFile=hive-log4j2.properties -Djava.util.logging.config.file=/opt/hive/conf/parquet-logging.properties -Dhadoop.security.logger=INFO,NullAppender org.apache.hadoop.util.RunJar /opt/hive/lib/hive-cli-2.3.6.jar org.apache.hadoop.hive.cli.CliDriver --hiveconf hive.aux.jars.path=file:///opt/hive/auxlib/iceberg-hive-runtime-0.11.0.jar

RussellSpitzer commented 3 years ago

Tables created by other systems as "hive" tables, rather than "hadoop" tables, need a bit more setup:

The first step is to create an Iceberg table using the Spark/Java/Python API and HiveCatalog. For the purposes of this documentation we will assume that the table is called table_b and that the table location is s3://some_path/table_b. In order for Iceberg to correctly set up the Hive table for querying, some configuration values need to be set. The two options for this are described below; you can use one or the other depending on your use case.

Hive Configuration
The value iceberg.engine.hive.enabled needs to be set to true and added to the Hive configuration file on the classpath of the application creating or modifying (altering, inserting, etc.) the table. This can be done by modifying the relevant hive-site.xml. Alternatively, this can be done programmatically like so:

Configuration hadoopConfiguration = spark.sparkContext().hadoopConfiguration();
hadoopConfiguration.set(ConfigProperties.ENGINE_HIVE_ENABLED, "true"); //iceberg.engine.hive.enabled=true
HiveCatalog catalog = new HiveCatalog(hadoopConfiguration);
...
catalog.createTable(tableId, schema, spec);
Table Property Configuration
The property engine.hive.enabled needs to be set to true and added to the table properties when creating the Iceberg table. This can be done like so:

    Map<String, String> tableProperties = new HashMap<String, String>();
    tableProperties.put(TableProperties.ENGINE_HIVE_ENABLED, "true"); //engine.hive.enabled=true
    catalog.createTable(tableId, schema, spec, tableProperties);
pvary commented 3 years ago

As a first try I would stick to the add jar ... solution. Hive has plenty of moving parts, and it can be painful to add jars in other ways. OTOH, add jar ... should work, so I would stick with that; once that works, I would move forward and try the classpath approaches, which are usually deployment dependent.

The other issue, where the table created through Flink does not have the correct InputFormat / OutputFormat, looks like a separate problem where Flink does not set the engine.hive.enabled table property on table creation.
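
As an illustration, here is a minimal Java sketch (not from the thread) of setting that table property after the fact through the Iceberg API, assuming the table lives in a HiveCatalog and is reachable as db.table_x (a hypothetical identifier); whether the Hive Metastore storage descriptor gets rewritten on the next commit depends on the Iceberg version:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.iceberg.Table;
    import org.apache.iceberg.TableProperties;
    import org.apache.iceberg.catalog.TableIdentifier;
    import org.apache.iceberg.hive.HiveCatalog;

    // Load the Flink-created table from the Hive Metastore.
    HiveCatalog catalog = new HiveCatalog(new Configuration());
    Table table = catalog.loadTable(TableIdentifier.of("db", "table_x")); // hypothetical name
    // Set engine.hive.enabled=true on the existing table and commit.
    table.updateProperties()
        .set(TableProperties.ENGINE_HIVE_ENABLED, "true")
        .commit();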

vinitamaloo-asu commented 1 year ago

I created a new catalog "iceberg_catalog" using Spark config like below:

    .set("spark.sql.catalog.iceberg_catalog", "org.apache.iceberg.spark.SparkCatalog")
    .set("spark.sql.catalog.iceberg_catalog.type", "hive")

Now, to create Iceberg tables, I also initialized a HiveCatalog with the same catalog name and properties, which is redundant:

    val catalog = new HiveCatalog()
    catalog.setConf(conf)
    catalog.initialize(
      "iceberg_catalog",
      JavaConverters.mapAsJavaMap(Map(
        CatalogProperties.CATALOG_IMPL -> "org.apache.iceberg.hive.HiveCatalog",
        CatalogProperties.URI -> "thrift://localhost:9083",
        CatalogProperties.WAREHOUSE_LOCATION -> warehouseUri
      ))
    )

Is there a way to get the previously initialized catalog from the Spark conf?
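
One possible approach, sketched below in Java under the assumption that your Iceberg release ships org.apache.iceberg.spark.Spark3Util.loadIcebergCatalog: it unwraps the Iceberg Catalog that Spark already built from the spark.sql.catalog.iceberg_catalog settings, avoiding the redundant second initialization:

    import org.apache.iceberg.catalog.Catalog;
    import org.apache.iceberg.spark.Spark3Util;
    import org.apache.spark.sql.SparkSession;

    SparkSession spark = SparkSession.builder().getOrCreate();
    // Reuses the catalog Spark initialized from spark.sql.catalog.iceberg_catalog.*;
    // fails if that catalog is not an Iceberg-backed catalog.
    Catalog catalog = Spark3Util.loadIcebergCatalog(spark, "iceberg_catalog");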

github-actions[bot] commented 7 months ago

This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in the next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible.

github-actions[bot] commented 6 months ago

This issue has been closed because it has not received any activity in the last 14 days since being marked as 'stale'.