
Apache Iceberg
https://iceberg.apache.org/
Apache License 2.0

"Table does not exist at location" error when Hive beeline queries Iceberg tables with the Hive catalog created by a Spark job #2927

Closed: TangYan-1 closed this issue 2 months ago

TangYan-1 commented 3 years ago

I've enabled iceberg.engine.hive.enabled=true in hive-site.xml.

  1. In Spark SQL, create an Iceberg table with the Hive catalog:

     ```
     spark3-shell --jars iceberg-spark3-runtime-0.11.1.jar \
       --conf spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkSessionCatalog \
       --conf spark.sql.catalog.spark_catalog.type=hive

     spark.sql("CREATE TABLE perftest.iceberg_table_hive_createdbyspark2 (id bigint, data string) USING iceberg")
     ```

  2. In beeline: `select * from perftest.iceberg_table_hive_createdbyspark2;`

Errors:

```
Error while compiling statement: FAILED: SemanticException Table does not exist at location: hdfs://nameservice1/user/hive/warehouse/perftest.db/iceberg_table_hive_createdbyspark2
```


TangYan-1 commented 3 years ago

[Screenshot: Screen Shot 2021-08-03 at 4.29.50 PM]
TangYan-1 commented 3 years ago

My hive version is 2.1.1-cdh6.3.0

pvary commented 3 years ago

@TangYan-1: Based on the DESCRIBE TABLE output, I guess the problem is that you also have to set the catalog configuration in Hive, as a session config, before executing the query:

set iceberg.mr.catalog = hive;
TangYan-1 commented 3 years ago

@pvary Thanks for your quick reply. The query works after adding `set iceberg.mr.catalog = hive;`. But I found a new issue: when I use `SELECT * from iceberg_table_hive_createdbyspark2 where id='1'`, the query fails with the error below. Could you please share any clues?

```
java.lang.NoSuchMethodError: org.apache.hadoop.hive.ql.io.sarg.ConvertAstToSearchArg.create(Lorg/apache/hadoop/conf/Configuration;Lorg/apache/hadoop/hive/ql/plan/ExprNodeGenericFuncDesc;)Lorg/apache/hadoop/hive/ql/io/sarg/SearchArgument;
	at org.apache.iceberg.mr.hive.HiveIcebergInputFormat.getSplits(HiveIcebergInputFormat.java:59) ~[iceberg-hive-runtime-0.11.1.jar:?]
	at org.apache.hadoop.hive.ql.io.HiveInputFormat.addSplitsForGroup(HiveInputFormat.java:349) ~[hive-exec-2.1.1-cdh6.3.0.jar:2.1.1-cdh6.3.0]
```

pvary commented 3 years ago

ConvertAstToSearchArg does not have the method with the correct parameters available in your Hive version. https://issues.apache.org/jira/browse/HIVE-15269 introduced this parameter, and it is only available since Hive 2.2.0 😢
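A quick way to confirm this kind of version mismatch is a reflection probe for the exact method signature the `NoSuchMethodError` names. This is an illustrative sketch (the `hasMethod` helper is mine, and the demo probes a JDK class since the Hive jars aren't assumed on the classpath); on the cluster you would probe `org.apache.hadoop.hive.ql.io.sarg.ConvertAstToSearchArg` with the `(Configuration, ExprNodeGenericFuncDesc)` parameter types instead:

```java
public class MethodCheck {
    // Returns true if the named class exposes a public method with the given
    // name and parameter types. A NoSuchMethodError at runtime corresponds to
    // this returning false for the signature in the stack trace.
    static boolean hasMethod(String className, String methodName, Class<?>... params) {
        try {
            Class<?> cls = Class.forName(className);
            cls.getMethod(methodName, params);
            return true;
        } catch (ClassNotFoundException | NoSuchMethodException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Stand-in demo with a JDK class instead of the Hive class.
        System.out.println(hasMethod("java.lang.String", "valueOf", int.class)); // true
        System.out.println(hasMethod("java.lang.String", "noSuchThing"));        // false
    }
}
```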

pvary commented 3 years ago

The Apache Iceberg tests are run with Hive 2.3.8 and Hive 3.1.2, so you can try it out with those versions. We are also working on a CDH version.

TangYan-1 commented 3 years ago

Thanks, @pvary. It seems I can't find any info about `set iceberg.mr.catalog` in the official documentation. Could you share some more details about this configuration, such as its default value?

pvary commented 3 years ago

The documentation was updated for Iceberg 0.12, which is problematic since the released version is 0.11.1. I will ask around about this on the dev list.

pvary commented 3 years ago

@TangYan-1: @nastra suggested on the dev list that you can find the doc like this: https://github.com/apache/iceberg/blob/0.11.x/site/docs/hive.md

I hope this helps

TangYan-1 commented 3 years ago

@pvary Thanks for your guidance. I'm now trying to use Iceberg with the Hadoop catalog. I created an Iceberg table using a Spark job and created an overlay in Hive. When I use Presto/Trino to query that table, it shows the error below. Do you have any ideas?

  1. Create the table in spark3-shell:

     ```
     spark3-shell --jars iceberg-spark3-runtime-0.11.1.jar \
       --conf spark.sql.catalog.hadoop_catalog=org.apache.iceberg.spark.SparkCatalog \
       --conf spark.sql.catalog.hadoop_catalog.type=hadoop \
       --conf spark.sql.catalog.hadoop_catalog.warehouse=/iceberg/warehouse

     spark.sql("CREATE TABLE hadoop_catalog.testdb.iceberg_table_in_hdfs (report string, reporttime string, reporttype string, reportid string, reportversion bigint, day string) USING iceberg PARTITIONED BY (day)")
     ```

  2. Create the overlay in beeline:

     ```
     CREATE EXTERNAL TABLE testdb.iceberg_table_in_hdfs (report string, reporttime string, reporttype string, reportid string, reportversion bigint, day string)
     STORED BY 'org.apache.iceberg.mr.hive.HiveIcebergStorageHandler'
     LOCATION '/iceberg/warehouse/testdb/iceberg_table_in_hdfs'
     TBLPROPERTIES ('iceberg.mr.catalog'='hadoop', 'iceberg.mr.catalog.hadoop.warehouse.location'='/iceberg/warehouse');
     ```

  3. Query in Presto:

     ```
     presto:testdb> select * from iceberg_table_in_hdfs;
     Query 20210805_085251_00012_vs7zn failed: Table is missing [metadata_location] property: testdb.iceberg_table_in_hdfs
     io.prestosql.spi.PrestoException: Table is missing [metadata_location] property: testdb.iceberg_table_in_hdfs
     	at io.prestosql.plugin.iceberg.HiveTableOperations.refresh(HiveTableOperations.java:151)
     ```
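For context on the Presto error: Presto's Iceberg connector resolves a table through the `metadata_location` property stored on the Hive metastore table, while a HadoopCatalog table tracks its current metadata file on the filesystem instead, so the overlay table has no such property. A rough, self-contained sketch of the filesystem convention (a `metadata/version-hint.text` file pointing at `vN.metadata.json`; the `currentMetadataFile` helper name is mine):

```java
import java.nio.file.Files;
import java.nio.file.Path;

public class MetadataLocation {
    // Resolves the current metadata file of a Hadoop-catalog-style table:
    // metadata/version-hint.text holds the latest version number N, and the
    // table metadata itself lives at metadata/vN.metadata.json.
    static Path currentMetadataFile(Path tableDir) throws Exception {
        String version = Files.readString(tableDir.resolve("metadata/version-hint.text")).trim();
        return tableDir.resolve("metadata/v" + version + ".metadata.json");
    }

    public static void main(String[] args) throws Exception {
        // Simulate a table directory in a temp folder.
        Path table = Files.createTempDirectory("iceberg_table_in_hdfs");
        Files.createDirectories(table.resolve("metadata"));
        Files.writeString(table.resolve("metadata/version-hint.text"), "3");
        System.out.println(currentMetadataFile(table).getFileName()); // prints v3.metadata.json
    }
}
```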
pvary commented 3 years ago

@TangYan-1: I guess this is now a Presto question. I would guess that Presto is trying to handle the table as a HadoopTable instead of going through the HadoopCatalog, so again I suspect a catalog configuration issue. Sadly, I am not sure how Presto's catalog handling works. What about trying to set the following properties in the Presto session?

```
'iceberg.mr.catalog'='hadoop',
'iceberg.mr.catalog.hadoop.warehouse.location'='/iceberg/warehouse'
```
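To illustrate why these properties matter: my understanding from this thread is that `iceberg.mr.catalog` selects which catalog implementation loads the table, and when it is unset the table location tends to be treated as a standalone Hadoop table, so the same path is interpreted differently. A hypothetical sketch of that resolution logic (the `resolve` helper and its return strings are illustrative, not the real Iceberg `Catalogs` API):

```java
import java.util.Map;

public class CatalogPick {
    // Hypothetical catalog resolution: "hive" and "hadoop" select the
    // HiveCatalog and HadoopCatalog implementations; with the property unset,
    // the table location is loaded as a standalone Hadoop table, which is why
    // a missing setting changes how the same location is interpreted.
    static String resolve(Map<String, String> props) {
        String catalog = props.get("iceberg.mr.catalog");
        if (catalog == null) {
            return "hadoop-tables (location-based fallback)";
        }
        switch (catalog) {
            case "hive":   return "HiveCatalog";
            case "hadoop": return "HadoopCatalog";
            default:       return "custom: " + catalog;
        }
    }

    public static void main(String[] args) {
        System.out.println(resolve(Map.of()));                               // fallback path
        System.out.println(resolve(Map.of("iceberg.mr.catalog", "hadoop"))); // HadoopCatalog
    }
}
```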
TangYan-1 commented 3 years ago

@pvary Thanks, I'll check in presto side. Here I have two more issues about iceberg.

  1. For `set iceberg.mr.catalog=hive`, how can I unset it in the Hive session? It seems that once I set it to hive, the Iceberg overlay table in Hive can't be queried any more.

  2. NullPointerException in a beeline aggregate query on the Hadoop catalog Iceberg tables:

     1) Create the table using a Spark job and load data:

        ```
        spark3-shell --jars iceberg-spark3-runtime-0.11.1.jar \
          --conf spark.sql.catalog.hadoop_catalog=org.apache.iceberg.spark.SparkCatalog \
          --conf spark.sql.catalog.hadoop_catalog.type=hadoop \
          --conf spark.sql.catalog.hadoop_catalog.warehouse=/iceberg/warehouse

        spark.sql("CREATE TABLE hadoop_catalog.testdb.iceberg_table_in_hdfs (report string, reporttime string, reporttype string, reportid string, reportversion bigint, day string) USING iceberg PARTITIONED BY (day)")
        ```

     2) Create the overlay in hive beeline:

        ```
        CREATE EXTERNAL TABLE testdb.iceberg_table_in_hdfs (report string, reporttime string, reporttype string, reportid string, reportversion bigint, day string)
        STORED BY 'org.apache.iceberg.mr.hive.HiveIcebergStorageHandler'
        LOCATION '/iceberg/warehouse/testdb/iceberg_table_in_hdfs'
        TBLPROPERTIES ('iceberg.mr.catalog'='hadoop', 'iceberg.mr.catalog.hadoop.warehouse.location'='/iceberg/warehouse');
        ```

     3) Query from hive beeline (`select count(*) from testdb.iceberg_table_in_hdfs`):

        ```
        Error: java.io.IOException: java.lang.NullPointerException
        	at org.apache.hadoop.hive.io.HiveIOExceptionHandlerChain.handleRecordReaderCreationException(HiveIOExceptionHandlerChain.java:97)
        ```

pvary commented 3 years ago

@TangYan-1:

FYI: I will have a busy day today, and after that I will be offline for 2 weeks, so I might not be able to come back to this issue for a while.

TangYan-1 commented 3 years ago

The full stack is below. Iceberg version: 0.11.1. Hive version: 2.1.1-cdh6.3.0.

```
2021-08-06 05:27:31,014 WARN [main] org.apache.hadoop.mapred.YarnChild: Exception running child : java.io.IOException: java.lang.NullPointerException
	at org.apache.hadoop.hive.io.HiveIOExceptionHandlerChain.handleRecordReaderCreationException(HiveIOExceptionHandlerChain.java:97)
	at org.apache.hadoop.hive.io.HiveIOExceptionHandlerUtil.handleRecordReaderCreationException(HiveIOExceptionHandlerUtil.java:57)
	at org.apache.hadoop.hive.ql.io.HiveInputFormat.getRecordReader(HiveInputFormat.java:298)
	at org.apache.hadoop.hive.ql.io.CombineHiveInputFormat.getRecordReader(CombineHiveInputFormat.java:685)
	at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.<init>(MapTask.java:175)
	at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:444)
	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:349)
	at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:174)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1875)
	at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:168)
Caused by: java.lang.NullPointerException
	at org.apache.iceberg.data.DeleteFilter.<init>(DeleteFilter.java:90)
	at org.apache.iceberg.data.GenericDeleteFilter.<init>(GenericDeleteFilter.java:33)
	at org.apache.iceberg.mr.mapreduce.IcebergInputFormat$IcebergRecordReader.open(IcebergInputFormat.java:275)
	at org.apache.iceberg.mr.mapreduce.IcebergInputFormat$IcebergRecordReader.initialize(IcebergInputFormat.java:197)
	at org.apache.iceberg.mr.mapred.MapredIcebergInputFormat$MapredIcebergRecordReader.<init>(MapredIcebergInputFormat.java:92)
	at org.apache.iceberg.mr.mapred.MapredIcebergInputFormat.getRecordReader(MapredIcebergInputFormat.java:78)
	at org.apache.iceberg.mr.hive.HiveIcebergInputFormat.getRecordReader(HiveIcebergInputFormat.java:82)
	at org.apache.hadoop.hive.ql.io.HiveInputFormat.getRecordReader(HiveInputFormat.java:295)
	... 9 more
```

Thanks for your reply. @pvary

pvary commented 3 years ago

The issue is here: https://github.com/apache/iceberg/blob/apache-iceberg-0.11.1/data/src/main/java/org/apache/iceberg/data/DeleteFilter.java#L90

Maybe try selecting specific columns instead of *?

Just a guess

TangYan-1 commented 3 years ago

I have tried count(col), but no luck; the same error.

pvary commented 3 years ago

What about `select col from testdb.iceberg_table_in_hdfs limit 10`, just for testing? I have seen count working for our CDPD versions, and we also have unit tests which run against the 2.3 and 3.1 Hive versions. Maybe some other fixes in Hive are missing in the CDH version?

TangYan-1 commented 3 years ago

Yes, `select col from testdb.iceberg_table_in_hdfs limit 10` passes. It seems that using an aggregate function in the select fails.

github-actions[bot] commented 2 months ago

This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in the next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the 'not-stale' label, but commenting on the issue is preferred when possible.

github-actions[bot] commented 2 months ago

This issue has been closed because it has not received any activity in the last 14 days since being marked as 'stale'.