apache / iceberg

Apache Iceberg
https://iceberg.apache.org/
Apache License 2.0

Hive: partitioning is not working #9329

Open bluzy opened 11 months ago

bluzy commented 11 months ago

Apache Iceberg version

1.3.1

Query engine

Hive

Please describe the bug 🐞

I have a question about querying a partitioned table in Hive.

I have an hourly partitioned table with a timestamp column. When I query the table, I get an out-of-memory (OOM) error.

SELECT count(1) FROM pdd__db_pbp.iceberg__raw_place_business_detail_v2
WHERE pdp.partition_timestamp BETWEEN "2023-12-13 14:00:00" AND "2023-12-13 14:10:00";
java.lang.OutOfMemoryError: Java heap space
  at com.google.protobuf.ByteString$CodedBuilder.<init>(ByteString.java:907)
  at com.google.protobuf.ByteString$CodedBuilder.<init>(ByteString.java:902)
  at com.google.protobuf.ByteString.newCodedBuilder(ByteString.java:898)
  at com.google.protobuf.AbstractMessageLite.toByteString(AbstractMessageLite.java:49)
  at org.apache.hadoop.hive.ql.exec.tez.HiveSplitGenerator.createEventList(HiveSplitGenerator.java:357)
  at org.apache.hadoop.hive.ql.exec.tez.HiveSplitGenerator.initialize(HiveSplitGenerator.java:316)
  at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable$1.run(RootInputInitializerManager.java:281)
  at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable$1.run(RootInputInitializerManager.java:272)
  at java.security.AccessController.doPrivileged(Native Method)
  at javax.security.auth.Subject.doAs(Subject.java:422)
  at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1729)
  at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable.call(RootInputInitializerManager.java:272)
  at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable.call(RootInputInitializerManager.java:256)
  at com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:108)
  at com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:41)
  at com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:77)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
  at java.lang.Thread.run(Thread.java:745)

The range is small: the data files in that range total only about 200 MB. So I suspect that partition pruning is not working and a full table scan is occurring.
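To make the expectation concrete, here is a minimal sketch of the arithmetic, assuming the table uses Iceberg's `hour` partition transform (whole hours since the Unix epoch) and that the session time zone is Asia/Seoul, as the query plan suggests. The helper name `hour_ordinal` is made up for illustration. Under those assumptions both bounds of the `BETWEEN` fall into the same hourly partition, so a working pruning step should only need to read one partition.

```python
from datetime import datetime, timedelta, timezone

# Assumption: session time zone is Asia/Seoul (UTC+9, no DST), per the query plan.
KST = timezone(timedelta(hours=9))
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def hour_ordinal(ts):
    """Whole hours since the Unix epoch (hypothetical helper mirroring an hour transform)."""
    return int((ts - EPOCH).total_seconds() // 3600)


# Bounds taken from the WHERE clause of the query.
lower = datetime(2023, 12, 13, 14, 0, 0, tzinfo=KST)
upper = datetime(2023, 12, 13, 14, 10, 0, tzinfo=KST)

# Every hourly partition that the closed range [lower, upper] touches.
partitions = set(range(hour_ordinal(lower), hour_ordinal(upper) + 1))
print(len(partitions))
```

If the number of partitions actually scanned is far larger than this, the filter is probably not reaching the Iceberg scan planner.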

The following warning in the Hive log seems to be related:

2023-12-15 12:00:04,193 [WARN] [InputInitializer {Map 1} #0] |hive.HiveIcebergInputFormat|: Unable to create Iceberg filter, continuing without filter (will be applied by Hive later): 
java.lang.UnsupportedOperationException: CONSTANT operator is not supported
    at org.apache.iceberg.mr.hive.HiveIcebergFilterFactory.translate(HiveIcebergFilterFactory.java:87)
    at org.apache.iceberg.mr.hive.HiveIcebergFilterFactory.generateFilterExpression(HiveIcebergFilterFactory.java:53)
    at org.apache.iceberg.mr.hive.HiveIcebergInputFormat.getSplits(HiveIcebergInputFormat.java:90)
    at org.apache.hadoop.hive.ql.io.HiveInputFormat.addSplitsForGroup(HiveInputFormat.java:524)
    at org.apache.hadoop.hive.ql.io.HiveInputFormat.getSplits(HiveInputFormat.java:779)
    at org.apache.hadoop.hive.ql.exec.tez.HiveSplitGenerator.initialize(HiveSplitGenerator.java:243)
    at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable$1.run(RootInputInitializerManager.java:281)
    at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable$1.run(RootInputInitializerManager.java:272)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1729)
    at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable.call(RootInputInitializerManager.java:272)
    at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable.call(RootInputInitializerManager.java:256)
    at com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:108)
    at com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:41)
    at com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:77)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
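The warning above says the Iceberg filter could not be created and that planning continues without it. The toy model below is not Iceberg's code; it only illustrates, with hypothetical names (`DataFile`, `translate`, `plan_files`), why such a fallback would turn a one-partition query into a scan over every file: when translation fails, nothing is left to prune with at split-generation time.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataFile:
    path: str  # hypothetical file name
    hour: int  # hour-of-day partition value, simplified for the sketch


class UnsupportedPredicate(Exception):
    pass


def translate(predicate):
    """Turn a (kind, lo, hi) tuple into a file filter; only 'between' is handled."""
    kind, lo, hi = predicate
    if kind != "between":
        raise UnsupportedPredicate(f"{kind} operator is not supported")
    return lambda f: lo <= f.hour <= hi


def plan_files(files, predicate):
    """Prune files with the translated filter, or fall back to all files."""
    try:
        keep = translate(predicate)
    except UnsupportedPredicate:
        # Mirrors the logged behaviour: continue without a filter.
        return list(files)
    return [f for f in files if keep(f)]


files = [DataFile(f"hour={h}.parquet", h) for h in range(24)]
pruned = plan_files(files, ("between", 14, 14))
unpruned = plan_files(files, ("constant", None, None))
print(len(pruned), len(unpruned))
```

In this model the supported predicate plans a single file, while the unsupported one plans all 24, which would be consistent with split generation running out of memory on a large table.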

And here is the query plan:

+----------------------------------------------------+
|                      Explain                       |
+----------------------------------------------------+
| Plan optimized by CBO.                             |
|                                                    |
| Vertex dependency in root stage                    |
| Reducer 2 <- Map 1 (CUSTOM_SIMPLE_EDGE)            |
|                                                    |
| Stage-0                                            |
|   Fetch Operator                                   |
|     limit:-1                                       |
|     Stage-1                                        |
|       Reducer 2                                    |
|       File Output Operator [FS_7]                  |
|         Group By Operator [GBY_5] (rows=1 width=1656) |
|           Output:["_col0"],aggregations:["count(VALUE._col0)"] |
|         <-Map 1 [CUSTOM_SIMPLE_EDGE]               |
|           PARTITION_ONLY_SHUFFLE [RS_4]            |
|             Group By Operator [GBY_3] (rows=1 width=1656) |
|               Output:["_col0"],aggregations:["count()"] |
|               Select Operator [SEL_2] (rows=932948 width=1565) |
|                 Filter Operator [FIL_8] (rows=932948 width=1565) |
|                   predicate:pdp.partition_timestamp BETWEEN TIMESTAMPLOCALTZ'2023-12-13 14:00:00.0 Asia/Seoul' AND TIMESTAMPLOCALTZ'2023-12-13 14:10:00.0 Asia/Seoul' |
|                   TableScan [TS_0] (rows=8396537 width=1565) |
|                     pdd__db_pbp@iceberg__raw_place_business_detail_v2,iceberg__raw_place_business_detail_v2,Tbl:COMPLETE,Col:NONE,Output:["pdp"] |
|                                                    |
+----------------------------------------------------+

I'm using these versions:

Hadoop 3.1.2, Hive 3.1.0, Tez 0.10.1

bluzy commented 9 months ago

I suspect the nested column is the cause. When I tested with a top-level (depth-1) partition column, the problem did not occur.

github-actions[bot] commented 1 month ago

This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible.