apache / iceberg

Apache Iceberg
https://iceberg.apache.org/
Apache License 2.0
6.15k stars 2.14k forks source link

Apache hive 3 with Tez engine select table no empty #8891

Open anvanna opened 10 months ago

anvanna commented 10 months ago

I am testing iceberg table as below with hive version 3.x and Tez version 0.9.x

SET iceberg.catalog.hadoop_cat.type=hadoop; SET iceberg.catalog.hadoop_cat.warehouse=hdfs://hadoop-master:8020/hadoop_cat; SET iceberg.engine.hive.enabled=true SET hive.vectorized.execution.enabled=false;

CREATE TABLE table_a ( id bigint, name string ) PARTITIONED BY ( dept string ) STORED BY 'org.apache.iceberg.mr.hive.HiveIcebergStorageHandler' TBLPROPERTIES ('iceberg.catalog'='hadoop_cat');

insert into table_a values (1000, 'ABC');

0: jdbc:hive2://hadoop-master:2> select from table_a; INFO : Compiling command(queryId=hive_20231020173641_a356d662-46fa-4503-893d-99b3ef2f96e2): select from table_a INFO : Semantic Analysis Completed (retrial = false) INFO : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:table_g.id, type:bigint, comment:null), FieldSchema(name:table_g.name, type:string, comment:null)], properties:null) INFO : Completed compiling command(queryId=hive_20231020173641_a356d662-46fa-4503-893d-99b3ef2f96e2); Time taken: 3.292 seconds INFO : Executing command(queryId=hive_20231020173641_a356d662-46fa-4503-893d-99b3ef2f96e2): select * from table_a INFO : Completed executing command(queryId=hive_20231020173641_a356d662-46fa-4503-893d-99b3ef2f96e2); Time taken: 0.0 seconds INFO : OK +-------------+---------------+ | table_a.id | table_a.name | +-------------+---------------+ +-------------+---------------+ No rows selected (3.477 seconds)

Why no result return?

pvary commented 10 months ago

@anvanna: Please read the relevant parts of the documentation about the supported Hive versions/features on the following page: https://iceberg.apache.org/docs/latest/hive/

I would guess, that the issue is with writing, and not with reading.

I would suggest, to try out the new Hive 4, which has backed in Iceberg compatibility

anvanna commented 10 months ago

@pvary i think the insert is ok as below:

0: jdbc:hive2://hadoop-maste:2> insert into table_g values (300, 'iceberg'); INFO : Compiling command(queryId=hive_20231022124803_fac475fe-2d99-408f-841a-1e7e0ba9558b): insert into table_g values (300, 'iceberg') INFO : Semantic Analysis Completed (retrial = false) INFO : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:_col0, type:bigint, comment:null), FieldSchema(name:_col1, type:string, comment:null)], properties:null) INFO : Completed compiling command(queryId=hive_20231022124803_fac475fe-2d99-408f-841a-1e7e0ba9558b); Time taken: 1.609 seconds INFO : Executing command(queryId=hive_20231022124803_fac475fe-2d99-408f-841a-1e7e0ba9558b): insert into table_g values (300, 'iceberg') INFO : Query ID = hive_20231022124803_fac475fe-2d99-408f-841a-1e7e0ba9558b INFO : Total jobs = 1 INFO : Starting task [Stage-0:DDL] in serial mode INFO : Starting task [Stage-1:DDL] in serial mode INFO : Launching Job 1 out of 1 INFO : Starting task [Stage-2:MAPRED] in serial mode INFO : Subscribed to counters: [] for queryId: hive_20231022124803_fac475fe-2d99-408f-841a-1e7e0ba9558b INFO : Tez session hasn't been created yet. Opening session INFO : Dag name: insert into table_g values (300, 'iceberg') (Stage-2) INFO : Status: Running (Executing on YARN cluster with App id application_1697615344511_0012)


    VERTICES      MODE        STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED  

Map 1 .......... container SUCCEEDED 1 1 0 0 0 0

VERTICES: 01/01 [==========================>>] 100% ELAPSED TIME: 4.55 s

INFO : Status: DAG finished successfully in 4.50 seconds INFO : INFO : Query Execution Summary INFO : ---------------------------------------------------------------------------------------------- INFO : OPERATION DURATION INFO : ---------------------------------------------------------------------------------------------- INFO : Compile Query 1.61s INFO : Prepare Plan 3.99s INFO : Get Query Coordinator (AM) 0.00s INFO : Submit Plan 0.49s INFO : Start DAG 0.58s INFO : Run DAG 4.50s INFO : ---------------------------------------------------------------------------------------------- INFO : INFO : Task Execution Summary INFO : ---------------------------------------------------------------------------------------------- INFO : VERTICES DURATION(ms) CPU_TIME(ms) GC_TIME(ms) INPUT_RECORDS OUTPUT_RECORDS INFO : ---------------------------------------------------------------------------------------------- INFO : Map 1 2479.00 7,560 129 3 0 INFO : ---------------------------------------------------------------------------------------------- INFO : INFO : org.apache.tez.common.counters.DAGCounter: INFO : NUM_SUCCEEDED_TASKS: 1 INFO : TOTAL_LAUNCHED_TASKS: 1 INFO : DATA_LOCAL_TASKS: 1 INFO : AM_CPU_MILLISECONDS: 2580 INFO : AM_GC_TIME_MILLIS: 21 INFO : File System Counters: INFO : HDFS_BYTES_WRITTEN: 701 INFO : HDFS_WRITE_OPS: 1 INFO : HDFS_OP_CREATE: 1 INFO : org.apache.tez.common.counters.TaskCounter: INFO : GC_TIME_MILLIS: 129 INFO : TASK_DURATION_MILLIS: 2602 INFO : CPU_MILLISECONDS: 7560 INFO : PHYSICAL_MEMORY_BYTES: 859832320 INFO : VIRTUAL_MEMORY_BYTES: 2864087040 INFO : COMMITTED_HEAP_BYTES: 859832320 INFO : INPUT_RECORDS_PROCESSED: 4 INFO : INPUT_SPLIT_LENGTH_BYTES: 1 INFO : OUTPUT_RECORDS: 0 INFO : HIVE: INFO : CREATED_FILES: 1 INFO : DESERIALIZE_ERRORS: 0 INFO : RECORDS_IN_Map_1: 3 INFO : RECORDS_OUT_1_default.table_g: 1 INFO : RECORDS_OUT_INTERMEDIATE_Map_1: 0 INFO : RECORDS_OUT_OPERATOR_FS_5: 1 INFO : RECORDS_OUT_OPERATOR_MAP_0: 0 INFO : RECORDS_OUT_OPERATOR_SEL_1: 1 INFO : RECORDS_OUT_OPERATOR_SEL_3: 1 INFO : RECORDS_OUT_OPERATOR_TS_0: 1 INFO : RECORDS_OUT_OPERATOR_UDTF_2: 1 INFO : TaskCounter_Map_1_INPUT__dummy_table: INFO : INPUT_RECORDS_PROCESSED: 4 INFO : INPUT_SPLIT_LENGTH_BYTES: 1 INFO : TaskCounter_Map_1_OUTPUT_out_Map_1: INFO : OUTPUT_RECORDS: 0 INFO : Starting task [Stage-4:DDL] in serial mode INFO : Completed executing command(queryId=hive_20231022124803_fac475fe-2d99-408f-841a-1e7e0ba9558b); Time taken: 9.566 seconds INFO : OK No rows affected (11.23 seconds)

  1. and the parquet files & .metadata.json was created

[hdfs@hadoop-master ~]$ hdfs dfs -ls /user/hive/warehouse/default/table_g/dataFound 3 items-rw-r--r-- 3 hive hdfs 673 2023-10-20 16:56 /user/hive/warehouse/default/table_g/data/00000-0-hive_20231020165640_da5c38e9-b556-4e7f-b125-156e0d0b5679-job_1697615344511_0005-00001.parquet-rw-r--r-- 3 hive hdfs 673 2023-10-20 17:38 /user/hive/warehouse/default/table_g/data/00000-0-hive_20231020173840_17952753-a875-42a5-b572-979c08249888-job_1697615344511_0011-00001.parquet-rw-r--r-- 3 hive hdfs 701 2023-10-22 12:48 /user/hive/warehouse/default/table_g/data/00000-0-hive_20231022124803_fac475fe-2d99-408f-841a-1e7e0ba9558b-job_1697615344511_0012-00001.parquet[hdfs@hadoop-master ~]$ hdfs dfs -ls /user/hive/warehouse/default/table_g/metadataFound 1 items-rw-r--r-- 3 hive hdfs 1357 2023-10-20 16:55 /user/hive/warehouse/default/table_g/metadata/00000-c64d7b49-d2c6-4cf5-b4b1-0d75a3fb98c6.metadata.json

BsoBird commented 10 months ago

@anvanna hi. When you test HIVE to read ICEBERG, do you encounter the problem that the performance is too poor and the JOIN condition cannot be pushed down to the partition of ICEBERG table? https://apache-iceberg.slack.com/archives/C025PH0G1D4/p1695050248606629

anvanna commented 10 months ago

@BsoBird no, just tested query after insert some rows to iceberg table, but no result return

pvary commented 10 months ago

@anvanna: Are you able to read the data from the Iceberg table with another tool? The issue with writing with Tez, that the newly created files are not propagated to the HS2 and they are not added to the commit. So it is absolutely possible that the files are there, and a new Iceberg snapshot is created, but this new snapshot does not contain the new files.

I would like to reiterate that the Tez writes are not supported. Could you try using MR for the insert and then read with Tez?