
Apache Iceberg
https://iceberg.apache.org/
Apache License 2.0

hive shell can not read iceberg table created by flink sql shell ? #3092

Closed deepeye closed 1 month ago

deepeye commented 2 years ago

Start the embedded SQL client shell: ./bin/sql-client.sh embedded

Create the Hive catalog:

Flink SQL> CREATE CATALOG hive_catalog WITH (
  'type'='iceberg',
  'catalog-type'='hive',
  'uri'='thrift://localhost:9083',
  'clients'='5',
  'property-version'='1',
  'io-impl'='org.apache.iceberg.aliyun.oss.OSSFileIO',
  'oss.endpoint'='{endpoint}',
  'oss.access.key.id'='{key.id}',
  'oss.access.key.secret'='{key.secret}',
  'warehouse'='oss://{bucket}/iceberg/warehouse'
);

Flink SQL> USE CATALOG hive_catalog;

Flink SQL> CREATE DATABASE oss_db;

Flink SQL> USE oss_db;

Flink SQL> CREATE TABLE sample (
  id BIGINT COMMENT 'unique id',
  data STRING
);

Flink SQL> INSERT INTO sample VALUES (1, 'a'), (2, 'b'), (3, 'c');

Flink SQL> SELECT * FROM sample;

Flink SQL can read data from sample, but when I switch to the Hive shell, the table reads as empty. Why?

This is the Hive external table:

CREATE EXTERNAL TABLE hive_db.sample (
  id bigint,
  data string)
STORED BY 'org.apache.iceberg.mr.hive.HiveIcebergStorageHandler'
LOCATION 'oss://{bucket}/iceberg/warehouse/oss_db.db/sample'
TBLPROPERTIES (
  'metadata_location'='oss://{bucket}/iceberg/warehouse/oss_db.db/sample/metadata/00003-add0c7f9-a5b6-4de4-be76-19481d0420d0.metadata.json',
  'table_type'='ICEBERG',
  'iceberg.catalog'='hive_catalog');

pvary commented 2 years ago

@deepeye: Could you please provide the following information:

  • Iceberg version
  • Hive version
  • Query you are using for testing
  • Execution engine (MR/Tez)
  • Output of the SHOW CREATE TABLE command

Thanks, Peter

deepeye commented 2 years ago

@deepeye: Could you please provide the following information:

  • Iceberg version
  • Hive version
  • Query you are using for testing
  • Execution engine (MR/Tez)
  • Output of the SHOW CREATE TABLE command

Thanks, Peter

  • Iceberg version: master (0.12) + Aliyun OSS
  • Hive version: 2.3.6
  • Query: SELECT * FROM sample;
  • Execution engine: MR

hive (oss_db)> show create table sample;
OK
createtab_stmt
CREATE EXTERNAL TABLE `sample`(
  `id` bigint,
  `data` string)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
STORED AS INPUTFORMAT
  'org.apache.hadoop.mapred.FileInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.mapred.FileOutputFormat'
LOCATION
  'oss://{bucket}/iceberg/warehouse/oss_db.db/sample'
TBLPROPERTIES (
  'metadata_location'='oss://{bucket}/iceberg/warehouse/oss_db.db/sample/metadata/00004-1791baa2-2703-40f2-9fda-fdf7442927cf.metadata.json',
  'previous_metadata_location'='oss://{bucket}/iceberg/warehouse/oss_db.db/sample/metadata/00003-add0c7f9-a5b6-4de4-be76-19481d0420d0.metadata.json',
  'table_type'='ICEBERG',
  'transient_lastDdlTime'='1629797472')
Time taken: 0.761 seconds, Fetched: 16 row(s)
pvary commented 2 years ago

@deepeye: The problem for Hive is that the SERDE / INPUTFORMAT / OUTPUTFORMAT are not correct. They should be the HiveIceberg*** classes for all of them. I think the problem is that engine.hive.enabled is not set for the table.
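
For comparison, when the Hive integration is registered correctly, SHOW CREATE TABLE should report the Iceberg classes from org.apache.iceberg.mr.hive. A sketch of the expected shape (exact output varies by Iceberg and Hive version):

```sql
CREATE EXTERNAL TABLE `sample`(
  `id` bigint,
  `data` string)
ROW FORMAT SERDE
  'org.apache.iceberg.mr.hive.HiveIcebergSerDe'
STORED AS INPUTFORMAT
  'org.apache.iceberg.mr.hive.HiveIcebergInputFormat'
OUTPUTFORMAT
  'org.apache.iceberg.mr.hive.HiveIcebergOutputFormat'
LOCATION
  'oss://{bucket}/iceberg/warehouse/oss_db.db/sample'
```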

deepeye commented 2 years ago

@deepeye: The problem for Hive is that the SERDE / INPUTFORMAT / OUTPUTFORMAT are not correct. They should be the HiveIceberg*** classes for all of them. I think the problem is that engine.hive.enabled is not set for the table.

Thanks, I'll try it!

deepeye commented 2 years ago

@pvary In the Flink SQL shell:

Flink SQL> set iceberg.engine.hive.enabled=true;
[INFO] Session property has been set.
Flink SQL> CREATE TABLE sample4 (
  id BIGINT COMMENT 'unique id',
  data STRING
);

but SHOW CREATE TABLE sample4 shows that the SERDE / INPUTFORMAT / OUTPUTFORMAT are still not correct. Why?

pvary commented 2 years ago

I would guess that

Flink SQL> set iceberg.engine.hive.enabled=true;

only sets a session-level config. I am not sure how Flink handles session-level configs. Is there a way to set the table property engine.hive.enabled to true?
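
One possible way, assuming the Iceberg Flink catalog passes table properties through (this depends on the connector version, so treat it as a sketch): set the property on the table itself with ALTER TABLE rather than as a session config.

```sql
-- Hypothetical: set engine.hive.enabled as a table property from Flink SQL
ALTER TABLE sample SET ('engine.hive.enabled'='true');
```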

kbendick commented 2 years ago

It would be easy to set the table properties via Spark (which I know isn't in play here).

Maybe this is one of those cases where we (presently) create tables in Spark (or in Hive) and then also declare them in Flink?

Can you try putting engine.hive.enabled in the WITH part of the Flink CREATE TABLE DDL? It might not be allowed, as I think only properties recognized by Flink can go there, but it might be worth a check.

CREATE TABLE Sample (
  `id` BIGINT COMMENT 'unique id',
  `data` STRING
) WITH (
  -- not sure which of these two would be needed so possibly try both
  'engine.hive.enabled' = 'true',
  'iceberg.engine.hive.enabled' = 'true'
);

deepeye commented 2 years ago

Thanks @kbendick, @pvary! This problem is solved! This SQL is correct:

CREATE TABLE Sample (
  `id` BIGINT COMMENT 'unique id',
  `data` STRING
) WITH (
  'engine.hive.enabled' = 'true'
);

kbendick commented 2 years ago

@deepeye Can you also check if setting the config iceberg.engine.hive.enabled in your hive-site.xml or in your catalog creation would resolve this issue? Otherwise, we might want to patch this.

I think the easiest way to do it globally would be to use the catalog config (so that all tables from this catalog get the property). If this doesn't work, then we'll need a patch.

CREATE CATALOG hive_catalog WITH (
  'type'='iceberg',
  'catalog-type'='hive',
  'uri'='thrift://localhost:9083',
  'clients'='5',
  'property-version'='1',
  'io-impl'='org.apache.iceberg.aliyun.oss.OSSFileIO',
  'oss.endpoint'='{endpoint}',
  'oss.access.key.id'='{key.id}',
  'oss.access.key.secret'='{key.secret}',
  'warehouse'='oss://{bucket}/iceberg/warehouse',
  'iceberg.engine.hive.enabled' = 'true'
);

SongJok commented 2 years ago

Excuse me! Our company operates Iceberg in the same way you do. May I ask, how did you solve the problem of merging small files afterwards? Thank you!
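
This question goes unanswered in the thread, but for reference: later Iceberg releases ship a rewrite_data_files action for compacting small files. A hedged sketch using the Spark SQL procedure (available from roughly Iceberg 0.13; the catalog name hive_catalog, the table name, and the option values here are assumptions, not from this thread):

```sql
-- Compact small data files into larger ones (Spark SQL procedure)
CALL hive_catalog.system.rewrite_data_files(
  table => 'oss_db.sample',
  options => map('target-file-size-bytes', '536870912')  -- target ~512 MB files
);
```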

github-actions[bot] commented 1 month ago

This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible.

github-actions[bot] commented 1 month ago

This issue has been closed because it has not received any activity in the last 14 days since being marked as 'stale'