apache / gravitino

World's most powerful open data catalog for building a high-performance, geo-distributed and federated metadata lake.
https://gravitino.apache.org
Apache License 2.0

[Bug report] federation query using 2 hive metastores does not work when using gravitino #4932

Open foryou7242 opened 2 months ago

foryou7242 commented 2 months ago

Version

main branch

Describe what's wrong

I want to run federated queries using Hive metastores stored in two Hadoop clusters.

So we added two Hive catalogs to the metalake.

The location path shown by show create table differs from the actual location used when querying with spark-sql.

[Screenshot: the location shown by show create table differs from the location used by the actual query]

It seems to be an effect of the spark.sql.metastore.uris option used by the actual spark-sql query, so I'm wondering whether it is possible to run federated queries across two Hive metastores.
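
For reference, the session is wired to Gravitino roughly as in the minimal sketch below (the plugin class and spark.sql.gravitino.* keys are taken from the Gravitino spark-connector docs and should be verified against your version; hosts, metalake and catalog names are placeholders, not the real cluster values):

import org.apache.spark.sql.SparkSession

// Sketch only: placeholder hosts/names; connector property keys per the Gravitino docs.
val spark = SparkSession.builder()
  .appName("gravitino-hive-federation")
  // Load the Gravitino Spark connector plugin.
  .config("spark.plugins", "org.apache.gravitino.spark.connector.plugin.GravitinoSparkPlugin")
  // Gravitino server and the metalake that holds the two Hive catalogs.
  .config("spark.sql.gravitino.uri", "http://gravitino-host:8090")
  .config("spark.sql.gravitino.metalake", "my_metalake")
  .enableHiveSupport()
  .getOrCreate()

// Each Hive catalog in the metalake should resolve against its own metastore,
// so both clusters can be addressed with catalog-qualified names in one session.
spark.sql("SELECT * FROM hive_test1.some_db.some_table LIMIT 10").show()
spark.sql("SELECT * FROM hive_test2.some_db.some_table LIMIT 10").show()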

Error message and/or stacktrace

>  show create table   portal_test_schema;
CREATE TABLE portal_test_schema (
...
  month INT,
  day INT,
  hour INT
)
PARTITIONED BY (month, day, hour)
LOCATION 'hdfs://test1/test1'
TBLPROPERTIES (
  'bucketing_version' = '2',
  'discover.partitions' = 'true',
  'input-format' = 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat',

explain query

spark-sql ()> EXPLAIN show create table   portal_test_schema;
== Physical Plan ==
ShowCreateTable [createtab_stmt#0], HiveTable(org.apache.spark.sql.SparkSession@14144cc9,CatalogTable(
Database: ladp
Table: portal_test_schema
Created Time: Thu Jan 26 18:40:15 JST 2023
Last Access: UNKNOWN
Created By: Spark 2.2 or prior
Type: EXTERNAL
Provider: hive
Table Properties: [bucketing_version=2, numFilesErasureCoded=0, transient_lastDdlTime=1725947686]
Location: hdfs://test2/portal_test_schema
Serde Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat: org.apache.hadoop.mapred.TextInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Storage Properties: [serialization.format=1]
Partition Provider: Catalog
Schema: root
...
),org.apache.kyuubi.spark.connector.hive.HiveTableCatalog@64cbc28e)

How to reproduce

gravitino branch main

Additional context

No response

jerqi commented 2 months ago

@FANNG1

FANNG1 commented 2 months ago

@foryou7242 , could you help clarify the questions below?

  1. In your environment, is there only one Hive metastore but two HDFS clusters?
  2. The main problem is that you create a table with location hdfs://hdfs1/xxx, but show create table shows the location as hdfs://hdfs2/xxx, yes?
  3. Could you share the catalog properties used when you create the Hive catalogs, and the Spark configurations used with Spark SQL?
foryou7242 commented 2 months ago

@FANNG1

  1. No, two Hive metastores and two HDFS clusters.
  2. Yes.

test1 cluster catalog

FANNG1 commented 2 months ago

My suspicion is that Gravitino seems to be using KyuubiHiveTable for the Hive metastore connection, but Kyuubi only supports connecting to one Hive metastore, which seems to be the problem. Am I right?

The Kyuubi Hive connector can support multiple Hive metastores, because Gravitino creates a separate Kyuubi Hive catalog instance for each catalog, each containing a different Hive metastore URI. I tested two Hive metastores with a shared HDFS cluster in the initial POC phase and it worked well.
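
Conceptually, each Hive catalog in the metalake carries its own metastore URI, and the connector builds one Kyuubi Hive catalog instance per Gravitino catalog, roughly like this sketch (metastore.uris is the Hive catalog property name from the Gravitino docs; hosts, catalog and table names are placeholders):

// Illustrative only: two Hive catalogs in one metalake, each with its own metastore.
// Gravitino builds a separate Kyuubi Hive catalog instance from each property map.
val hiveTest1Props = Map("metastore.uris" -> "thrift://test1-metastore:9083")
val hiveTest2Props = Map("metastore.uris" -> "thrift://test2-metastore:9083")

// A federated query can then reference tables from both catalogs in one statement.
spark.sql(
  """SELECT a.id, b.value
    |FROM hive_test1.db1.t1 a
    |JOIN hive_test2.db2.t2 b ON a.id = b.id""".stripMargin).show()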

Also, could you share the SQL used to create the table? Does querying data work well?

foryou7242 commented 2 months ago

Also, could you share the SQL used to create the table? Does querying data work well?

The table is the same as in the issue, because it is an existing table.

>  show create table   portal_test_schema;
CREATE TABLE portal_test_schema (
...
  month INT,
  day INT,
  hour INT
)
PARTITIONED BY (month, day, hour)
LOCATION 'hdfs://test1/test1'
TBLPROPERTIES (
  'bucketing_version' = '2',
  'discover.partitions' = 'true',
  'input-format' = 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat',
FANNG1 commented 2 months ago

I set up two Hive metastores with separate HDFS clusters and couldn't reproduce this issue with the following SQLs in either of the two catalogs. @foryou7242 could you try a simple SQL like the following?

create table a(a int) location 'hdfs://localhost:9000/user/hive/warehouse/t1.db/a';
show create table a;
CREATE TABLE t1.a (
  a INT)
LOCATION 'hdfs://localhost:9000/user/hive/warehouse/t1.db/a'
TBLPROPERTIES (
  'input-format' = 'org.apache.hadoop.mapred.TextInputFormat',
  'output-format' = 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat',
  'serde-lib' = 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe',
  'serde-name' = 'a',
  'table-type' = 'MANAGED_TABLE',
  'transient_lastDdlTime' = '1726756779')
explain show create table a;
explain show create table a
== Physical Plan ==
ShowCreateTable [createtab_stmt#36], HiveTable(org.apache.spark.sql.SparkSession@7752f9fe,CatalogTable(
Database: t1
Table: a
Owner: hive
Created Time: Thu Sep 19 22:39:39 CST 2024
Last Access: UNKNOWN
Created By: Spark 2.2 or prior
Type: MANAGED
Provider: hive
Table Properties: [gravitino.identifier=gravitino.v1.uid1812375099371418513, owner=hive, transient_lastDdlTime=1726756779]
Location: hdfs://localhost:9000/user/hive/warehouse/t1.db/a
Serde Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat: org.apache.hadoop.mapred.TextInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Partition Provider: Catalog
Schema: root
 |-- a: integer (nullable = true)
),org.apache.kyuubi.spark.connector.hive.HiveTableCatalog@61563a91)
foryou7242 commented 2 months ago

@FANNG1

I set up two Hive metastores with separate HDFS clusters and couldn't reproduce this issue with the following SQLs in either of the two catalogs.

Did you know that spark-sql uses an HDFS that points to a different metastore than localhost?

The above query results in the following error:

spark-sql (test)> create table a(a int) location 'hdfs://test1/warehouse/tablespace/external/hive/test.db/a';
24/09/23 18:07:58 ERROR SparkSQLDriver: Failed in [create table a(a int) location 'hdfs://test1/warehouse/tablespace/external/hive/test.db/a']
java.lang.RuntimeException: Failed to load the real sparkTable: test.a
    at org.apache.gravitino.spark.connector.catalog.BaseCatalog.loadSparkTable(BaseCatalog.java:459)
    at org.apache.gravitino.spark.connector.catalog.BaseCatalog.createTable(BaseCatalog.java:222)
    at org.apache.spark.sql.connector.catalog.TableCatalog.createTable(TableCatalog.java:199)
    at org.apache.spark.sql.execution.datasources.v2.CreateTableExec.run(CreateTableExec.scala:44)
    at org.apache.spark.sql.execution.datasources.v2.V2CommandExec.result$lzycompute(V2CommandExec.scala:43)
    at org.apache.spark.sql.execution.datasources.v2.V2CommandExec.result(V2CommandExec.scala:43)
    at org.apache.spark.sql.execution.datasources.v2.V2CommandExec.executeCollect(V2CommandExec.scala:49)
    at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:98)
    at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:118)
    at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:195)
    at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:103)
    at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:827)
    at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:65)
    at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:98)
    at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:94)
    at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:512)
    at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:104)
    at org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:512)
    at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDownWithPruning(LogicalPlan.scala:31)

Caused by: org.apache.spark.sql.catalyst.analysis.NoSuchTableException: [TABLE_OR_VIEW_NOT_FOUND] The table or view `test`.`a` cannot be found. Verify the spelling and correctness of the schema and catalog.
If you did not qualify the name with a schema, verify the current_schema() output, or qualify the name with the correct schema and catalog.
To tolerate the error on drop use DROP VIEW IF EXISTS or DROP TABLE IF EXISTS.
    at org.apache.spark.sql.catalyst.catalog.SessionCatalog.requireTableExists(SessionCatalog.scala:256)
    at org.apache.spark.sql.catalyst.catalog.SessionCatalog.getTableRawMetadata(SessionCatalog.scala:541)
    at org.apache.spark.sql.catalyst.catalog.SessionCatalog.getTableMetadata(SessionCatalog.scala:526)
    at org.apache.kyuubi.spark.connector.hive.HiveTableCatalog.$anonfun$loadTable$1(HiveTableCatalog.scala:166)
    at org.apache.kyuubi.spark.connector.hive.HiveConnectorUtils$.withSQLConf(HiveConnectorUtils.scala:274)
    at org.apache.kyuubi.spark.connector.hive.HiveTableCatalog.loadTable(HiveTableCatalog.scala:166)
    at org.apache.gravitino.spark.connector.catalog.BaseCatalog.loadSparkTable(BaseCatalog.java:456)
    ... 60 more
FANNG1 commented 2 months ago

You should set spark.bypass.spark.sql.hive.metastore.jars explicitly in the Hive catalog properties if you are using the Spark SQL client, or use the Spark shell instead. Please refer to https://gravitino.apache.org/docs/0.6.0-incubating/spark-connector/spark-catalog-hive
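
For illustration only (the keys follow the linked doc; the metastore host and jar path are placeholders), the Hive catalog properties could look like:

// Illustrative catalog properties: point the catalog at its metastore and tell the
// Spark SQL client to load Hive metastore jars from an explicit path instead of "builtin".
val hiveCatalogProps = Map(
  "metastore.uris"                                  -> "thrift://test1-metastore:9083",
  "spark.bypass.spark.sql.hive.metastore.jars"      -> "path",
  "spark.bypass.spark.sql.hive.metastore.jars.path" -> "file:///opt/hive/lib/*"
)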

foryou7242 commented 2 months ago

@FANNG1 Thank you so much for your help.

First of all, the spark.bypass.spark.sql.hive.metastore.jars setting is already set to builtin.

The root cause is that hive.metastore.uris is set to the test2 Hive metastore in hive-site.xml. I deleted that setting and confirmed that queries work normally with spark-shell.
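
For reference, this is how the effective metastore URI can be checked from spark-shell (standard Spark/Hadoop APIs, nothing Gravitino-specific):

// Which hive.metastore.uris did the session pick up (e.g. from hive-site.xml)?
println(spark.sparkContext.hadoopConfiguration.get("hive.metastore.uris"))

// And which metastore jars mode is the Hive client using?
println(spark.conf.getOption("spark.sql.hive.metastore.jars").getOrElse("builtin"))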

However, I still have the same problem with spark-sql. Am I right in understanding that this is an issue that will be fixed in the future?

scala> val df = spark.table("test2.test").filter($"month" === "202009" && $"day" === "20200928").limit(100)
24/09/23 23:20:02 INFO HiveConf: Found configuration file file:/opt/spark/conf/hive-site.xml
24/09/23 23:20:02 INFO metastore: Trying to connect to metastore with URI thrift://test1:9083
24/09/23 23:20:02 INFO metastore: Opened a connection to metastore, current connections: 1
24/09/23 23:20:02 INFO metastore: Connected to metastore.
df: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [test1: string, month: string ... 1 more field]

scala> df.show(false)
+------------------------------------+------+--------+
|test1                            |month |day     |
+------------------------------------+------+--------+
|1231231235555555557|202009|20200928|
|1231231235555555557|202009|20200928|
|1231231235555555557|202009|20200928|
spark-sql (default)> show create table test2.test;
24/09/23 23:17:41 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
24/09/23 23:17:42 WARN ObjectStore: Failed to get database test2, returning NoSuchObjectException
[TABLE_OR_VIEW_NOT_FOUND] The table or view `test2`.`test` cannot be found. Verify the spelling and correctness of the schema and catalog.
If you did not qualify the name with a schema, verify the current_schema() output, or qualify the name with the correct schema and catalog.
To tolerate the error on drop use DROP VIEW IF EXISTS or DROP TABLE IF EXISTS.; line 1 pos 18;
'ShowCreateTable false, [createtab_stmt#3]
+- 'UnresolvedTableOrView [test2, test], SHOW CREATE TABLE, false
FANNG1 commented 1 month ago

First of all, the spark.bypass.spark.sql.hive.metastore.jars setting is already set to builtin.

spark.bypass.spark.sql.hive.metastore.jars should be set to a value other than builtin, for example:

        "spark.bypass.spark.sql.hive.metastore.jars":"path",
        "spark.bypass.spark.sql.hive.metastore.jars.path":"file:///xx/hive/lib/*"

The root cause is that hive.metastore.uris is set to the test2 Hive metastore in hive-site.xml. I deleted that setting and confirmed that queries work normally with spark-shell.

Is hive.metastore.uris marked as a final setting (i.e. <final>true</final>) in hive-site.xml?