foryou7242 opened this issue 2 months ago
@FANNG1
@foryou7242, could you help clarify the questions below?
@FANNG1
test1 cluster catalog:
metastore.uris: thrift://test2:9083

spark-sql configuration:
spark-sql --master yarn --queue batch --deploy-mode client --conf spark.executor.cores=2 --conf spark.executor.instances=10 \
--conf spark.plugins="org.apache.gravitino.spark.connector.plugin.GravitinoSparkPlugin" \
--conf spark.sql.gravitino.uri=http://gravitino.stage.com \
--conf spark.sql.gravitino.metalake=TEST \
--jars hdfs://test1/user/spark/application/gravitino-spark-connector-runtime-3.4_2.12-0.7.0-incubating-SNAPSHOT.jar
My suspicion is that Gravitino seems to be using the Kyuubi Hive connector for the Hive metastore connection, but Kyuubi only supports connecting to a single Hive metastore, which seems to be the problem. Am I right?
> My suspicion is that Gravitino seems to be using the Kyuubi Hive connector for the Hive metastore connection, but Kyuubi only supports connecting to a single Hive metastore, which seems to be the problem. Am I right?
The Kyuubi Hive connector can support multiple Hive metastores, because Gravitino creates a separate Kyuubi Hive catalog instance for each catalog, each with its own Hive metastore URI. I tested two Hive metastores with a shared HDFS cluster in the initial POC phase and it worked well.
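As an illustration of that multi-catalog setup, here is a minimal sketch; the catalog names hive_test1/hive_test2, the database and table names, and the exact request shape are assumptions to verify against the Gravitino docs for your version.

# Sketch: register two Hive catalogs in the TEST metalake, each pointing at its
# own metastore; Gravitino backs each one with its own Kyuubi Hive catalog instance.
curl -X POST -H "Content-Type: application/json" \
  http://gravitino.stage.com/api/metalakes/TEST/catalogs \
  -d '{"name": "hive_test1", "type": "RELATIONAL", "provider": "hive",
       "properties": {"metastore.uris": "thrift://test1:9083"}}'

curl -X POST -H "Content-Type: application/json" \
  http://gravitino.stage.com/api/metalakes/TEST/catalogs \
  -d '{"name": "hive_test2", "type": "RELATIONAL", "provider": "hive",
       "properties": {"metastore.uris": "thrift://test2:9083"}}'

# A federated query can then address both catalogs with three-part names
# (db1.t1 and db2.t2 are hypothetical):
spark-sql \
  --conf spark.plugins="org.apache.gravitino.spark.connector.plugin.GravitinoSparkPlugin" \
  --conf spark.sql.gravitino.uri=http://gravitino.stage.com \
  --conf spark.sql.gravitino.metalake=TEST \
  -e "SELECT a.id FROM hive_test1.db1.t1 a JOIN hive_test2.db2.t2 b ON a.id = b.id"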
And could you share the SQL used to create the table? Does querying data work well?
> And could you share the SQL used to create the table? Does querying data work well?
The table is the same as in the issue, because it's an existing table.
> show create table portal_test_schema;
CREATE TABLE portal_test_schema (
...
month INT,
day INT,
hour INT
)
PARTITIONED BY (month, day, hour)
LOCATION 'hdfs://test1/test1'
TBLPROPERTIES (
'bucketing_version' = '2',
'discover.partitions' = 'true',
'input-format' = 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat',
I set up two Hive metastores with separate HDFS clusters, and couldn't reproduce this issue with the following SQLs in either of the two catalogs. @foryou7242 could you try a simple SQL like the following?
create table a(a int) location 'hdfs://localhost:9000/user/hive/warehouse/t1.db/a';
show create table a;
CREATE TABLE t1.a (
a INT)
LOCATION 'hdfs://localhost:9000/user/hive/warehouse/t1.db/a'
TBLPROPERTIES (
'input-format' = 'org.apache.hadoop.mapred.TextInputFormat',
'output-format' = 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat',
'serde-lib' = 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe',
'serde-name' = 'a',
'table-type' = 'MANAGED_TABLE',
'transient_lastDdlTime' = '1726756779')
explain show create table a;
explain show create table a
== Physical Plan ==
ShowCreateTable [createtab_stmt#36], HiveTable(org.apache.spark.sql.SparkSession@7752f9fe,CatalogTable(
Database: t1
Table: a
Owner: hive
Created Time: Thu Sep 19 22:39:39 CST 2024
Last Access: UNKNOWN
Created By: Spark 2.2 or prior
Type: MANAGED
Provider: hive
Table Properties: [gravitino.identifier=gravitino.v1.uid1812375099371418513, owner=hive, transient_lastDdlTime=1726756779]
Location: hdfs://localhost:9000/user/hive/warehouse/t1.db/a
Serde Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat: org.apache.hadoop.mapred.TextInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Partition Provider: Catalog
Schema: root
|-- a: integer (nullable = true)
),org.apache.kyuubi.spark.connector.hive.HiveTableCatalog@61563a91)
@FANNG1
> I set up two Hive metastores with separate HDFS clusters, and couldn't reproduce this issue with the following SQLs in either of the two catalogs.
Did you take into account that my spark-sql setup uses HDFS, which points at a different metastore than localhost?
Running the above query results in the following error:
spark-sql (test)> create table a(a int) location 'hdfs://test1/warehouse/tablespace/external/hive/test.db/a';
24/09/23 18:07:58 ERROR SparkSQLDriver: Failed in [create table a(a int) location 'hdfs://test1/warehouse/tablespace/external/hive/test.db/a']
java.lang.RuntimeException: Failed to load the real sparkTable: test.a
at org.apache.gravitino.spark.connector.catalog.BaseCatalog.loadSparkTable(BaseCatalog.java:459)
at org.apache.gravitino.spark.connector.catalog.BaseCatalog.createTable(BaseCatalog.java:222)
at org.apache.spark.sql.connector.catalog.TableCatalog.createTable(TableCatalog.java:199)
at org.apache.spark.sql.execution.datasources.v2.CreateTableExec.run(CreateTableExec.scala:44)
at org.apache.spark.sql.execution.datasources.v2.V2CommandExec.result$lzycompute(V2CommandExec.scala:43)
at org.apache.spark.sql.execution.datasources.v2.V2CommandExec.result(V2CommandExec.scala:43)
at org.apache.spark.sql.execution.datasources.v2.V2CommandExec.executeCollect(V2CommandExec.scala:49)
at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:98)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:118)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:195)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:103)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:827)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:65)
at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:98)
at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:94)
at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:512)
at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:104)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:512)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDownWithPruning(LogicalPlan.scala:31)
Caused by: org.apache.spark.sql.catalyst.analysis.NoSuchTableException: [TABLE_OR_VIEW_NOT_FOUND] The table or view `test`.`a` cannot be found. Verify the spelling and correctness of the schema and catalog.
If you did not qualify the name with a schema, verify the current_schema() output, or qualify the name with the correct schema and catalog.
To tolerate the error on drop use DROP VIEW IF EXISTS or DROP TABLE IF EXISTS.
at org.apache.spark.sql.catalyst.catalog.SessionCatalog.requireTableExists(SessionCatalog.scala:256)
at org.apache.spark.sql.catalyst.catalog.SessionCatalog.getTableRawMetadata(SessionCatalog.scala:541)
at org.apache.spark.sql.catalyst.catalog.SessionCatalog.getTableMetadata(SessionCatalog.scala:526)
at org.apache.kyuubi.spark.connector.hive.HiveTableCatalog.$anonfun$loadTable$1(HiveTableCatalog.scala:166)
at org.apache.kyuubi.spark.connector.hive.HiveConnectorUtils$.withSQLConf(HiveConnectorUtils.scala:274)
at org.apache.kyuubi.spark.connector.hive.HiveTableCatalog.loadTable(HiveTableCatalog.scala:166)
at org.apache.gravitino.spark.connector.catalog.BaseCatalog.loadSparkTable(BaseCatalog.java:456)
... 60 more
You should set spark.bypass.spark.sql.hive.metastore.jars explicitly in the Hive catalog properties if you are using the Spark SQL client, or use Spark shell instead. Please refer to https://gravitino.apache.org/docs/0.6.0-incubating/spark-connector/spark-catalog-hive
@FANNG1 Thank you so much for your help.
First of all, the spark.bypass.spark.sql.hive.metastore.jars setting is already set to builtin.
The root cause is that hive.metastore.uris is set to the test2 Hive metastore in hive-site.xml. I deleted that setting and verified that queries work normally with spark-shell.
However, I still have the same problem with spark-sql. Am I right in understanding that this is an issue that will be fixed in the future?
scala> val df = spark.table("test2.test").filter($"month" === "202009" && $"day" === "20200928").limit(100)
24/09/23 23:20:02 INFO HiveConf: Found configuration file file:/opt/spark/conf/hive-site.xml
24/09/23 23:20:02 INFO metastore: Trying to connect to metastore with URI thrift://test1:9083
24/09/23 23:20:02 INFO metastore: Opened a connection to metastore, current connections: 1
24/09/23 23:20:02 INFO metastore: Connected to metastore.
df: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [test1: string, month: string ... 1 more field]
scala> df.show(false)
+------------------------------------+------+--------+
|test1 |month |day |
+------------------------------------+------+--------+
|1231231235555555557|202009|20200928|
|1231231235555555557|202009|20200928|
|1231231235555555557|202009|20200928|
spark-sql (default)> show create table test2.test;
24/09/23 23:17:41 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
24/09/23 23:17:42 WARN ObjectStore: Failed to get database test2, returning NoSuchObjectException
[TABLE_OR_VIEW_NOT_FOUND] The table or view `test2`.`test` cannot be found. Verify the spelling and correctness of the schema and catalog.
If you did not qualify the name with a schema, verify the current_schema() output, or qualify the name with the correct schema and catalog.
To tolerate the error on drop use DROP VIEW IF EXISTS or DROP TABLE IF EXISTS.; line 1 pos 18;
'ShowCreateTable false, [createtab_stmt#3]
+- 'UnresolvedTableOrView [test2, test], SHOW CREATE TABLE, false
> First of all, the spark.bypass.spark.sql.hive.metastore.jars setting is already set to builtin.
spark.bypass.spark.sql.hive.metastore.jars should be set to a value other than builtin, like:
"spark.bypass.spark.sql.hive.metastore.jars": "path",
"spark.bypass.spark.sql.hive.metastore.jars.path": "file:///xx/hive/lib/*"
> The root cause is that hive.metastore.uris is set to the test2 Hive metastore in hive-site.xml. I deleted that setting and verified that queries work normally with spark-shell.
Is hive.metastore.uris a final setting in hive-site.xml?
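A quick way to check (a sketch, assuming the client reads /opt/spark/conf/hive-site.xml as shown in the log above) is to look for a <final>true</final> element on that property:

# Check whether hive.metastore.uris is declared, and whether it is marked
# <final>true</final>, in the hive-site.xml that spark-sql picks up.
grep -A 3 'hive.metastore.uris' /opt/spark/conf/hive-site.xml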
Version
main branch
Describe what's wrong
I want to run federated queries using Hive metastores stored in two Hadoop clusters.
So we added two Hive catalogs to the metalake.
There is a difference between the location path shown by SHOW CREATE TABLE and the actual location when querying with spark-sql.
It seems to be an effect of the spark.sql.metastore.uris option used by the actual spark-sql query, so I'm wondering if it's possible to run federated queries across the two Hive metastores.
Error message and/or stacktrace
explain query
How to reproduce
gravitino branch main
Additional context
No response