apache / kyuubi

Apache Kyuubi is a distributed and multi-tenant gateway to provide serverless SQL on data warehouses and lakehouses.
https://kyuubi.apache.org/
Apache License 2.0

[Bug] The configuration of Kyuubi SparkSQL query engine setting Hudi Schema Evolution has not taken effect #4885

Open yihao-tcf opened 1 year ago

yihao-tcf commented 1 year ago

Describe the bug

Spark version: 3.2.3, Hudi version: 0.13.0

Description: I connect to the SparkSQL query engine through Kyuubi, expose the service with the Hive JDBC driver, and drop Hudi table columns using Hudi Schema Evolution. The error message is: DROP COLUMN is only supported with v2 tables. But I have no problem dropping columns from the Hudi table when using the Hudi Schema Evolution feature through spark-sql. Using the Hudi Schema Evolution feature requires two configurations:

set hoodie.schema.on.read.enable=true;
set hoodie.datasource.write.schema.allow.auto.evolution.column.drop=true;

If these configurations are not set in spark-sql either, dropping the Hudi table column prompts the same error message. So it seems that the configurations I set when using Kyuubi on Spark through the JDBC driver do not take effect.
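For reference, the failing sequence over the Kyuubi Thrift/JDBC endpoint looks roughly like the following sketch (host, user, and password are placeholders; port 8788 matches kyuubi.frontend.bind.port in the configuration below; hudi_mor_tbl is just an example table):

import java.sql.DriverManager

object KyuubiHudiDropColumn {
  def main(args: Array[String]): Unit = {
    // Hive JDBC driver talking to the Kyuubi frontend
    Class.forName("org.apache.hive.jdbc.HiveDriver")
    val conn = DriverManager.getConnection("jdbc:hive2://kyuubi-host:8788/default", "user", "passwd")
    val stmt = conn.createStatement()
    try {
      // Session-level Hudi schema-evolution switches
      stmt.execute("set hoodie.schema.on.read.enable=true")
      stmt.execute("set hoodie.datasource.write.schema.allow.auto.evolution.column.drop=true")
      // Fails with "DROP COLUMN is only supported with v2 tables" when the settings
      // are not visible to the session that resolves the table
      stmt.execute("alter table hudi_mor_tbl drop column price")
    } finally {
      stmt.close()
      conn.close()
    }
  }
}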

Using SparkSQL operations: (screenshots)

Using Kyuubi operations: (screenshots)

Affects Version(s)

1.6.0

Kyuubi Server Log Output

No response

Kyuubi Engine Log Output

No response

Kyuubi Server Configurations


## Kyuubi Configurations

#
# kyuubi.authentication           NONE
# kyuubi.frontend.bind.host       localhost
 kyuubi.frontend.bind.port       8788
# HA
kyuubi.ha.zookeeper.quorum  xxx:2181,xxx:2181,xxx:2181

#connection pool
kyuubi.frontend.thrift.max.worker.threads 500000
kyuubi.frontend.mysql.max.worker.threads  500000

# share
kyuubi.engine.share.level USER
#kyuubi.engine.single.spark.session true
#kyuubi.engine.share.level SERVER
spark.dynamicAllocation.enabled=true
## set to false if you prefer shuffle tracking over ESS
spark.shuffle.service.enabled=true
spark.dynamicAllocation.initialExecutors=5
spark.dynamicAllocation.minExecutors=5
spark.dynamicAllocation.maxExecutors=500
spark.dynamicAllocation.executorAllocationRatio=0.5
spark.dynamicAllocation.executorIdleTimeout=60s
spark.dynamicAllocation.cachedExecutorIdleTimeout=30min
## set to true if you prefer shuffle tracking over ESS
spark.dynamicAllocation.shuffleTracking.enabled=false
spark.dynamicAllocation.shuffleTracking.timeout=30min
spark.dynamicAllocation.schedulerBacklogTimeout=1s
spark.dynamicAllocation.sustainedSchedulerBacklogTimeout=1s
spark.cleaner.periodicGC.interval=5min

# Monitoring on prometheus
kyuubi.metrics.reporters PROMETHEUS 
kyuubi.metrics.prometheus.port 10019
kyuubi.metrics.prometheus.path /metrics

#JDBC Authentication
kyuubi.authentication=JDBC
kyuubi.authentication.jdbc.driver.class = com.mysql.jdbc.Driver
kyuubi.authentication.jdbc.url = jdbc:mysql://xx.xx.xx.xx:3306/kyuubi
kyuubi.authentication.jdbc.user =xxx
kyuubi.authentication.jdbc.password =xxx
kyuubi.authentication.jdbc.query = SELECT 1 FROM t_kyuubi_user WHERE user=${user} AND passwd=md5(${password})

# Spark Configurations
spark.master yarn
spark.yarn.jars=hdfs://mycluster/spark-jars/*.jar
spark.executor.memory 5G
spark.executor.cores 3
spark.executor.heartbeatInterval 200000
spark.network.timeout 300000
#spark.dynamicAllocation.enabled true
#spark.dynamicAllocation.minExecutors 0
#spark.dynamicAllocation.maxExecutors 20
#spark.dynamicAllocation.executorIdleTimeout 60
spark.submit.deployMode cluster
spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension
spark.serializer=org.apache.spark.serializer.KryoSerializer
spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog
spark.sql.parquet.datetimeRebaseModeInRead=CORRECTED
spark.kryoserializer.buffer.max=512
spark.hadoop.hive.exec.dynamic.partition=true
spark.hadoop.hive.exec.dynamic.partition.mode=nonstrict

spark.hoodie.schema.on.read.enable=true
spark.hoodie.datasource.write.reconcile.schema=true
spark.hoodie.datasource.write.schema.allow.auto.evolution.column.drop=true
# Details in https://kyuubi.apache.org/docs/latest/deployment/settings.html

Kyuubi Engine Configurations


# Default system properties included when running spark-submit.
# This is useful for setting default environmental settings.

# Example:
# spark.master                     spark://master:7077
  spark.master                     yarn
  spark.eventLog.enabled           true
  spark.eventLog.dir               hdfs://mycluster/spark-logs
  spark.eventLog.compress          true
  spark.executor.logs.rolling.maxSize     10000000
  spark.executor.logs.rolling.maxRetainedFiles 10
  spark.yarn.jars=hdfs://mycluster/spark-jars/*.jar
  spark.driver.extraClassPath /opt/module/spark-3.2.3/external_jars/*
  spark.executor.extraClassPath /opt/module/spark-3.2.3/external_jars/*
# spark.serializer                 org.apache.spark.serializer.KryoSerializer
# spark.driver.memory              5g
# spark.executor.extraJavaOptions  -XX:+PrintGCDetails -Dkey=value -Dnumbers="one two three"
spark.yarn.historyServer.address=xxxx:18080
spark.history.ui.port=18080

# HUDI_CONF
spark.serializer                  org.apache.spark.serializer.KryoSerializer
spark.sql.catalog.spark_catalog org.apache.spark.sql.hudi.catalog.HoodieCatalog
spark.sql.extensions              org.apache.spark.sql.hudi.HoodieSparkSessionExtension,org.apache.kyuubi.plugin.spark.authz.ranger.RangerSparkExtension
#spark.sql.warehouse.dir           file:///tmp/hudi-bundles/hive/warehouse
spark.sql.warehouse.dir           hdfs://xxx:8020/user/hive/warehouse
spark.default.parallelism         8
spark.sql.shuffle.partitions      8
spark.sql.parquet.datetimeRebaseModeInRead CORRECTED

#spark optimize

spark.kryoserializer.buffer.max=254
spark.executor.memory 3G
spark.executor.cores 3
spark.executor.heartbeatInterval 200000
spark.network.timeout 300000
spark.driver.cores=2
spark.driver.memory=3g
spark.driver.maxResultSize 2g
spark.dynamicAllocation.enabled=true
## set to false if you prefer shuffle tracking over ESS
spark.shuffle.service.enabled=true
spark.dynamicAllocation.initialExecutors=3
spark.dynamicAllocation.minExecutors=3
spark.dynamicAllocation.maxExecutors=500
spark.dynamicAllocation.executorAllocationRatio=0.5
spark.dynamicAllocation.executorIdleTimeout=60s
spark.dynamicAllocation.cachedExecutorIdleTimeout=30min
### set to true if you prefer shuffle tracking over ESS
spark.dynamicAllocation.shuffleTracking.enabled=false
spark.dynamicAllocation.shuffleTracking.timeout=30min
spark.dynamicAllocation.schedulerBacklogTimeout=1s
spark.dynamicAllocation.sustainedSchedulerBacklogTimeout=1s
spark.cleaner.periodicGC.interval=5min
#hadoop ha
spark.hadoop.user.name=xxx
spark.hadoop.fs.defaultFS=hdfs://mycluster

Additional context

No response

Are you willing to submit PR?

pan3793 commented 1 year ago

What's the result of spark-shell?

Compared with Kyuubi, spark-sql has some hacks on the Hive isolated classloader; not sure if they are related.

cxzl25 commented 1 year ago

I can't reproduce.

Kyuubi 1.7.0 Spark 3.2.3 Hudi 0.13.0

(screenshot)

yihao-tcf commented 1 year ago

I can't reproduce.

Kyuubi 1.7.0 Spark 3.2.3 Hudi 0.13.0

(screenshot)

Hello, can you provide your Kyuubi Server Configurations and Kyuubi Engine Configurations? Thanks.

yihao-tcf commented 1 year ago

What's the result of spark-shell?

Compared with Kyuubi, spark-sql has some hacks on the Hive isolated classloader; not sure if they are related.

It succeeds through spark-shell.
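For reference, the spark-shell sequence that succeeds looks roughly like this (a minimal sketch, assuming the Hudi bundle and HoodieSparkSessionExtension are on the classpath; hudi_mor_tbl is just an example table):

spark.sql("set hoodie.schema.on.read.enable=true")
spark.sql("set hoodie.datasource.write.schema.allow.auto.evolution.column.drop=true")
spark.sql("alter table hudi_mor_tbl drop column price")  // succeeds: everything runs in the one global SparkSession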

cxzl25 commented 1 year ago

Hello, can you provide your Kyuubi Server Configurations and Kyuubi Engine Configurations? Thanks.

The Kyuubi server only configures SPARK_HOME in kyuubi-env.sh. The Spark configuration is as follows:

SPARK_HOME/conf/spark-defaults.conf

spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension
spark.serializer=org.apache.spark.serializer.KryoSerializer
spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog

SPARK_HOME/conf/hive-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://localhost:9083</value>
  </property>
</configuration>
yihao-tcf commented 1 year ago

Hello, can you provide your Kyuubi Server Configurations and Kyuubi Engine Configurations? Thanks.

The Kyuubi server only configures SPARK_HOME in kyuubi-env.sh. The Spark configuration is as follows:

SPARK_HOME/conf/spark-defaults.conf

spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension
spark.serializer=org.apache.spark.serializer.KryoSerializer
spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog

SPARK_HOME/conf/hive-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://localhost:9083</value>
  </property>
</configuration>

Hello buddy, could you provide a screenshot of the SQL Details and the SQL/DataFrame Properties shown under the SQL/DataFrame tab of the Spark UI after executing the DML statement through Kyuubi? Let me compare the differences between us. Thank you.

As shown in the following figure: (screenshot)

yihao-tcf commented 1 year ago

I connected to the Spark SQL engine through Kyuubi and set the parameter hoodie.schema.on.read.enable=true. Debugging hudi-spark3-datasource showed that the hoodie.schema.on.read.enable setting did not take effect, so my table was converted to a DataSource v1 table. Check out HUDI-4178 for more details.

This is already the second version combination I have switched to: Hudi 0.13.0, Spark 3.3.2, Kyuubi 1.7.1.

(screenshot)

pan3793 commented 1 year ago

Would you mind trying kyuubi.engine.single.spark.session=true (added in kyuubi-defaults.conf)?

One difference between Kyuubi and spark-sql/spark-shell is that Kyuubi uses a different SparkSession for each session (JDBC connection or Beeline session), while the latter only use one global SparkSession.
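A minimal local sketch of that isolation (illustrative only, not Kyuubi's actual engine code; newSession() stands in for the per-connection SparkSessions Kyuubi creates):

import org.apache.spark.sql.SparkSession

val root = SparkSession.builder().master("local[1]").appName("conf-isolation").getOrCreate()
val s1 = root.newSession()   // stands in for JDBC connection #1
val s2 = root.newSession()   // stands in for JDBC connection #2

s1.sql("set hoodie.schema.on.read.enable=true")
println(s1.conf.getOption("hoodie.schema.on.read.enable"))  // Some(true)
println(s2.conf.getOption("hoodie.schema.on.read.enable"))  // None -- the SET is not visible to the other session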

cxzl25 commented 1 year ago
(screenshot)

cxzl25 commented 1 year ago
Hudi 0.13.0
Spark 3.3.2
Kyuubi 1.7.1
create table hudi_mor_tbl (
  id int,
  name string,
  price double,
  ts bigint
) using hudi
tblproperties (
  type = 'mor',
  primaryKey = 'id',
  preCombineField = 'ts'
);
set hoodie.schema.on.read.enable=true;
set hoodie.datasource.write.schema.allow.auto.evolution.column.drop=true;
alter table hudi_mor_tbl drop column price;
(screenshot)

BTW, you can find me in the Kyuubi WeChat user group or on Slack, and we can communicate offline.

yihao-tcf commented 1 year ago

Would you mind trying kyuubi.engine.single.spark.session=true (added in kyuubi-defaults.conf)?

One difference between Kyuubi and spark-sql/spark-shell is that Kyuubi uses a different SparkSession for each session (JDBC connection or Beeline session), while the latter only use one global SparkSession.

Thank you very much. After setting kyuubi.engine.single.spark.session=true, my problem has been resolved. But I still have doubts: I checked the previous errors through the Spark UI, and all my SQL was executed in one session, so why did it report an error? (screenshot)

yihao-tcf commented 1 year ago
Hudi 0.13.0
Spark 3.3.2
Kyuubi 1.7.1
create table hudi_mor_tbl (
  id int,
  name string,
  price double,
  ts bigint
) using hudi
tblproperties (
  type = 'mor',
  primaryKey = 'id',
  preCombineField = 'ts'
);
set hoodie.schema.on.read.enable=true;
set hoodie.datasource.write.schema.allow.auto.evolution.column.drop=true;
alter table hudi_mor_tbl drop column price;
(screenshot)

BTW You can find me in the Kyuubi WeChat user group or Slack, and we can communicate offline.

Thank you very much for handling this issue for me. May I know how to join the Kyuubi WeChat user group?

pan3793 commented 1 year ago

May I know how to join the Kyuubi WeChat user group?

Please check FAQ https://github.com/apache/kyuubi/discussions/2481

cxzl25 commented 1 year ago

I checked the previous errors through the Spark UI, and all my SQL was executed in one session, so why did it report an error?

This is really a little strange; parameters set in the same session should take effect.

(screenshot)

pan3793 commented 1 year ago

After setting kyuubi.engine.single.spark.session=true, my problem has been resolved.

One possibility is that Hudi holds the wrong SparkSession instance.

Please note that kyuubi.engine.single.spark.session=true is not suggested in common cases; when enabled, SET x=y in one JDBC connection also affects the others (because all connections in a Spark application share one SparkSession).

cxzl25 commented 1 year ago

One possibility is that Hudi holds the wrong SparkSession instance.

This is indeed possible. I have noticed that Hudi obtains its SparkSession via SparkSession.active at initialization, which may end up being the wrong one.

org.apache.spark.sql.hudi.catalog.HoodieCatalog

  val spark: SparkSession = SparkSession.active

org.apache.spark.sql.hudi.catalog.HoodieCatalog#loadTable

        val schemaEvolutionEnabled: Boolean = spark.sessionState.conf.getConfString(DataSourceReadOptions.SCHEMA_EVOLUTION_ENABLED.key,
          DataSourceReadOptions.SCHEMA_EVOLUTION_ENABLED.defaultValue.toString).toBoolean
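
A rough, self-contained sketch of that failure mode (illustrative only, not Hudi's actual code path): a component that captures SparkSession.active once, at initialization, keeps reading conf from that session even when the SET was issued in a different one.

import org.apache.spark.sql.SparkSession

val root = SparkSession.builder().master("local[1]").appName("active-session").getOrCreate()

class CatalogLike {
  // captured eagerly, like `val spark: SparkSession = SparkSession.active` above
  val spark: SparkSession = SparkSession.active
  // analogous to the getConfString lookup in HoodieCatalog#loadTable
  def schemaEvolutionEnabled: Boolean =
    spark.conf.get("hoodie.schema.on.read.enable", "false").toBoolean
}

val catalog = new CatalogLike()        // active session is `root` at this point

val userSession = root.newSession()    // stands in for the Kyuubi connection's session
userSession.sql("set hoodie.schema.on.read.enable=true")

println(catalog.schemaEvolutionEnabled)  // false -- the SET landed in a different SparkSession
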
yihao-tcf commented 1 year ago

One possibility is that Hudi holds the wrong SparkSession instance.

This is indeed possible. I have noticed that Hudi obtains its SparkSession via SparkSession.active at initialization, which may end up being the wrong one.

org.apache.spark.sql.hudi.catalog.HoodieCatalog

  val spark: SparkSession = SparkSession.active

org.apache.spark.sql.hudi.catalog.HoodieCatalog#loadTable

        val schemaEvolutionEnabled: Boolean = spark.sessionState.conf.getConfString(DataSourceReadOptions.SCHEMA_EVOLUTION_ENABLED.key,
          DataSourceReadOptions.SCHEMA_EVOLUTION_ENABLED.defaultValue.toString).toBoolean

I understand it as Hudi obtaining an arbitrary SparkSession, so it cannot see the corresponding session configurations. In this light, it seems that neither side has a bug on its own; the issue arises from their interaction. It appears that resolving this problem could be quite challenging.

njalan commented 1 year ago

@pan3793 When I set kyuubi.engine.single.spark.session=true I still face the error below; these tables keep being updated by Spark Streaming. Caused by: java.io.FileNotFoundException: No such file or directory. Did you ever face the same issue?

yihao-tcf commented 1 year ago

@pan3793 When I set kyuubi.engine.single.spark.session=true I still face the error below; these tables keep being updated by Spark Streaming. Caused by: java.io.FileNotFoundException: No such file or directory. Did you ever face the same issue?

Yes, if the single-session mode is enabled on the Kyuubi server, new SparkSessions will not be created. For the handling approach, refer to this Hudi issue: https://github.com/apache/hudi/issues/7452

njalan commented 1 year ago

Is it the same thing? My error message is because the file for that commit cannot be found. I use the default settings for hoodie.keep.max/min.commits, so at least 20 commits should be kept. After a refresh table it works fine in spark-sql. Why does Kyuubi keep getting this error message?
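For reference, the refresh step mentioned above is just the standard Spark SQL command, roughly (table name is a placeholder):

spark.sql("refresh table db.hudi_streaming_tbl")   // invalidate cached metadata/file listings for the table
spark.sql("select count(*) from db.hudi_streaming_tbl").show()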

pan3793 commented 1 year ago

Is it the same thing?

@njalan It should not be the same issue. I am pointing to this issue/discussion because there are possibilities that Kyuubi's multiple sessions vs. spark-sql's single session may cause some differences, especially when someone claims everything works well in spark-sql but not in Kyuubi.