apache / kyuubi

Apache Kyuubi is a distributed and multi-tenant gateway to provide serverless SQL on data warehouses and lakehouses.
https://kyuubi.apache.org/
Apache License 2.0

[Bug] Error while compiling statement: FAILED: ParseException line 1:37 cannot recognize input near '.' 'id1' 'BIGINT' in column type #4051

Open mgyboom opened 1 year ago

mgyboom commented 1 year ago

Code of Conduct

Search before asking

Describe the bug

I use the Hive JDBC dialect plugin as described at https://kyuubi.readthedocs.io/en/latest/extensions/engines/spark/jdbc-dialect.html, but I'm running into some issues in my Spark program.

I built kyuubi-extension-spark-jdbc-dialect_2.12-1.7.0-SNAPSHOT.jar from the 'master' branch and put it into $SPARK_HOME/jars.

Here is my code:


import org.apache.spark.SparkConf;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class HiveDemo {

    private static final Logger log = LoggerFactory.getLogger(HiveDemo.class);

    public static void main(String[] args) {
        SparkConf sparkConf = new SparkConf();
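        // Register the Kyuubi Spark JDBC dialect extension (from the jar placed in $SPARK_HOME/jars)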
        sparkConf.set("spark.sql.extensions", "org.apache.spark.sql.dialect.KyuubiSparkJdbcDialectExtension");
        SparkSession sparkSession = SparkSession.builder()
                .config(sparkConf)
                .enableHiveSupport()
                .getOrCreate();
        boolean flag = true;
        try {
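            // Read the source Hive table through HiveServer2 via the Hive JDBC driver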
            Dataset<Row> rowDataset = sparkSession.read()
                    .format("jdbc")
                    .option("url", "jdbc:hive2://10.58.11.24:10000/ibond")
                    .option("dbtable", "a1000w_30d")
                    .option("user", "tdops")
                    .option("password", "None")
                    .option("driver", "org.apache.hive.jdbc.HiveDriver")
                    .load();
            rowDataset.show();
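            // Write a 20-row sample back via JDBC; the CREATE TABLE issued here fails with the ParseException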
            rowDataset.limit(20)
                    .write()
                    .format("jdbc")
                    .mode(SaveMode.Overwrite)
                    .option("driver", "org.apache.hive.jdbc.HiveDriver")
                    .option("url", "jdbc:hive2://10.58.11.24:10000/ibond")
                    .option("dbtable", "mgy_test_1")
                    .option("user", "tdops")
                    .option("password", "None")
                    .save();
            log.info("Hive Demo Success.");
        } catch (Exception ex) {
            ex.printStackTrace();
            flag = false;
            throw new RuntimeException("Hive Demo Failed.", ex);
        } finally {
            sparkSession.stop();
            System.exit(flag ? 0 : -1);
        }
    }
}

console output:

org.apache.hive.service.cli.HiveSQLException: Error while compiling statement: FAILED: ParseException line 1:37 cannot recognize input near '.' 'id1' 'BIGINT' in column type
    at org.apache.hive.jdbc.Utils.verifySuccess(Utils.java:267)
    at org.apache.hive.jdbc.Utils.verifySuccessWithInfo(Utils.java:253)
    at org.apache.hive.jdbc.HiveStatement.runAsyncOnServer(HiveStatement.java:313)
    at org.apache.hive.jdbc.HiveStatement.execute(HiveStatement.java:253)
    at org.apache.hive.jdbc.HiveStatement.executeUpdate(HiveStatement.java:490)
    at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.executeStatement(JdbcUtils.scala:993)
    at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.createTable(JdbcUtils.scala:878)
    at org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:81)
    at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:46)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:90)
    at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:180)
    at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:218)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:215)
    at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:176)
    at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:132)
    at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:131)
    at org.apache.spark.sql.DataFrameWriter.$anonfun$runCommand$1(DataFrameWriter.scala:989)
    at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
    at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
    at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
    at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:772)
    at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
    at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:989)
    at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:438)
    at org.apache.spark.sql.DataFrameWriter.saveInternal(DataFrameWriter.scala:415)
    at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:301)
    at cn.tongdun.sparkdatahandler.HiveDemo.main(HiveDemo.java:46)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
    at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
    at java.base/java.lang.reflect.Method.invoke(Unknown Source)
    at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
    at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:951)
    at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
    at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
    at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
    at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1030)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1039)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: org.apache.hive.service.cli.HiveSQLException: Error while compiling statement: FAILED: ParseException line 1:37 cannot recognize input near '.' 'id1' 'BIGINT' in column type
    at org.apache.hive.service.cli.operation.Operation.toSQLException(Operation.java:380)
    at org.apache.hive.service.cli.operation.SQLOperation.prepare(SQLOperation.java:206)
    at org.apache.hive.service.cli.operation.SQLOperation.runInternal(SQLOperation.java:290)
    at org.apache.hive.service.cli.operation.Operation.run(Operation.java:320)
    at org.apache.hive.service.cli.session.HiveSessionImpl.executeStatementInternal(HiveSessionImpl.java:530)
    at org.apache.hive.service.cli.session.HiveSessionImpl.executeStatementAsync(HiveSessionImpl.java:517)
    at org.apache.hive.service.cli.CLIService.executeStatementAsync(CLIService.java:310)
    at org.apache.hive.service.cli.thrift.ThriftCLIService.ExecuteStatement(ThriftCLIService.java:530)
    at org.apache.hive.service.rpc.thrift.TCLIService$Processor$ExecuteStatement.getResult(TCLIService.java:1437)
    at org.apache.hive.service.rpc.thrift.TCLIService$Processor$ExecuteStatement.getResult(TCLIService.java:1422)
    at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39)
    at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39)
    at org.apache.hive.service.auth.TSetIpAddressProcessor.process(TSetIpAddressProcessor.java:56)
    at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:286)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.RuntimeException: org.apache.hadoop.hive.ql.parse.ParseException:line 1:37 cannot recognize input near '.' 'id1' 'BIGINT' in column type
    at org.apache.hadoop.hive.ql.parse.ParseDriver.parse(ParseDriver.java:211)
    at org.apache.hadoop.hive.ql.parse.ParseUtils.parse(ParseUtils.java:77)
    at org.apache.hadoop.hive.ql.parse.ParseUtils.parse(ParseUtils.java:70)
    at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:468)
    at org.apache.hadoop.hive.ql.Driver.compileInternal(Driver.java:1317)
    at org.apache.hadoop.hive.ql.Driver.compileAndRespond(Driver.java:1295)
    at org.apache.hive.service.cli.operation.SQLOperation.prepare(SQLOperation.java:204)
    ... 15 more

Affects Version(s)

master

Kyuubi Server Log Output

No response

Kyuubi Engine Log Output

No response

Kyuubi Server Configurations

No response

Kyuubi Engine Configurations

No response

Additional context

No response

Are you willing to submit PR?

github-actions[bot] commented 1 year ago

Hello @mgyboom, Thanks for finding the time to report the issue! We really appreciate the community's efforts to improve Apache Kyuubi (Incubating).

pan3793 commented 1 year ago

... helps to solve failures accessing Kyuubi.

The stack trace indicates that you are accessing HiveServer2 instead of Kyuubi.

BTW, it does not support writing, and this approach is not efficient; it exists mostly for security purposes.

What are your original requirements?

mgyboom commented 1 year ago

... helps to solve failures accessing Kyuubi.

The stack trace indicates that you are accessing HiveServer2 instead of Kyuubi.

BTW, it does not support writing, and this approach is not efficient; it exists mostly for security purposes.

What are your original requirements?

Yes, I'm making my Spark program access Hive data sources through JDBC. The 'kyuubi hive jdbc dialect plugin' seems to let Spark read Hive tables, but not write them. Is that because the plugin does not support writing?

yaooqinn commented 1 year ago

also cc @bowenliang123

yaooqinn commented 1 year ago

Can you check the HiveServer2 log to find the SQL string it actually received?

bowenliang123 commented 1 year ago

Plus, could you show us the DDL of tables ibond.a1000w_30d and ibond.mgy_test_1? The column type metadata could help investigate this issue.

bowenliang123 commented 1 year ago

Btw, using the JDBC RDD in Spark to write data to Kyuubi via HiveDriver or KyuubiHiveDriver does not work. It's known that Spark's JDBC RDD forces the use of the JDBC driver's addBatch method, and this method is currently not implemented in either driver.
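
A minimal sketch of where the write path stops, reusing the connection details from the reporter's program (URL, user, and table name are just placeholders from above): Spark's JDBC write path buffers rows with PreparedStatement.addBatch, and per the discussion here the Hive JDBC prepared statement rejects that call.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLFeatureNotSupportedException;

public class AddBatchProbe {
    public static void main(String[] args) throws Exception {
        // Hypothetical connection details, copied from the reporter's program.
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection(
                "jdbc:hive2://10.58.11.24:10000/ibond", "tdops", "None");
             PreparedStatement ps = conn.prepareStatement(
                     "INSERT INTO mgy_test_1 VALUES (?, ?, ?)")) {
            ps.setLong(1, 1L);
            ps.setInt(2, 0);
            ps.setString(3, "x");
            try {
                // Spark's JDBC write path (JdbcUtils.savePartition) calls addBatch();
                // per the discussion above, the Hive driver throws here instead of batching.
                ps.addBatch();
            } catch (SQLFeatureNotSupportedException e) {
                System.err.println("addBatch not supported: " + e.getMessage());
            }
        }
    }
}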

mgyboom commented 1 year ago

Yes, I found that the PreparedStatement.addBatch method is called in JdbcUtils.savePartition, but HivePreparedStatement does not implement it; it directly throws SQLFeatureNotSupportedException("Method not supported").

Does Kyuubi-Hive-JDBC plan to support writing Spark DataFrames into Hive tables in the future?

mgyboom commented 1 year ago

Plus, could you show us the DDL of tables ibond.a1000w_30d and ibond.mgy_test_1? The column type metadata could help investigate this issue.

The DDL is as follows:

create table a1000w_30d
(
    id1 bigint,
    y int,
    x0 string
)
row format serde 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
    with serdeproperties ('field.delim' = ',') stored as
    inputformat 'org.apache.hadoop.mapred.TextInputFormat'
    outputformat 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
location 'hdfs://namenode:8020/user/hive/warehouse/ibond.db/a1000w_30d'
tblproperties ('skip.header.line.count' = '1');

bowenliang123 commented 1 year ago

Is ibond.mgy_test_1 a non-existent table?

bowenliang123 commented 1 year ago

I think the problem is not the type mapping, since BIGINT is supported by Hive, and Spark correctly converts Spark's LongType to Hive's BIGINT in Spark's JDBCUtils.getCommonJDBCType following the KyuubiHiveDialect. I guess it is more likely about the auto-created column names in Spark; the '.' in a column name puzzles HiveServer2.

Try to work around this with an explicit schema definition for the JDBC writer in Spark. (But then you will face the unimplemented addBatch problem. 🤣)
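
A minimal sketch of one reading of this workaround, assuming the culprit is qualified field names coming back from the Hive JDBC read (e.g. a1000w_30d.id1) and reusing the reporter's connection details. The column rename via toDF and the createTableColumnTypes option are standard Spark JDBC writer features, but whether they avoid the ParseException here is untested, and the unimplemented addBatch problem on the write path remains:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

public class ExplicitSchemaWriteDemo {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .config("spark.sql.extensions",
                        "org.apache.spark.sql.dialect.KyuubiSparkJdbcDialectExtension")
                .getOrCreate();

        Dataset<Row> rows = spark.read()
                .format("jdbc")
                .option("url", "jdbc:hive2://10.58.11.24:10000/ibond")
                .option("dbtable", "a1000w_30d")
                .option("user", "tdops")
                .option("password", "None")
                .option("driver", "org.apache.hive.jdbc.HiveDriver")
                .load()
                // Strip any table-name prefix from the field names so the
                // generated CREATE TABLE does not contain a '.' in column names.
                .toDF("id1", "y", "x0");

        rows.limit(20)
                .write()
                .format("jdbc")
                .mode(SaveMode.Overwrite)
                .option("driver", "org.apache.hive.jdbc.HiveDriver")
                .option("url", "jdbc:hive2://10.58.11.24:10000/ibond")
                .option("dbtable", "mgy_test_1")
                // Explicit column types for the table Spark creates on write.
                .option("createTableColumnTypes", "id1 BIGINT, y INT, x0 STRING")
                .option("user", "tdops")
                .option("password", "None")
                .save();

        spark.stop();
    }
}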

mgyboom commented 1 year ago

Is ibond.mgy_test_1 a non-existent table?

Yes, ibond.mgy_test_1 does not exist.

mgyboom commented 1 year ago

I think the problem is not the type mapping, since BIGINT is supported by Hive, and Spark correctly converts Spark's LongType to Hive's BIGINT in Spark's JDBCUtils.getCommonJDBCType following the KyuubiHiveDialect. I guess it is more likely about the auto-created column names in Spark; the '.' in a column name puzzles HiveServer2.

Try to work around this with an explicit schema definition for the JDBC writer in Spark. (But then you will face the unimplemented addBatch problem. 🤣)

Thank you for your answer.

hellozepp commented 1 year ago

@mgyboom I made a simple design based on HDFS, just to set the ball rolling: https://blog.csdn.net/m0_58032574/article/details/128846418?spm=1001.2014.3001.5501