apache / gravitino

World's most powerful open data catalog for building a high-performance, geo-distributed and federated metadata lake.
https://gravitino.apache.org
Apache License 2.0

[Bug report] [spark-connector] insert failed using OpenCSVSerde #3799

Open theoryxu opened 1 month ago

theoryxu commented 1 month ago

Version

main branch

Describe what's wrong

There are some errors when using org.apache.hadoop.hive.serde2.OpenCSVSerde as the row format serde.

P.S. The Kyuubi Hive connector handles this case correctly.

Error message and/or stacktrace

Caused by: java.lang.ClassCastException: java.lang.Integer cannot be cast to org.apache.spark.unsafe.types.UTF8String
        at org.apache.spark.sql.hive.HiveInspectors.$anonfun$wrapperFor$3(HiveInspectors.scala:280)
        at org.apache.spark.sql.hive.HiveInspectors.$anonfun$withNullSafe$1(HiveInspectors.scala:262)
        at org.apache.spark.sql.hive.execution.HiveOutputWriter.write(HiveFileFormat.scala:170)
        at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.write(FileFormatDataWriter.scala:175)
        at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.write(FileFormatDataWriter.scala:137)
        at org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTask$.write(WriteToDataSourceV2Exec.scala:516)
        at org.apache.spark.sql.execution.datasources.v2.WritingSparkTask.$anonfun$run$1(WriteToDataSourceV2Exec.scala:471)
        at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1567)
        at org.apache.spark.sql.execution.datasources.v2.WritingSparkTask.run(WriteToDataSourceV2Exec.scala:509)
        at org.apache.spark.sql.execution.datasources.v2.WritingSparkTask.run$(WriteToDataSourceV2Exec.scala:448)
        at org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTask$.run(WriteToDataSourceV2Exec.scala:514)
        at org.apache.spark.sql.execution.datasources.v2.V2TableWriteExec.$anonfun$writeWithV2$2(WriteToDataSourceV2Exec.scala:411)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:92)
        at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161)
        at org.apache.spark.scheduler.Task.run(Task.scala:139)
        at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:554)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1533)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:557)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:750)

How to reproduce

use gravitino_catalog.test_db;
create table csv_table (id int, name string, age int) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde' ;
insert into csv_table values(20, 'aaaa', 18);

Additional context

No response

FANNG1 commented 1 month ago

It seems there is some special logic for OpenCSVSerde that treats all columns as strings. @theoryxu do you have time to dig out the reason and fix it?

theoryxu commented 1 month ago

> seems there is some special logic for OpenCSVSerde which treats all columns as strings, @theoryxu do you have time to dig out the reason and fix it?

OK, I'll figure it out.

theoryxu commented 2 weeks ago

This issue is related to https://issues.apache.org/jira/browse/HIVE-13709, a long-standing Hive quirk: OpenCSVSerde only supports string columns, so the schema Hive reports for such a table comes from the serde rather than the declared column types.

Because of HIVE-13709, the table's schema differs between Gravitino and the Hive Metastore when using OpenCSVSerde.
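Concretely (my reading of HIVE-13709, so take the details with a grain of salt): even if you declare typed columns, Hive derives the visible schema from OpenCSVSerde, which only knows strings. A minimal Hive session to observe this, reusing the table from the repro:

```sql
-- In Hive. The declared column types are int / string / int...
CREATE TABLE csv_table (id INT, name STRING, age INT)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde';

-- ...but because of HIVE-13709 the serde-derived schema wins, and Hive
-- describes every column as string (typically flagged "from deserializer"):
DESCRIBE csv_table;
```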

For Gravitino (`desc formatted` in Spark SQL): [screenshot]

For Hive Metastore (`desc formatted` in Hive): [screenshot]

Therefore, there are type mismatches when inserting: [screenshot]
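In the meantime, one possible workaround (a sketch, assuming the HIVE-13709 limitation is indeed the root cause) is to declare the OpenCSVSerde table with string columns only, so Gravitino's view matches what the serde actually supports, and cast back to typed values at query time:

```sql
-- Declare every column as STRING, matching what OpenCSVSerde supports.
CREATE TABLE csv_table (id STRING, name STRING, age STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde';

-- Insert string literals; no int-to-string cast is needed at write time.
INSERT INTO csv_table VALUES ('20', 'aaaa', '18');

-- Recover typed values on read:
SELECT CAST(id AS INT) AS id, name, CAST(age AS INT) AS age FROM csv_table;
```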

Should we add some compatibility handling in Gravitino, or document the limitation for users?

Could you let me know your recommendation, @FANNG1?