apache / gravitino

World's most powerful open data catalog for building a high-performance, geo-distributed and federated metadata lake.
https://gravitino.apache.org
Apache License 2.0

[Bug report] [spark-connector] insert failed using OpenCSVSerde #3799

Open theoryxu opened 1 month ago

theoryxu commented 1 month ago

Version

main branch

Describe what's wrong

There are some errors when using org.apache.hadoop.hive.serde2.OpenCSVSerde as the row format serde.

P.S. The Kyuubi Hive connector handles this case correctly.

Error message and/or stacktrace

Caused by: java.lang.ClassCastException: java.lang.Integer cannot be cast to org.apache.spark.unsafe.types.UTF8String
        at org.apache.spark.sql.hive.HiveInspectors.$anonfun$wrapperFor$3(HiveInspectors.scala:280)
        at org.apache.spark.sql.hive.HiveInspectors.$anonfun$withNullSafe$1(HiveInspectors.scala:262)
        at org.apache.spark.sql.hive.execution.HiveOutputWriter.write(HiveFileFormat.scala:170)
        at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.write(FileFormatDataWriter.scala:175)
        at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.write(FileFormatDataWriter.scala:137)
        at org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTask$.write(WriteToDataSourceV2Exec.scala:516)
        at org.apache.spark.sql.execution.datasources.v2.WritingSparkTask.$anonfun$run$1(WriteToDataSourceV2Exec.scala:471)
        at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1567)
        at org.apache.spark.sql.execution.datasources.v2.WritingSparkTask.run(WriteToDataSourceV2Exec.scala:509)
        at org.apache.spark.sql.execution.datasources.v2.WritingSparkTask.run$(WriteToDataSourceV2Exec.scala:448)
        at org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTask$.run(WriteToDataSourceV2Exec.scala:514)
        at org.apache.spark.sql.execution.datasources.v2.V2TableWriteExec.$anonfun$writeWithV2$2(WriteToDataSourceV2Exec.scala:411)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:92)
        at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161)
        at org.apache.spark.scheduler.Task.run(Task.scala:139)
        at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:554)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1533)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:557)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:750)

How to reproduce

use gravitino_catalog.test_db;
create table csv_table (id int, name string, age int) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde' ;
insert into csv_table values(20, 'aaaa', 18);

Additional context

No response

FANNG1 commented 1 month ago

It seems there is some special logic for OpenCSVSerde that treats all columns as strings. @theoryxu do you have time to dig out the reason and fix it?

theoryxu commented 1 month ago

> seems there is some special logic for OpenCSVSerde which treats all columns as strings, @theoryxu do you have time to dig out the reason and fix it?

OK, I'll figure it out.

theoryxu commented 2 weeks ago

This issue is related to https://issues.apache.org/jira/browse/HIVE-13709, a long-standing Hive quirk: OpenCSVSerde only supports string columns, so the schema Hive reports for such a table comes from the serde rather than the declared column types.

Because of HIVE-13709, the table's schema differs between Gravitino and the Hive Metastore when using OpenCSVSerde.
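Concretely (my reading of HIVE-13709, so take the details with a grain of salt): even if you declare typed columns, Hive derives the visible schema from OpenCSVSerde, which only knows strings. A minimal Hive session to observe this, reusing the table from the repro:

```sql
-- In Hive. The declared column types are int / string / int...
CREATE TABLE csv_table (id INT, name STRING, age INT)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde';

-- ...but because of HIVE-13709 the serde-derived schema wins, and Hive
-- describes every column as string (typically flagged "from deserializer"):
DESCRIBE csv_table;
```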

For Gravitino (`desc formatted` in Spark SQL): [screenshot]

For Hive Metastore (`desc formatted` in Hive): [screenshot]

Therefore, there are type mismatches when inserting: [screenshot]
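In the meantime, one possible workaround (a sketch, assuming the HIVE-13709 limitation is indeed the root cause) is to declare the OpenCSVSerde table with string columns only, so Gravitino's view matches what the serde actually supports, and cast back to typed values at query time:

```sql
-- Declare every column as STRING, matching what OpenCSVSerde supports.
CREATE TABLE csv_table (id STRING, name STRING, age STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde';

-- Insert string literals; no int-to-string cast is needed at write time.
INSERT INTO csv_table VALUES ('20', 'aaaa', '18');

-- Recover typed values on read:
SELECT CAST(id AS INT) AS id, name, CAST(age AS INT) AS age FROM csv_table;
```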

Should we add some compatibility handling in Gravitino, or document the limitation for users?

Could you let me know your recommendation, @FANNG1?