apache / hudi

Upserts, Deletes And Incremental Processing on Big Data.
https://hudi.apache.org/
Apache License 2.0

[SUPPORT] Unable to insert record into Hudi table using Hudi Spark Connector through Golang #10675

Closed: Shekkylar closed this issue 3 months ago

Shekkylar commented 6 months ago

Issue Summary

Encountering challenges while integrating the Hudi Spark Connector with Golang. Insert, update, and upsert queries are resulting in errors, while create table and select queries work without issues.

Environment

AWS EMR with Spark 3.5.0 (Amazon build, Scala 2.12) and the EMR-provided hudi-spark3.5-bundle_2.12-0.14.0-amzn-1 Hudi bundle, accessed from Golang through the spark-connect-go client (versions taken from the jar names in the logs below).

Start Spark Server Command

Executed the following command on the AWS EMR cluster, from the Spark installation directory:

cd /usr/lib/spark
./sbin/start-connect-server.sh \
  --packages org.apache.spark:spark-connect_2.12:3.5.0 \
  --jars /usr/lib/hudi/hudi-spark-bundle.jar \
  --conf "spark.serializer=org.apache.spark.serializer.KryoSerializer" \
  --conf "spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog"  \
  --conf "spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension" \
  --conf "spark.sql.catalog.aws.glue.sync.tool.classes=org.apache.hudi.aws.sync.AwsGlueCatalogSyncTool"

Golang Code Snippet

package main

import (
    "fmt"
    "log"

    "github.com/apache/spark-connect-go/v34/client/sql"
)

func main() {
    remote := "sc://localhost:8157"
    spark, err := sql.SparkSession.Builder.Remote(remote).Build()
    if err != nil {
        fmt.Println(err)
        log.Fatal("Failed to connect to Spark:", err)
    }

    // Example SQL query to show all tables
    query := "SHOW TABLES"
    alltab, err := spark.Sql(query)
    if err != nil {
        log.Fatal("Failed to execute SQL query:", err)
    }

    // Show the result
    alltab.Show(10, true)

    // Create the Hudi table with the basic schema
    _, err = spark.Sql(`create table hudi_table (
        id bigint,
        name string,
        dt string
      ) using hudi
      LOCATION "s3://spark-hudi-table/output/"
      TBLPROPERTIES (
        type = "cow",
        primaryKey = "id"
      )
      partitioned by (dt);`)
    if err != nil {
        fmt.Println("failed to Create", err)
    }

    // Insert data into the Hudi table
    _, err = spark.Sql(`insert into default.hudi_table (id, name, dt) VALUES (1, 'test 1', '2023-11-11'), (2, 'test 2', '2023-11-12');`)
    if err != nil {
        fmt.Println("Failed to insert data into Hudi table:", err)
    }

    // Query the Hudi table
    result, err := spark.Sql("SELECT * FROM hudi_table")
    if err != nil {
        fmt.Println("Failed to query Hudi table:", err)
    }

    // Show the result
    result.Show(10, true)

    // Stop the Spark session
    spark.Stop()
}

Issue Details
While executing the insert query, the Spark job fails with the following error, taken from the Spark Connect server logs:

Encountered errors:

24/02/15 12:43:14 INFO Javalin: Starting Javalin ... 24/02/15 12:43:14 INFO Javalin: You are running Javalin 4.6.7 (released October 24, 2022. Your Javalin version is 479 days old. Consider checking for a newer version.). 24/02/15 12:43:14 INFO Javalin: Listening on http://localhost:39459/ 24/02/15 12:43:14 INFO Javalin: Javalin started in 151ms \o/ 24/02/15 12:43:14 INFO CodeGenerator: Code generated in 14.79973 ms 24/02/15 12:43:14 INFO S3NativeFileSystem: Opening 's3://spark-hudi-table/output/.hoodie/hoodie.properties' for reading 24/02/15 12:43:14 INFO S3NativeFileSystem: Opening 's3://spark-hudi-table/output/.hoodie/hoodie.properties' for reading 24/02/15 12:43:14 INFO S3NativeFileSystem: Opening 's3://spark-hudi-table/output/.hoodie/hoodie.properties' for reading 24/02/15 12:43:15 INFO S3NativeFileSystem: Opening 's3://spark-hudi-table/output/.hoodie/hoodie.properties' for reading 24/02/15 12:43:15 INFO MultipartUploadOutputStream: close closed:false s3://spark-hudi-table/output/.hoodie/20240215124314041.commit.requested 24/02/15 12:43:15 INFO S3NativeFileSystem: Opening 's3://spark-hudi-table/output/.hoodie/hoodie.properties' for reading 24/02/15 12:43:15 INFO S3NativeFileSystem: Opening 's3://spark-hudi-table/output/.hoodie/hoodie.properties' for reading 24/02/15 12:43:16 INFO MultipartUploadOutputStream: close closed:false s3://spark-hudi-table/output/.hoodie/metadata/.hoodie/hoodie.properties 24/02/15 12:43:17 INFO S3NativeFileSystem: Opening 's3://spark-hudi-table/output/.hoodie/metadata/.hoodie/hoodie.properties' for reading 24/02/15 12:43:17 INFO SparkContext: Starting job: Spark Connect - session_id: "66f20158-e2df-4941-b6f4-4565c534143b" user_context { user_id: "na" } plan { command { sql_command { ... 24/02/15 12:43:17 INFO DAGScheduler: Got job 0 (Spark Connect - session_id: "66f20158-e2df-4941-b6f4-4565c534143b" user_context { user_id: "na" } plan { command { sql_command { ...) with 1 output partitions 24/02/15 12:43:17 INFO DAGScheduler: Final stage: ResultStage 0 (Spark Connect - session_id: "66f20158-e2df-4941-b6f4-4565c534143b" user_context { user_id: "na" } plan { command { sql_command { ...) 24/02/15 12:43:17 INFO DAGScheduler: Parents of final stage: List() 24/02/15 12:43:17 INFO DAGScheduler: Missing parents: List() 24/02/15 12:43:17 INFO DAGScheduler: Submitting ResultStage 0 (MapPartitionsRDD[7] at Spark Connect - session_id: "66f20158-e2df-4941-b6f4-4565c534143b" user_context { user_id: "na" } plan { command { sql_command { ...), which has no missing parents 24/02/15 12:43:17 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 112.6 KiB, free 1048.7 MiB) 24/02/15 12:43:17 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 42.1 KiB, free 1048.6 MiB) 24/02/15 12:43:17 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on ip-172-31-46-34.ap-south-1.compute.internal:42365 (size: 42.1 KiB, free: 1048.8 MiB) 24/02/15 12:43:17 INFO SparkContext: Created broadcast 0 from broadcast at DAGScheduler.scala:1656 24/02/15 12:43:17 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 0 (MapPartitionsRDD[7] at Spark Connect - session_id: "66f20158-e2df-4941-b6f4-4565c534143b" user_context { user_id: "na" } plan { command { sql_command { ...) 
(first 15 tasks are for partitions Vector(0)) 24/02/15 12:43:17 INFO YarnScheduler: Adding task set 0.0 with 1 tasks resource profile 0 24/02/15 12:43:18 INFO ExecutorAllocationManager: Requesting 1 new executor because tasks are backlogged (new desired total will be 1 for resource profile id: 0) 24/02/15 12:43:22 INFO YarnSchedulerBackend$YarnDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (172.31.40.114:55534) with ID 2, ResourceProfileId 0 24/02/15 12:43:22 INFO ExecutorMonitor: New executor 2 has registered (new total is 1) 24/02/15 12:43:22 INFO BlockManagerMasterEndpoint: Registering block manager ip-172-31-40-114.ap-south-1.compute.internal:45125 with 2.3 GiB RAM, BlockManagerId(2, ip-172-31-40-114.ap-south-1.compute.internal, 45125, None) 24/02/15 12:43:22 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0) (ip-172-31-40-114.ap-south-1.compute.internal, executor 2, partition 0, PROCESS_LOCAL, 7863 bytes) 24/02/15 12:43:23 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on ip-172-31-40-114.ap-south-1.compute.internal:45125 (size: 42.1 KiB, free: 2.3 GiB) 24/02/15 12:43:24 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0) (ip-172-31-40-114.ap-south-1.compute.internal executor 2): java.lang.ClassCastException: class org.apache.hudi.hadoop.SerializablePath cannot be cast to class org.apache.hudi.hadoop.SerializablePath (org.apache.hudi.hadoop.SerializablePath is in unnamed module of loader org.apache.spark.util.MutableURLClassLoader @5b675711; org.apache.hudi.hadoop.SerializablePath is in unnamed module of loader org.apache.spark.util.ChildFirstURLClassLoader @1875dd5a) at org.apache.spark.api.java.JavaPairRDD$.$anonfun$toScalaFunction$1(JavaPairRDD.scala:1070) at scala.collection.Iterator$$anon$10.next(Iterator.scala:461) at scala.collection.Iterator.foreach(Iterator.scala:943) at scala.collection.Iterator.foreach$(Iterator.scala:943) at scala.collection.AbstractIterator.foreach(Iterator.scala:1431) at scala.collection.generic.Growable.$plus$plus$eq(Growable.scala:62) at scala.collection.generic.Growable.$plus$plus$eq$(Growable.scala:53) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:105) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:49) at scala.collection.TraversableOnce.to(TraversableOnce.scala:366) at scala.collection.TraversableOnce.to$(TraversableOnce.scala:364) at scala.collection.AbstractIterator.to(Iterator.scala:1431) at scala.collection.TraversableOnce.toBuffer(TraversableOnce.scala:358) at scala.collection.TraversableOnce.toBuffer$(TraversableOnce.scala:358) at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1431) at scala.collection.TraversableOnce.toArray(TraversableOnce.scala:345) at scala.collection.TraversableOnce.toArray$(TraversableOnce.scala:339) at scala.collection.AbstractIterator.toArray(Iterator.scala:1431) at org.apache.spark.rdd.RDD.$anonfun$collect$2(RDD.scala:1046) at org.apache.spark.SparkContext.$anonfun$runJob$5(SparkContext.scala:2446) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:93) at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161) at org.apache.spark.scheduler.Task.run(Task.scala:143) at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:629) at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:64) at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:61) at 
org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:95) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:632) at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) at java.base/java.lang.Thread.run(Thread.java:840)

24/02/15 12:43:24 INFO TaskSetManager: Starting task 0.1 in stage 0.0 (TID 1) (ip-172-31-40-114.ap-south-1.compute.internal, executor 2, partition 0, PROCESS_LOCAL, 7863 bytes) 24/02/15 12:43:24 INFO TaskSetManager: Lost task 0.1 in stage 0.0 (TID 1) on ip-172-31-40-114.ap-south-1.compute.internal, executor 2: java.lang.ClassCastException (class org.apache.hudi.hadoop.SerializablePath cannot be cast to class org.apache.hudi.hadoop.SerializablePath (org.apache.hudi.hadoop.SerializablePath is in unnamed module of loader org.apache.spark.util.MutableURLClassLoader @5b675711; org.apache.hudi.hadoop.SerializablePath is in unnamed module of loader org.apache.spark.util.ChildFirstURLClassLoader @1875dd5a)) [duplicate 1] 24/02/15 12:43:24 INFO TaskSetManager: Starting task 0.2 in stage 0.0 (TID 2) (ip-172-31-40-114.ap-south-1.compute.internal, executor 2, partition 0, PROCESS_LOCAL, 7863 bytes) 24/02/15 12:43:24 INFO TaskSetManager: Lost task 0.2 in stage 0.0 (TID 2) on ip-172-31-40-114.ap-south-1.compute.internal, executor 2: java.lang.ClassCastException (class org.apache.hudi.hadoop.SerializablePath cannot be cast to class org.apache.hudi.hadoop.SerializablePath (org.apache.hudi.hadoop.SerializablePath is in unnamed module of loader org.apache.spark.util.MutableURLClassLoader @5b675711; org.apache.hudi.hadoop.SerializablePath is in unnamed module of loader org.apache.spark.util.ChildFirstURLClassLoader @1875dd5a)) [duplicate 2] 24/02/15 12:43:24 INFO TaskSetManager: Starting task 0.3 in stage 0.0 (TID 3) (ip-172-31-40-114.ap-south-1.compute.internal, executor 2, partition 0, PROCESS_LOCAL, 7863 bytes) 24/02/15 12:43:24 INFO TaskSetManager: Lost task 0.3 in stage 0.0 (TID 3) on ip-172-31-40-114.ap-south-1.compute.internal, executor 2: java.lang.ClassCastException (class org.apache.hudi.hadoop.SerializablePath cannot be cast to class org.apache.hudi.hadoop.SerializablePath (org.apache.hudi.hadoop.SerializablePath is in unnamed module of loader org.apache.spark.util.MutableURLClassLoader @5b675711; org.apache.hudi.hadoop.SerializablePath is in unnamed module of loader org.apache.spark.util.ChildFirstURLClassLoader @1875dd5a)) [duplicate 3] 24/02/15 12:43:24 ERROR TaskSetManager: Task 0 in stage 0.0 failed 4 times; aborting job 24/02/15 12:43:24 INFO YarnScheduler: Removed TaskSet 0.0, whose tasks have all completed, from pool 24/02/15 12:43:24 INFO YarnScheduler: Cancelling stage 0 24/02/15 12:43:24 INFO YarnScheduler: Killing all running tasks in stage 0: Stage cancelled: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3) (ip-172-31-40-114.ap-south-1.compute.internal executor 2): java.lang.ClassCastException: class org.apache.hudi.hadoop.SerializablePath cannot be cast to class org.apache.hudi.hadoop.SerializablePath (org.apache.hudi.hadoop.SerializablePath is in unnamed module of loader org.apache.spark.util.MutableURLClassLoader @5b675711; org.apache.hudi.hadoop.SerializablePath is in unnamed module of loader org.apache.spark.util.ChildFirstURLClassLoader @1875dd5a) at org.apache.spark.api.java.JavaPairRDD$.$anonfun$toScalaFunction$1(JavaPairRDD.scala:1070) at scala.collection.Iterator$$anon$10.next(Iterator.scala:461) at scala.collection.Iterator.foreach(Iterator.scala:943) at scala.collection.Iterator.foreach$(Iterator.scala:943) at scala.collection.AbstractIterator.foreach(Iterator.scala:1431) at scala.collection.generic.Growable.$plus$plus$eq(Growable.scala:62) at 
scala.collection.generic.Growable.$plus$plus$eq$(Growable.scala:53) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:105) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:49) at scala.collection.TraversableOnce.to(TraversableOnce.scala:366) at scala.collection.TraversableOnce.to$(TraversableOnce.scala:364) at scala.collection.AbstractIterator.to(Iterator.scala:1431) at scala.collection.TraversableOnce.toBuffer(TraversableOnce.scala:358) at scala.collection.TraversableOnce.toBuffer$(TraversableOnce.scala:358) at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1431) at scala.collection.TraversableOnce.toArray(TraversableOnce.scala:345) at scala.collection.TraversableOnce.toArray$(TraversableOnce.scala:339) at scala.collection.AbstractIterator.toArray(Iterator.scala:1431) at org.apache.spark.rdd.RDD.$anonfun$collect$2(RDD.scala:1046) at org.apache.spark.SparkContext.$anonfun$runJob$5(SparkContext.scala:2446) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:93) at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161) at org.apache.spark.scheduler.Task.run(Task.scala:143) at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:629) at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:64) at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:61) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:95) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:632) at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) at java.base/java.lang.Thread.run(Thread.java:840)

Driver stacktrace: 24/02/15 12:43:24 INFO DAGScheduler: ResultStage 0 (Spark Connect - session_id: "66f20158-e2df-4941-b6f4-4565c534143b" user_context { user_id: "na" } plan { command { sql_command { ...) failed in 6.879 s due to Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3) (ip-172-31-40-114.ap-south-1.compute.internal executor 2): java.lang.ClassCastException: class org.apache.hudi.hadoop.SerializablePath cannot be cast to class org.apache.hudi.hadoop.SerializablePath (org.apache.hudi.hadoop.SerializablePath is in unnamed module of loader org.apache.spark.util.MutableURLClassLoader @5b675711; org.apache.hudi.hadoop.SerializablePath is in unnamed module of loader org.apache.spark.util.ChildFirstURLClassLoader @1875dd5a) at org.apache.spark.api.java.JavaPairRDD$.$anonfun$toScalaFunction$1(JavaPairRDD.scala:1070) at scala.collection.Iterator$$anon$10.next(Iterator.scala:461) at scala.collection.Iterator.foreach(Iterator.scala:943) at scala.collection.Iterator.foreach$(Iterator.scala:943) at scala.collection.AbstractIterator.foreach(Iterator.scala:1431) at scala.collection.generic.Growable.$plus$plus$eq(Growable.scala:62) at scala.collection.generic.Growable.$plus$plus$eq$(Growable.scala:53) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:105) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:49) at scala.collection.TraversableOnce.to(TraversableOnce.scala:366) at scala.collection.TraversableOnce.to$(TraversableOnce.scala:364) at scala.collection.AbstractIterator.to(Iterator.scala:1431) at scala.collection.TraversableOnce.toBuffer(TraversableOnce.scala:358) at scala.collection.TraversableOnce.toBuffer$(TraversableOnce.scala:358) at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1431) at scala.collection.TraversableOnce.toArray(TraversableOnce.scala:345) at scala.collection.TraversableOnce.toArray$(TraversableOnce.scala:339) at scala.collection.AbstractIterator.toArray(Iterator.scala:1431) at org.apache.spark.rdd.RDD.$anonfun$collect$2(RDD.scala:1046) at org.apache.spark.SparkContext.$anonfun$runJob$5(SparkContext.scala:2446) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:93) at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161) at org.apache.spark.scheduler.Task.run(Task.scala:143) at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:629) at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:64) at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:61) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:95) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:632) at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) at java.base/java.lang.Thread.run(Thread.java:840)

Driver stacktrace: 24/02/15 12:43:24 INFO DAGScheduler: Job 0 failed: Spark Connect - session_id: "66f20158-e2df-4941-b6f4-4565c534143b" user_context { user_id: "na" } plan { command { sql_command { ..., took 6.932404 s 24/02/15 12:43:24 INFO DAGScheduler: Asked to cancel jobs with tag SparkConnect_OperationTag_User_66f20158-e2df-4941-b6f4-4565c534143b_Session_na_Operation_28cf34dd-febc-4944-b727-e6dff0849635 24/02/15 12:43:24 INFO ErrorUtils: Spark Connect error during: execute. UserId: na. SessionId: 66f20158-e2df-4941-b6f4-4565c534143b. org.apache.hudi.exception.HoodieException: Failed to instantiate Metadata table at org.apache.hudi.client.SparkRDDWriteClient.initializeMetadataTable(SparkRDDWriteClient.java:293) ~[hudi-spark3.5-bundle_2.12-0.14.0-amzn-1.jar:0.14.0-amzn-1] at org.apache.hudi.client.SparkRDDWriteClient.initMetadataTable(SparkRDDWriteClient.java:273) ~[hudi-spark3.5-bundle_2.12-0.14.0-amzn-1.jar:0.14.0-amzn-1] at org.apache.hudi.client.BaseHoodieWriteClient.doInitTable(BaseHoodieWriteClient.java:1263) ~[hudi-spark3.5-bundle_2.12-0.14.0-amzn-1.jar:0.14.0-amzn-1] at org.apache.hudi.client.BaseHoodieWriteClient.initTable(BaseHoodieWriteClient.java:1303) ~[hudi-spark3.5-bundle_2.12-0.14.0-amzn-1.jar:0.14.0-amzn-1] at org.apache.hudi.client.SparkRDDWriteClient.insert(SparkRDDWriteClient.java:164) ~[hudi-spark3.5-bundle_2.12-0.14.0-amzn-1.jar:0.14.0-amzn-1] at org.apache.hudi.DataSourceUtils.doWriteOperation(DataSourceUtils.java:218) ~[hudi-spark3.5-bundle_2.12-0.14.0-amzn-1.jar:0.14.0-amzn-1] at org.apache.hudi.HoodieSparkSqlWriter$.writeInternal(HoodieSparkSqlWriter.scala:431) ~[hudi-spark3.5-bundle_2.12-0.14.0-amzn-1.jar:0.14.0-amzn-1] at org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:132) ~[hudi-spark3.5-bundle_2.12-0.14.0-amzn-1.jar:0.14.0-amzn-1] at org.apache.spark.sql.hudi.command.InsertIntoHoodieTableCommand$.run(InsertIntoHoodieTableCommand.scala:108) ~[hudi-spark3.5-bundle_2.12-0.14.0-amzn-1.jar:0.14.0-amzn-1] at org.apache.spark.sql.hudi.command.InsertIntoHoodieTableCommand.run(InsertIntoHoodieTableCommand.scala:61) ~[hudi-spark3.5-bundle_2.12-0.14.0-amzn-1.jar:0.14.0-amzn-1] at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:75) ~[spark-sql_2.12-3.5.0-amzn-0.jar:3.5.0-amzn-0] at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:73) ~[spark-sql_2.12-3.5.0-amzn-0.jar:3.5.0-amzn-0] at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:84) ~[spark-sql_2.12-3.5.0-amzn-0.jar:3.5.0-amzn-0] at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:113) ~[spark-sql_2.12-3.5.0-amzn-0.jar:3.5.0-amzn-0] at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:108) ~[spark-catalyst_2.12-3.5.0-amzn-0.jar:3.5.0-amzn-0] at org.apache.spark.sql.execution.SQLExecution$.withTracker(SQLExecution.scala:255) ~[spark-sql_2.12-3.5.0-amzn-0.jar:3.5.0-amzn-0] at org.apache.spark.sql.execution.SQLExecution$.executeQuery$1(SQLExecution.scala:129) ~[spark-sql_2.12-3.5.0-amzn-0.jar:3.5.0-amzn-0] at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$9(SQLExecution.scala:165) ~[spark-sql_2.12-3.5.0-amzn-0.jar:3.5.0-amzn-0] at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:108) ~[spark-catalyst_2.12-3.5.0-amzn-0.jar:3.5.0-amzn-0] at 
org.apache.spark.sql.execution.SQLExecution$.withTracker(SQLExecution.scala:255) ~[spark-sql_2.12-3.5.0-amzn-0.jar:3.5.0-amzn-0] at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$8(SQLExecution.scala:165) ~[spark-sql_2.12-3.5.0-amzn-0.jar:3.5.0-amzn-0] at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:276) ~[spark-sql_2.12-3.5.0-amzn-0.jar:3.5.0-amzn-0] at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:164) ~[spark-sql_2.12-3.5.0-amzn-0.jar:3.5.0-amzn-0] at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:900) ~[spark-sql_2.12-3.5.0-amzn-0.jar:3.5.0-amzn-0] at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:70) ~[spark-sql_2.12-3.5.0-amzn-0.jar:3.5.0-amzn-0] at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:110) ~[spark-sql_2.12-3.5.0-amzn-0.jar:3.5.0-amzn-0] at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:101) ~[spark-sql_2.12-3.5.0-amzn-0.jar:3.5.0-amzn-0] at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:503) ~[spark-catalyst_2.12-3.5.0-amzn-0.jar:3.5.0-amzn-0] at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(origin.scala:76) ~[spark-sql-api_2.12-3.5.0-amzn-0.jar:3.5.0-amzn-0] at org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:503) ~[spark-catalyst_2.12-3.5.0-amzn-0.jar:3.5.0-amzn-0] at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDownWithPruning(LogicalPlan.scala:33) ~[spark-catalyst_2.12-3.5.0-amzn-0.jar:3.5.0-amzn-0] at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning(AnalysisHelper.scala:267) ~[spark-catalyst_2.12-3.5.0-amzn-0.jar:3.5.0-amzn-0] at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning$(AnalysisHelper.scala:263) ~[spark-catalyst_2.12-3.5.0-amzn-0.jar:3.5.0-amzn-0] at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:33) ~[spark-catalyst_2.12-3.5.0-amzn-0.jar:3.5.0-amzn-0] at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:33) ~[spark-catalyst_2.12-3.5.0-amzn-0.jar:3.5.0-amzn-0] at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:479) ~[spark-catalyst_2.12-3.5.0-amzn-0.jar:3.5.0-amzn-0] at org.apache.spark.sql.execution.QueryExecution.eagerlyExecuteCommands(QueryExecution.scala:101) ~[spark-sql_2.12-3.5.0-amzn-0.jar:3.5.0-amzn-0] at org.apache.spark.sql.execution.QueryExecution.commandExecuted$lzycompute(QueryExecution.scala:88) ~[spark-sql_2.12-3.5.0-amzn-0.jar:3.5.0-amzn-0] at org.apache.spark.sql.execution.QueryExecution.commandExecuted(QueryExecution.scala:86) ~[spark-sql_2.12-3.5.0-amzn-0.jar:3.5.0-amzn-0] at org.apache.spark.sql.Dataset.(Dataset.scala:222) ~[spark-sql_2.12-3.5.0-amzn-0.jar:3.5.0-amzn-0] at org.apache.spark.sql.Dataset$.$anonfun$ofRows$2(Dataset.scala:102) ~[spark-sql_2.12-3.5.0-amzn-0.jar:3.5.0-amzn-0] at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:900) ~[spark-sql_2.12-3.5.0-amzn-0.jar:3.5.0-amzn-0] at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:99) ~[spark-sql_2.12-3.5.0-amzn-0.jar:3.5.0-amzn-0] at 
org.apache.spark.sql.SparkSession.$anonfun$sql$4(SparkSession.scala:691) ~[spark-sql_2.12-3.5.0-amzn-0.jar:3.5.0-amzn-0] at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:900) ~[spark-sql_2.12-3.5.0-amzn-0.jar:3.5.0-amzn-0] at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:682) ~[spark-sql_2.12-3.5.0-amzn-0.jar:3.5.0-amzn-0] at org.apache.spark.sql.connect.planner.SparkConnectPlanner.handleSqlCommand(SparkConnectPlanner.scala:2461) ~[org.apache.spark_spark-connect_2.12-3.5.0.jar:3.5.0] at org.apache.spark.sql.connect.planner.SparkConnectPlanner.process(SparkConnectPlanner.scala:2426) ~[org.apache.spark_spark-connect_2.12-3.5.0.jar:3.5.0] at org.apache.spark.sql.connect.execution.ExecuteThreadRunner.handleCommand(ExecuteThreadRunner.scala:202) ~[org.apache.spark_spark-connect_2.12-3.5.0.jar:3.5.0] at org.apache.spark.sql.connect.execution.ExecuteThreadRunner.$anonfun$executeInternal$1(ExecuteThreadRunner.scala:158) ~[org.apache.spark_spark-connect_2.12-3.5.0.jar:3.5.0] at org.apache.spark.sql.connect.execution.ExecuteThreadRunner.$anonfun$executeInternal$1$adapted(ExecuteThreadRunner.scala:132) ~[org.apache.spark_spark-connect_2.12-3.5.0.jar:3.5.0] at org.apache.spark.sql.connect.service.SessionHolder.$anonfun$withSession$2(SessionHolder.scala:189) ~[org.apache.spark_spark-connect_2.12-3.5.0.jar:3.5.0] at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:900) ~[spark-sql_2.12-3.5.0-amzn-0.jar:3.5.0-amzn-0] at org.apache.spark.sql.connect.service.SessionHolder.$anonfun$withSession$1(SessionHolder.scala:189) ~[org.apache.spark_spark-connect_2.12-3.5.0.jar:3.5.0] at org.apache.spark.JobArtifactSet$.withActiveJobArtifactState(JobArtifactSet.scala:94) ~[spark-core_2.12-3.5.0-amzn-0.jar:3.5.0-amzn-0] at org.apache.spark.sql.connect.service.SessionHolder.$anonfun$withContextClassLoader$1(SessionHolder.scala:176) ~[org.apache.spark_spark-connect_2.12-3.5.0.jar:3.5.0] at org.apache.spark.util.Utils$.withContextClassLoader(Utils.scala:179) ~[spark-core_2.12-3.5.0-amzn-0.jar:3.5.0-amzn-0] at org.apache.spark.sql.connect.service.SessionHolder.withContextClassLoader(SessionHolder.scala:175) ~[org.apache.spark_spark-connect_2.12-3.5.0.jar:3.5.0] at org.apache.spark.sql.connect.service.SessionHolder.withSession(SessionHolder.scala:188) ~[org.apache.spark_spark-connect_2.12-3.5.0.jar:3.5.0] at org.apache.spark.sql.connect.execution.ExecuteThreadRunner.executeInternal(ExecuteThreadRunner.scala:132) ~[org.apache.spark_spark-connect_2.12-3.5.0.jar:3.5.0] at org.apache.spark.sql.connect.execution.ExecuteThreadRunner.org$apache$spark$sql$connect$execution$ExecuteThreadRunner$$execute(ExecuteThreadRunner.scala:84) [org.apache.spark_spark-connect_2.12-3.5.0.jar:3.5.0] at org.apache.spark.sql.connect.execution.ExecuteThreadRunner$ExecutionThread.run(ExecuteThreadRunner.scala:228) [org.apache.spark_spark-connect_2.12-3.5.0.jar:3.5.0] Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3) (ip-172-31-40-114.ap-south-1.compute.internal executor 2): java.lang.ClassCastException: class org.apache.hudi.hadoop.SerializablePath cannot be cast to class org.apache.hudi.hadoop.SerializablePath (org.apache.hudi.hadoop.SerializablePath is in unnamed module of loader org.apache.spark.util.MutableURLClassLoader @5b675711; org.apache.hudi.hadoop.SerializablePath is in unnamed module of loader org.apache.spark.util.ChildFirstURLClassLoader @1875dd5a) at 
org.apache.spark.api.java.JavaPairRDD$.$anonfun$toScalaFunction$1(JavaPairRDD.scala:1070) at scala.collection.Iterator$$anon$10.next(Iterator.scala:461) at scala.collection.Iterator.foreach(Iterator.scala:943) at scala.collection.Iterator.foreach$(Iterator.scala:943) at scala.collection.AbstractIterator.foreach(Iterator.scala:1431) at scala.collection.generic.Growable.$plus$plus$eq(Growable.scala:62) at scala.collection.generic.Growable.$plus$plus$eq$(Growable.scala:53) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:105) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:49) at scala.collection.TraversableOnce.to(TraversableOnce.scala:366) at scala.collection.TraversableOnce.to$(TraversableOnce.scala:364) at scala.collection.AbstractIterator.to(Iterator.scala:1431) at scala.collection.TraversableOnce.toBuffer(TraversableOnce.scala:358) at scala.collection.TraversableOnce.toBuffer$(TraversableOnce.scala:358) at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1431) at scala.collection.TraversableOnce.toArray(TraversableOnce.scala:345) at scala.collection.TraversableOnce.toArray$(TraversableOnce.scala:339) at scala.collection.AbstractIterator.toArray(Iterator.scala:1431) at org.apache.spark.rdd.RDD.$anonfun$collect$2(RDD.scala:1046) at org.apache.spark.SparkContext.$anonfun$runJob$5(SparkContext.scala:2446) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:93) at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161) at org.apache.spark.scheduler.Task.run(Task.scala:143) at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:629) at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:64) at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:61) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:95) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:632) at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) at java.base/java.lang.Thread.run(Thread.java:840)

Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:3067) ~[spark-core_2.12-3.5.0-amzn-0.jar:3.5.0-amzn-0] at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:3003) ~[spark-core_2.12-3.5.0-amzn-0.jar:3.5.0-amzn-0] at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:3002) ~[spark-core_2.12-3.5.0-amzn-0.jar:3.5.0-amzn-0] at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62) ~[scala-library-2.12.17.jar:?] at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55) ~[scala-library-2.12.17.jar:?] at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49) ~[scala-library-2.12.17.jar:?] at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:3002) ~[spark-core_2.12-3.5.0-amzn-0.jar:3.5.0-amzn-0] at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1318) ~[spark-core_2.12-3.5.0-amzn-0.jar:3.5.0-amzn-0] at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1318) ~[spark-core_2.12-3.5.0-amzn-0.jar:3.5.0-amzn-0] at scala.Option.foreach(Option.scala:407) ~[scala-library-2.12.17.jar:?] at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1318) ~[spark-core_2.12-3.5.0-amzn-0.jar:3.5.0-amzn-0] at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:3271) ~[spark-core_2.12-3.5.0-amzn-0.jar:3.5.0-amzn-0] at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:3205) ~[spark-core_2.12-3.5.0-amzn-0.jar:3.5.0-amzn-0] at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:3194) ~[spark-core_2.12-3.5.0-amzn-0.jar:3.5.0-amzn-0] at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49) ~[spark-core_2.12-3.5.0-amzn-0.jar:3.5.0-amzn-0] at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:1041) ~[spark-core_2.12-3.5.0-amzn-0.jar:3.5.0-amzn-0] at org.apache.spark.SparkContext.runJob(SparkContext.scala:2406) ~[spark-core_2.12-3.5.0-amzn-0.jar:3.5.0-amzn-0] at org.apache.spark.SparkContext.runJob(SparkContext.scala:2427) ~[spark-core_2.12-3.5.0-amzn-0.jar:3.5.0-amzn-0] at org.apache.spark.SparkContext.runJob(SparkContext.scala:2446) ~[spark-core_2.12-3.5.0-amzn-0.jar:3.5.0-amzn-0] at org.apache.spark.SparkContext.runJob(SparkContext.scala:2471) ~[spark-core_2.12-3.5.0-amzn-0.jar:3.5.0-amzn-0] at org.apache.spark.rdd.RDD.$anonfun$collect$1(RDD.scala:1046) ~[spark-core_2.12-3.5.0-amzn-0.jar:3.5.0-amzn-0] at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) ~[spark-core_2.12-3.5.0-amzn-0.jar:3.5.0-amzn-0] at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) ~[spark-core_2.12-3.5.0-amzn-0.jar:3.5.0-amzn-0] at org.apache.spark.rdd.RDD.withScope(RDD.scala:407) ~[spark-core_2.12-3.5.0-amzn-0.jar:3.5.0-amzn-0] at org.apache.spark.rdd.RDD.collect(RDD.scala:1045) ~[spark-core_2.12-3.5.0-amzn-0.jar:3.5.0-amzn-0] at org.apache.spark.api.java.JavaRDDLike.collect(JavaRDDLike.scala:362) ~[spark-core_2.12-3.5.0-amzn-0.jar:3.5.0-amzn-0] at org.apache.spark.api.java.JavaRDDLike.collect$(JavaRDDLike.scala:361) ~[spark-core_2.12-3.5.0-amzn-0.jar:3.5.0-amzn-0] at org.apache.spark.api.java.AbstractJavaRDDLike.collect(JavaRDDLike.scala:45) ~[spark-core_2.12-3.5.0-amzn-0.jar:3.5.0-amzn-0] at 
org.apache.hudi.client.common.HoodieSparkEngineContext.map(HoodieSparkEngineContext.java:116) ~[hudi-spark3.5-bundle_2.12-0.14.0-amzn-1.jar:0.14.0-amzn-1] at org.apache.hudi.metadata.HoodieBackedTableMetadataWriter.listAllPartitionsFromFilesystem(HoodieBackedTableMetadataWriter.java:654) ~[hudi-spark3.5-bundle_2.12-0.14.0-amzn-1.jar:0.14.0-amzn-1] at org.apache.hudi.metadata.HoodieBackedTableMetadataWriter.initializeFromFilesystem(HoodieBackedTableMetadataWriter.java:388) ~[hudi-spark3.5-bundle_2.12-0.14.0-amzn-1.jar:0.14.0-amzn-1] at org.apache.hudi.metadata.HoodieBackedTableMetadataWriter.initializeIfNeeded(HoodieBackedTableMetadataWriter.java:278) ~[hudi-spark3.5-bundle_2.12-0.14.0-amzn-1.jar:0.14.0-amzn-1] at org.apache.hudi.metadata.HoodieBackedTableMetadataWriter.(HoodieBackedTableMetadataWriter.java:182) ~[hudi-spark3.5-bundle_2.12-0.14.0-amzn-1.jar:0.14.0-amzn-1] at org.apache.hudi.metadata.SparkHoodieBackedTableMetadataWriter.(SparkHoodieBackedTableMetadataWriter.java:95) ~[hudi-spark3.5-bundle_2.12-0.14.0-amzn-1.jar:0.14.0-amzn-1] at org.apache.hudi.metadata.SparkHoodieBackedTableMetadataWriter.create(SparkHoodieBackedTableMetadataWriter.java:72) ~[hudi-spark3.5-bundle_2.12-0.14.0-amzn-1.jar:0.14.0-amzn-1] at org.apache.hudi.client.SparkRDDWriteClient.initializeMetadataTable(SparkRDDWriteClient.java:287) ~[hudi-spark3.5-bundle_2.12-0.14.0-amzn-1.jar:0.14.0-amzn-1] ... 61 more Caused by: java.lang.ClassCastException: class org.apache.hudi.hadoop.SerializablePath cannot be cast to class org.apache.hudi.hadoop.SerializablePath (org.apache.hudi.hadoop.SerializablePath is in unnamed module of loader org.apache.spark.util.MutableURLClassLoader @5b675711; org.apache.hudi.hadoop.SerializablePath is in unnamed module of loader org.apache.spark.util.ChildFirstURLClassLoader @1875dd5a) at org.apache.spark.api.java.JavaPairRDD$.$anonfun$toScalaFunction$1(JavaPairRDD.scala:1070) ~[spark-core_2.12-3.5.0-amzn-0.jar:3.5.0-amzn-0] at scala.collection.Iterator$$anon$10.next(Iterator.scala:461) ~[scala-library-2.12.17.jar:?] at scala.collection.Iterator.foreach(Iterator.scala:943) ~[scala-library-2.12.17.jar:?] at scala.collection.Iterator.foreach$(Iterator.scala:943) ~[scala-library-2.12.17.jar:?] at scala.collection.AbstractIterator.foreach(Iterator.scala:1431) ~[scala-library-2.12.17.jar:?] at scala.collection.generic.Growable.$plus$plus$eq(Growable.scala:62) ~[scala-library-2.12.17.jar:?] at scala.collection.generic.Growable.$plus$plus$eq$(Growable.scala:53) ~[scala-library-2.12.17.jar:?] at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:105) ~[scala-library-2.12.17.jar:?] at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:49) ~[scala-library-2.12.17.jar:?] at scala.collection.TraversableOnce.to(TraversableOnce.scala:366) ~[scala-library-2.12.17.jar:?] at scala.collection.TraversableOnce.to$(TraversableOnce.scala:364) ~[scala-library-2.12.17.jar:?] at scala.collection.AbstractIterator.to(Iterator.scala:1431) ~[scala-library-2.12.17.jar:?] at scala.collection.TraversableOnce.toBuffer(TraversableOnce.scala:358) ~[scala-library-2.12.17.jar:?] at scala.collection.TraversableOnce.toBuffer$(TraversableOnce.scala:358) ~[scala-library-2.12.17.jar:?] at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1431) ~[scala-library-2.12.17.jar:?] at scala.collection.TraversableOnce.toArray(TraversableOnce.scala:345) ~[scala-library-2.12.17.jar:?] at scala.collection.TraversableOnce.toArray$(TraversableOnce.scala:339) ~[scala-library-2.12.17.jar:?] 
at scala.collection.AbstractIterator.toArray(Iterator.scala:1431) ~[scala-library-2.12.17.jar:?] at org.apache.spark.rdd.RDD.$anonfun$collect$2(RDD.scala:1046) ~[spark-core_2.12-3.5.0-amzn-0.jar:3.5.0-amzn-0] at org.apache.spark.SparkContext.$anonfun$runJob$5(SparkContext.scala:2446) ~[spark-core_2.12-3.5.0-amzn-0.jar:3.5.0-amzn-0] at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:93) ~[spark-core_2.12-3.5.0-amzn-0.jar:3.5.0-amzn-0] at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161) ~[spark-core_2.12-3.5.0-amzn-0.jar:3.5.0-amzn-0] at org.apache.spark.scheduler.Task.run(Task.scala:143) ~[spark-core_2.12-3.5.0-amzn-0.jar:3.5.0-amzn-0] at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:629) ~[spark-core_2.12-3.5.0-amzn-0.jar:3.5.0-amzn-0] at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:64) ~[spark-common-utils_2.12-3.5.0-amzn-0.jar:3.5.0-amzn-0] at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:61) ~[spark-common-utils_2.12-3.5.0-amzn-0.jar:3.5.0-amzn-0] at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:95) ~[spark-core_2.12-3.5.0-amzn-0.jar:3.5.0-amzn-0] at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:632) ~[spark-core_2.12-3.5.0-amzn-0.jar:3.5.0-amzn-0] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) ~[?:?] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) ~[?:?] at java.lang.Thread.run(Thread.java:840) ~[?:?] 24/02/15 12:43:24 INFO ExecuteGrpcResponseSender: Stream finished for opId=28cf34dd-febc-4944-b727-e6dff0849635, sent all responses up to last index 0. totalTime=10955578375ns waitingForResults=10955491381ns waitingForSend=0ns 24/02/15 12:43:24 INFO SparkConnectExecutionManager: ExecuteHolder ExecuteKey(na,66f20158-e2df-4941-b6f4-4565c534143b,28cf34dd-febc-4944-b727-e6dff0849635) is removed. 24/02/15 12:43:24 INFO ExecuteResponseObserver: Release all for opId=28cf34dd-febc-4944-b727-e6dff0849635. Execution stats: total=CachedSize(0,0) autoRemoved=CachedSize(0,0) cachedUntilConsumed=CachedSize(0,0) cachedUntilProduced=CachedSize(0,0) maxCachedUntilConsumed=CachedSize(0,0) maxCachedUntilProduced=CachedSize(0,0) 24/02/15 12:43:24 INFO SparkConnectExecutionManager: ExecuteHolder ExecuteKey(na,66f20158-e2df-4941-b6f4-4565c534143b,c3ec3331-c542-4541-8d73-1d71d3c7e3a3) is created. 24/02/15 12:43:24 INFO ExecuteGrpcResponseSender: Starting for opId=c3ec3331-c542-4541-8d73-1d71d3c7e3a3, reattachable=false, lastConsumedStreamIndex=0 24/02/15 12:43:24 INFO ExecuteGrpcResponseSender: Stream finished for opId=c3ec3331-c542-4541-8d73-1d71d3c7e3a3, sent all responses up to last index 2. totalTime=160844370ns waitingForResults=158793210ns waitingForSend=0ns 24/02/15 12:43:24 INFO SparkConnectExecutionManager: ExecuteHolder ExecuteKey(na,66f20158-e2df-4941-b6f4-4565c534143b,c3ec3331-c542-4541-8d73-1d71d3c7e3a3) is removed. 24/02/15 12:43:24 INFO ExecuteResponseObserver: Release all for opId=c3ec3331-c542-4541-8d73-1d71d3c7e3a3. Execution stats: total=CachedSize(424,2) autoRemoved=CachedSize(147,1) cachedUntilConsumed=CachedSize(0,0) cachedUntilProduced=CachedSize(0,0) maxCachedUntilConsumed=CachedSize(424,2) maxCachedUntilProduced=CachedSize(424,2) 24/02/15 12:43:24 INFO SparkConnectExecutionManager: ExecuteHolder ExecuteKey(na,66f20158-e2df-4941-b6f4-4565c534143b,82079c9e-3ef5-485c-9b0c-2fe71d117eb6) is created. 
24/02/15 12:43:24 INFO ExecuteGrpcResponseSender: Starting for opId=82079c9e-3ef5-485c-9b0c-2fe71d117eb6, reattachable=false, lastConsumedStreamIndex=0 24/02/15 12:43:24 INFO CodeGenerator: Code generated in 42.971768 ms 24/02/15 12:43:24 INFO ExecuteGrpcResponseSender: Stream finished for opId=82079c9e-3ef5-485c-9b0c-2fe71d117eb6, sent all responses up to last index 3. totalTime=404638217ns waitingForResults=403497209ns waitingForSend=0ns 24/02/15 12:43:24 INFO SparkConnectExecutionManager: ExecuteHolder ExecuteKey(na,66f20158-e2df-4941-b6f4-4565c534143b,82079c9e-3ef5-485c-9b0c-2fe71d117eb6) is removed. 24/02/15 12:43:24 INFO ExecuteResponseObserver: Release all for opId=82079c9e-3ef5-485c-9b0c-2fe71d117eb6. Execution stats: total=CachedSize(1243,3) autoRemoved=CachedSize(1060,2) cachedUntilConsumed=CachedSize(0,0) cachedUntilProduced=CachedSize(0,0) maxCachedUntilConsumed=CachedSize(1105,2) maxCachedUntilProduced=CachedSize(1105,2) 24/02/15 12:43:36 INFO SparkConnectExecutionManager: Started periodic run of SparkConnectExecutionManager maintenance. 24/02/15 12:43:36 INFO SparkConnectExecutionManager: Finished periodic run of SparkConnectExecutionManager maintenance. 24/02/15 12:44:06 INFO SparkConnectExecutionManager: Started periodic run of SparkConnectExecutionManager maintenance. 24/02/15 12:44:06 INFO SparkConnectExecutionManager: Finished periodic run of SparkConnectExecutionManager maintenance.

...

Request for Assistance:

I would appreciate help from the community in identifying and resolving this issue. Create table and select queries work as expected, but insert, update, and upsert queries fail.

ad1happy2go commented 6 months ago

@Shekkylar This looks like a library conflict. Adding @CTTY in case he has any insights here.
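
One quick way to see whether more than one copy of the Hudi bundle ends up on the classpath (paths taken from the start command above; exact locations can vary by EMR release):

# List the Hudi bundles visible to Spark on the EMR node; two different copies
# loaded by different classloaders can produce exactly this kind of ClassCastException
ls -l /usr/lib/hudi/*.jar
ls -l /usr/lib/spark/jars/ | grep -i hudi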

Shekkylar commented 6 months ago

@ad1happy2go Can you provide suggestions for resolving a library conflict?

ad1happy2go commented 6 months ago

@Shekkylar Since the Hudi OSS releases do not yet support Spark 3.5, I suggest using Spark 3.4 together with an OSS Hudi 0.14.x release.
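
A sketch of what the adjusted server start could look like on a Spark 3.4 cluster, pulling the OSS bundle via --packages instead of the EMR jar (the spark-connect and hudi-spark3.4-bundle versions here are examples; check Maven Central for the exact releases):

cd /usr/lib/spark
./sbin/start-connect-server.sh \
  --packages org.apache.spark:spark-connect_2.12:3.4.1,org.apache.hudi:hudi-spark3.4-bundle_2.12:0.14.1 \
  --conf "spark.serializer=org.apache.spark.serializer.KryoSerializer" \
  --conf "spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog" \
  --conf "spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension"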

ad1happy2go commented 3 months ago

@Shekkylar Until Hudi 0.15 is released, you can use the branch https://github.com/apache/hudi/tree/release-0.14.1-spark35-scala213, which adds Spark 3.5 support on top of 0.14.1.
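
A rough sketch of building that branch and pointing the Connect server at the resulting bundle (the Maven profile flag and output path are assumptions; check the branch's build instructions for the exact options):

git clone --branch release-0.14.1-spark35-scala213 https://github.com/apache/hudi.git
cd hudi
# Build the Spark 3.5 bundle for Scala 2.12 to match the cluster shown in the logs above
mvn clean package -DskipTests -Dspark3.5
# Then restart the Connect server with the built jar, e.g.
# --jars packaging/hudi-spark-bundle/target/hudi-spark3.5-bundle_2.12-0.14.1.jar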