apache / hudi

Upserts, Deletes And Incremental Processing on Big Data.
https://hudi.apache.org/
Apache License 2.0

[SUPPORT] java.lang.NoSuchMethodError while launching Spark 2.3.1 in stand alone cluster mode. #3635

Closed: deshpandeanoop closed this issue 3 years ago

deshpandeanoop commented 3 years ago

Hi Team,

I'm getting a java.lang.NoSuchMethodError when I launch the Spark application in standalone mode. Exception trace:

Exception in thread "main" java.lang.NoSuchMethodError: org.apache.avro.Schema.createUnion([Lorg/apache/avro/Schema;)Lorg/apache/avro/Schema;
    at org.apache.hudi.spark.org.apache.spark.sql.avro.SchemaConverters$.toAvroType(SchemaConverters.scala:185)
    at org.apache.hudi.spark.org.apache.spark.sql.avro.SchemaConverters$$anonfun$5.apply(SchemaConverters.scala:176)
    at org.apache.hudi.spark.org.apache.spark.sql.avro.SchemaConverters$$anonfun$5.apply(SchemaConverters.scala:174)
    at scala.collection.Iterator$class.foreach(Iterator.scala:893)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
    at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
    at org.apache.spark.sql.types.StructType.foreach(StructType.scala:99)
    at org.apache.hudi.spark.org.apache.spark.sql.avro.SchemaConverters$.toAvroType(SchemaConverters.scala:174)
    at org.apache.hudi.AvroConversionUtils$.convertStructTypeToAvroSchema(AvroConversionUtils.scala:52)
    at org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:139)
    at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:134)
    at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:86)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
    at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
    at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
    at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
    at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:654)
    at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:654)
    at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)
    at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:654)
    at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:273)
    at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:267)
    at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:225)
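
If I read the trace right, the failing call is the varargs overload Schema.createUnion(Schema...), which (as far as I know) does not exist in the Avro 1.7.7 that Spark 2.3.x bundles; that release only has createUnion(List&lt;Schema&gt;). The error therefore usually means the old Avro is winning on the classpath. A minimal Scala sketch (the object name is made up) to check which jar actually supplies the class at runtime:

    import org.apache.avro.Schema

    // Hypothetical diagnostic: print the jar that the JVM actually loaded
    // org.apache.avro.Schema from, to spot a classpath conflict.
    object AvroClasspathCheck {
      def main(args: Array[String]): Unit = {
        val src = Option(classOf[Schema].getProtectionDomain.getCodeSource)
        val location = src.map(_.getLocation.toString).getOrElse("bootstrap classpath")
        println(s"org.apache.avro.Schema loaded from: $location")
      }
    }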

Below is my build.sbt file:

    scalaVersion := "2.11.8"

    libraryDependencies += ("org.apache.spark" % "spark-core_2.11" % "2.3.1" % "provided")
      .exclude("org.apache.avro", "avro")
      .exclude("org.apache.avro", "avro-ipc")
      .exclude("org.apache.avro", "avro-mapred")

    libraryDependencies += ("org.apache.spark" % "spark-sql_2.11" % "2.3.1" % "provided")
      .exclude("org.apache.avro", "avro")

    libraryDependencies += "org.apache.hudi" % "hudi-spark-bundle_2.11" % "0.7.0"

    libraryDependencies += "org.apache.spark" %% "spark-avro" % "2.4.4"

    libraryDependencies += ("com.typesafe.play" %% "play-json" % "2.4.0-M3")
      .exclude("org.slf4j", "slf4j-api")
      .exclude("org.slf4j", "slf4j-log4j12")
      .exclude("org.slf4j", "jcl-over-slf4j")
      .exclude("io.netty", "netty-all")

Spark submit command:

spark-submit --master local --deploy-mode client \
    --jars <base-dir>/avro-1.8.2.jar \
    --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
    --class "com.explore.hudi.HudiServiceMainJob" \
    ApacheHudiService.jar

Versions of the software installed on my system:

Just to add: I'm reading a CSV extract and creating a Hudi table from it. Below is the sample code that gets executed when my Spark application launches.

    import org.apache.hudi.{DataSourceWriteOptions, QuickstartUtils}
    import org.apache.hudi.config.HoodieWriteConfig
    import org.apache.spark.sql.SaveMode
    import sparkSession.implicits._ // Encoder for the case class below

    // Defined at top level so Spark can derive an Encoder for it
    case class LibraryCheckoutInfo(bibNumber: String, itemBarcode: String,
        itemType: String, collection: String, callNumber: String)

    sparkSession
      .read
      .csv(inputCsvAbsPath)
      .map(row => LibraryCheckoutInfo(
        bibNumber = row.getString(0),
        itemBarcode = row.getString(1),
        itemType = row.getString(2),
        collection = row.getString(3),
        callNumber = row.getString(4)))
      .write
      .format(AppConstants.SPARK_FORMAT_HUDI) // app constant, e.g. "hudi"
      .options(QuickstartUtils.getQuickstartWriteConfigs)
      .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "bibNumber")
      .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, "itemType")
      .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, "collection")
      .option(HoodieWriteConfig.TABLE_NAME, hudiTableName)
      .mode(SaveMode.Overwrite)
      .save(hudiTableBasePath)

yanghua commented 3 years ago

Could you please use a Spark version of 2.4 or greater?

deshpandeanoop commented 3 years ago

@yanghua: Is there any way to get this working with Spark version 2.3.1? Currently I'm launching on my local machine for testing purposes. Once my development is done, I will be launching it on our on-prem cluster (shared by multiple teams), which has Spark 2.3.1 installed and which we have no control over upgrading.

yanghua commented 3 years ago

Hi @vinothchandar, can you chime in to answer this question?

deshpandeanoop commented 3 years ago

@vinothchandar : A gentle reminder to help us out here :)

xushiyan commented 3 years ago

@deshpandeanoop I noticed that you used "spark-avro" % "2.4.4", which does not match your main Spark version, 2.3.1.

More importantly, it was discussed a long time back that we only support Spark 2.4+:

https://lists.apache.org/thread.html/r19ec206e33f8b63e95a840bfba519ab89d1b7af790adef0bc369d618%40%3Cdev.hudi.apache.org%3E

I strongly suggest you make the effort to upgrade Spark in some way. There are important Avro updates in the newer versions; Spark 2.3.1 uses Avro 1.7.7, which can be very problematic: https://github.com/apache/spark/blob/30aaa5a3a1076ca52439a905274b1fcf498bc562/pom.xml#L142
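
For reference, a minimal build.sbt sketch along those lines, assuming an upgrade to Spark 2.4.4 (the versions here are illustrative; align spark-avro with whatever Spark version you end up on):

    scalaVersion := "2.11.12"

    // Spark 2.4.x bundles Avro 1.8.x, so the explicit avro excludes
    // from the original build.sbt should no longer be necessary.
    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core" % "2.4.4" % "provided",
      "org.apache.spark" %% "spark-sql"  % "2.4.4" % "provided",
      "org.apache.spark" %% "spark-avro" % "2.4.4", // must match the Spark version
      "org.apache.hudi"  %% "hudi-spark-bundle" % "0.7.0"
    )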