elastic / elasticsearch-hadoop

Elasticsearch real-time search and analytics natively integrated with Hadoop
https://www.elastic.co/products/hadoop
Apache License 2.0

ES-hadoop is not compatible with spark 3.5.1 #2210

Open · edward-capriolo-db opened this issue 3 months ago

edward-capriolo-db commented 3 months ago

What kind of issue is this?

Issue description

Spark 3.5.1 changed some of the encoder code in Catalyst (RowEncoder.apply was removed in favor of RowEncoder.encoderFor), which breaks a number of applications built against older versions of Spark.

Steps to reproduce

Code:

es.writeStream().... 
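For context, here is a minimal sketch of the kind of streaming write that hits this code path (not the reporter's exact job; the rate source, index name, and checkpoint path are all illustrative):

import org.apache.spark.sql.SparkSession

// Minimal sketch: any Structured Streaming write through the "es" sink
// runs through EsStreamQueryWriter on the executors, which is where the
// NoSuchMethodError below is thrown on Spark 3.5.1.
object EsStreamRepro {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("es-stream-repro").getOrCreate()

    val events = spark.readStream
      .format("rate")                 // built-in test source emitting timestamp/value rows
      .option("rowsPerSecond", "1")
      .load()

    events.writeStream
      .format("es")                   // elasticsearch-spark streaming sink
      .option("checkpointLocation", "/tmp/es-repro-checkpoint") // any writable path
      .start("spark-repro-index")     // hypothetical target index
      .awaitTermination()
  }
}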

Stack trace:

2024-04-01 21:49:13 ERROR streaming.MicroBatchExecution:97 - Query reconquery [id = 4ead2d05-8e7f-4d9f-bbd2-9153441d2cb5, runId = dfabec28-8824-46d1-b573-5a49b5352ccd] terminated with error
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 8) (lonasworkd1.uk.db.com executor 2): java.lang.NoSuchMethodError: org.apache.spark.sql.catalyst.encoders.RowEncoder$.apply(Lorg/apache/spark/sql/types/StructType;)Lorg/apache/spark/sql/catalyst/encoders/ExpressionEncoder;
    at org.elasticsearch.spark.sql.streaming.EsStreamQueryWriter.<init>(EsStreamQueryWriter.scala:50)
    at org.elasticsearch.spark.sql.streaming.EsSparkSqlStreamingSink.$anonfun$addBatch$5(EsSparkSqlStreamingSink.scala:72)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:93)
    at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:166)
    at org.apache.spark.scheduler.Task.run(Task.scala:141)
    at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:620)
    at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:64)
    at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:61)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:94)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:623)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:750)

Version Info

OS: Linux
JVM: JDK 8/11
Hadoop/Spark: Spark 3.5.1
ES-Hadoop:

    <dependency>
        <groupId>org.elasticsearch</groupId>
        <artifactId>elasticsearch-spark-30_${scala.version}</artifactId>
        <version>8.13.0</version>
    </dependency>

ES: 7.x latest.


edward-capriolo-db commented 3 months ago

Many projects have hit similar issues from this Spark API change:

https://stackoverflow.com/questions/77797606/rowencoder-apply-and-rowencoder-encoderfor-methods-in-spark-catalyst-package
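For reference, a sketch of the source-level change, assuming the Catalyst API described in that question (the schema and field name are illustrative):

import org.apache.spark.sql.Row
import org.apache.spark.sql.catalyst.encoders.{ExpressionEncoder, RowEncoder}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val schema = StructType(Seq(StructField("name", StringType)))

// Spark <= 3.4: RowEncoder.apply returns an ExpressionEncoder[Row] directly.
// val encoder: ExpressionEncoder[Row] = RowEncoder(schema)

// Spark 3.5+: apply is gone; encoderFor returns an AgnosticEncoder[Row]
// that must be wrapped explicitly.
val encoder: ExpressionEncoder[Row] = ExpressionEncoder(RowEncoder.encoderFor(schema))

Code compiled against the old signature fails at link time with the NoSuchMethodError shown above.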

masseyke commented 3 months ago

Thanks for the report!

edward-capriolo-db commented 3 months ago

[INFO] |  |  +- org.apache.spark:spark-network-common_2.12:jar:3.5.1:compile
[INFO] |  |  |  \- com.google.crypto.tink:tink:jar:1.9.0:compile
[INFO] |  |  |     +- com.google.code.gson:gson:jar:2.8.9:compile
[INFO] |  |  |     +- com.google.protobuf:protobuf-java:jar:3.19.6:compile
[INFO] |  |  |     \- joda-time:joda-time:jar:2.12.5:compile

Also, the dependencies bring in an old protobuf that sets off OSS vulnerability scanning:

[INFO] +- org.elasticsearch:elasticsearch-spark-30_2.12:jar:8.13.0:compile
[INFO] |  +- org.scala-lang:scala-reflect:jar:2.12.17:compile
[INFO] |  +- commons-logging:commons-logging:jar:1.1.1:compile
[INFO] |  +- javax.xml.bind:jaxb-api:jar:2.3.1:runtime
[INFO] |  \- com.google.protobuf:protobuf-java:jar:2.5.0:compile
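On the consumer side, one possible mitigation for the scanner finding (assuming nothing on the runtime classpath actually needs protobuf 2.5.0; an upstream fix is a separate matter) is a Maven exclusion like this:

    <dependency>
        <groupId>org.elasticsearch</groupId>
        <artifactId>elasticsearch-spark-30_2.12</artifactId>
        <version>8.13.0</version>
        <exclusions>
            <!-- drop the transitive protobuf-java 2.5.0 flagged by the scanner -->
            <exclusion>
                <groupId>com.google.protobuf</groupId>
                <artifactId>protobuf-java</artifactId>
            </exclusion>
        </exclusions>
    </dependency>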

masseyke commented 2 months ago

Upgrading to Spark 3.5 is going to be tricky because of compiler errors like these, caused by a breaking change in the Spark API:

[Error] /Users/kmassey/workspace/elasticsearch-hadoop/spark/core/src/main/scala/org/elasticsearch/spark/package.scala:34:42: Symbol 'type org.apache.spark.internal.Logging' is missing from the classpath.
This symbol is required by 'class org.apache.spark.SparkContext'.
Make sure that type Logging is in your classpath and check for conflicting dependencies with `-Ylog-classpath`.
A full rebuild may help if 'SparkContext.class' was compiled against an incompatible version of org.apache.spark.internal.
[Error] /Users/kmassey/workspace/elasticsearch-hadoop/spark/core/src/main/scala/org/elasticsearch/spark/rdd/EsSpark.scala:25:8: Symbol 'type org.apache.spark.internal.Logging' is missing from the classpath.
This symbol is required by 'class org.apache.spark.rdd.RDD'.
Make sure that type Logging is in your classpath and check for conflicting dependencies with `-Ylog-classpath`.
A full rebuild may help if 'RDD.class' was compiled against an incompatible version of org.apache.spark.internal.
[Error] /Users/kmassey/workspace/elasticsearch-hadoop/spark/core/src/main/scala/org/elasticsearch/spark/cfg/SparkSettingsManager.java:21:8: Symbol 'type org.apache.spark.internal.Logging' is missing from the classpath.
This symbol is required by 'class org.apache.spark.SparkConf'.
Make sure that type Logging is in your classpath and check for conflicting dependencies with `-Ylog-classpath`.
A full rebuild may help if 'SparkConf.class' was compiled against an incompatible version of org.apache.spark.internal.
three errors found

I think we'll have to move several more classes from our spark core package down into the various spark-version-specific packages.

edward-capriolo-db commented 2 months ago

These are unavoidable. Previously, in Hive, we built "shim layers" and used reflection to deal with breaking API changes. I will look into at least getting it working, and then we can see what the change set is.
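A rough sketch of that idea, assuming a reflection-based shim around the encoder call (all names here are illustrative, not the actual elasticsearch-hadoop fix):

import org.apache.spark.sql.Row
import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder
import org.apache.spark.sql.types.StructType

// Hypothetical shim: resolve a Row encoder reflectively so one binary can
// run on both pre-3.5 Spark (RowEncoder.apply) and 3.5+ (RowEncoder.encoderFor).
object RowEncoderShim {
  private val rowEncoderCls = Class.forName("org.apache.spark.sql.catalyst.encoders.RowEncoder$")
  private val rowEncoderObj = rowEncoderCls.getField("MODULE$").get(null)

  def apply(schema: StructType): ExpressionEncoder[Row] =
    try {
      // Spark <= 3.4: RowEncoder.apply(schema) returns an ExpressionEncoder[Row].
      rowEncoderCls.getMethod("apply", classOf[StructType])
        .invoke(rowEncoderObj, schema)
        .asInstanceOf[ExpressionEncoder[Row]]
    } catch {
      case _: NoSuchMethodException =>
        // Spark 3.5+: encoderFor(schema) returns an AgnosticEncoder[Row],
        // which still has to be wrapped in an ExpressionEncoder.
        val agnostic = rowEncoderCls.getMethod("encoderFor", classOf[StructType])
          .invoke(rowEncoderObj, schema)
        val agnosticCls = Class.forName("org.apache.spark.sql.catalyst.encoders.AgnosticEncoder")
        val exprEncoderCls = Class.forName("org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$")
        val exprEncoderObj = exprEncoderCls.getField("MODULE$").get(null)
        exprEncoderCls.getMethod("apply", agnosticCls)
          .invoke(exprEncoderObj, agnostic)
          .asInstanceOf[ExpressionEncoder[Row]]
    }
}

A call site like EsStreamQueryWriter could then go through RowEncoderShim(schema) instead of linking against RowEncoder.apply directly, deferring the choice to runtime.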

ps-rterzman commented 2 weeks ago

Are there any updates on this?

masseyke commented 2 weeks ago

We recently added support for 3.4.3, but we have not dealt with the big changes in 3.5 yet.