elastic / elasticsearch-hadoop

Elasticsearch real-time search and analytics natively integrated with Hadoop
https://www.elastic.co/products/hadoop
Apache License 2.0

ES-hadoop is not compatible with spark 3.5.1 #2210

Open · edward-capriolo-db opened this issue 3 months ago

edward-capriolo-db commented 3 months ago

What kind of issue is this?

Issue description

Spark 3.5.1 changed some of the encoder code in Catalyst (RowEncoder.apply was removed in favor of RowEncoder.encoderFor), which breaks a number of applications built against older versions of Spark.

Steps to reproduce

Code:

es.writeStream().... 
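For context, here is a minimal sketch of the kind of streaming write that hits this code path (not the reporter's exact job; the rate source, index name, and checkpoint path are all illustrative):

import org.apache.spark.sql.SparkSession

// Minimal sketch: any Structured Streaming write through the "es" sink
// runs through EsStreamQueryWriter on the executors, which is where the
// NoSuchMethodError below is thrown on Spark 3.5.1.
object EsStreamRepro {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("es-stream-repro").getOrCreate()

    val events = spark.readStream
      .format("rate")                 // built-in test source emitting timestamp/value rows
      .option("rowsPerSecond", "1")
      .load()

    events.writeStream
      .format("es")                   // elasticsearch-spark streaming sink
      .option("checkpointLocation", "/tmp/es-repro-checkpoint") // any writable path
      .start("spark-repro-index")     // hypothetical target index
      .awaitTermination()
  }
}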

Stack trace:

2024-04-01 21:49:13 ERROR streaming.MicroBatchExecution:97 - Query reconquery [id = 4ead2d05-8e7f-4d9f-bbd2-9153441d2cb5, runId = dfabec28-8824-46d1-b573-5a49b5352ccd] terminated with error
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 8) (lonasworkd1.uk.db.com executor 2): java.lang.NoSuchMethodError: org.apache.spark.sql.catalyst.encoders.RowEncoder$.apply(Lorg/apache/spark/sql/types/StructType;)Lorg/apache/spark/sql/catalyst/encoders/ExpressionEncoder;
    at org.elasticsearch.spark.sql.streaming.EsStreamQueryWriter.<init>(EsStreamQueryWriter.scala:50)
    at org.elasticsearch.spark.sql.streaming.EsSparkSqlStreamingSink.$anonfun$addBatch$5(EsSparkSqlStreamingSink.scala:72)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:93)
    at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:166)
    at org.apache.spark.scheduler.Task.run(Task.scala:141)
    at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:620)
    at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:64)
    at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:61)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:94)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:623)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:750)

Version Info

OS: Linux
JVM: JDK 8/11
Hadoop/Spark: Spark 3.5.1
ES-Hadoop:

    <dependency>
        <groupId>org.elasticsearch</groupId>
        <artifactId>elasticsearch-spark-30_${scala.version}</artifactId>
        <version>8.13.0</version>
    </dependency>

ES: 7.x latest.


edward-capriolo-db commented 3 months ago

Many projects have hit similar issues from this Spark API change:

https://stackoverflow.com/questions/77797606/rowencoder-apply-and-rowencoder-encoderfor-methods-in-spark-catalyst-package
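For reference, a sketch of the source-level change, assuming the Catalyst API described in that question (the schema and field name are illustrative):

import org.apache.spark.sql.Row
import org.apache.spark.sql.catalyst.encoders.{ExpressionEncoder, RowEncoder}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val schema = StructType(Seq(StructField("name", StringType)))

// Spark <= 3.4: RowEncoder.apply returns an ExpressionEncoder[Row] directly.
// val encoder: ExpressionEncoder[Row] = RowEncoder(schema)

// Spark 3.5+: apply is gone; encoderFor returns an AgnosticEncoder[Row]
// that must be wrapped explicitly.
val encoder: ExpressionEncoder[Row] = ExpressionEncoder(RowEncoder.encoderFor(schema))

Code compiled against the old signature fails at link time with the NoSuchMethodError shown above.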

masseyke commented 3 months ago

Thanks for the report!

edward-capriolo-db commented 3 months ago

[INFO] |  |  +- org.apache.spark:spark-network-common_2.12:jar:3.5.1:compile
[INFO] |  |  |  \- com.google.crypto.tink:tink:jar:1.9.0:compile
[INFO] |  |  |     +- com.google.code.gson:gson:jar:2.8.9:compile
[INFO] |  |  |     +- com.google.protobuf:protobuf-java:jar:3.19.6:compile
[INFO] |  |  |     \- joda-time:joda-time:jar:2.12.5:compile

Also, the dependencies bring in an old protobuf that sets off OSS vulnerability scanning:

[INFO] +- org.elasticsearch:elasticsearch-spark-30_2.12:jar:8.13.0:compile
[INFO] |  +- org.scala-lang:scala-reflect:jar:2.12.17:compile
[INFO] |  +- commons-logging:commons-logging:jar:1.1.1:compile
[INFO] |  +- javax.xml.bind:jaxb-api:jar:2.3.1:runtime
[INFO] |  \- com.google.protobuf:protobuf-java:jar:2.5.0:compile
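On the consumer side, one possible mitigation for the scanner finding (assuming nothing on the runtime classpath actually needs protobuf 2.5.0; an upstream fix is a separate matter) is a Maven exclusion like this:

    <dependency>
        <groupId>org.elasticsearch</groupId>
        <artifactId>elasticsearch-spark-30_2.12</artifactId>
        <version>8.13.0</version>
        <exclusions>
            <!-- drop the transitive protobuf-java 2.5.0 flagged by the scanner -->
            <exclusion>
                <groupId>com.google.protobuf</groupId>
                <artifactId>protobuf-java</artifactId>
            </exclusion>
        </exclusions>
    </dependency>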

masseyke commented 2 months ago

Upgrading to Spark 3.5 is going to be tricky because of compiler errors like these, caused by a breaking change in the Spark API:

[Error] /Users/kmassey/workspace/elasticsearch-hadoop/spark/core/src/main/scala/org/elasticsearch/spark/package.scala:34:42: Symbol 'type org.apache.spark.internal.Logging' is missing from the classpath.
This symbol is required by 'class org.apache.spark.SparkContext'.
Make sure that type Logging is in your classpath and check for conflicting dependencies with `-Ylog-classpath`.
A full rebuild may help if 'SparkContext.class' was compiled against an incompatible version of org.apache.spark.internal.
[Error] /Users/kmassey/workspace/elasticsearch-hadoop/spark/core/src/main/scala/org/elasticsearch/spark/rdd/EsSpark.scala:25:8: Symbol 'type org.apache.spark.internal.Logging' is missing from the classpath.
This symbol is required by 'class org.apache.spark.rdd.RDD'.
Make sure that type Logging is in your classpath and check for conflicting dependencies with `-Ylog-classpath`.
A full rebuild may help if 'RDD.class' was compiled against an incompatible version of org.apache.spark.internal.
[Error] /Users/kmassey/workspace/elasticsearch-hadoop/spark/core/src/main/scala/org/elasticsearch/spark/cfg/SparkSettingsManager.java:21:8: Symbol 'type org.apache.spark.internal.Logging' is missing from the classpath.
This symbol is required by 'class org.apache.spark.SparkConf'.
Make sure that type Logging is in your classpath and check for conflicting dependencies with `-Ylog-classpath`.
A full rebuild may help if 'SparkConf.class' was compiled against an incompatible version of org.apache.spark.internal.
three errors found

I think we'll have to move several more classes from our spark core package down into the various spark-version-specific packages.

edward-capriolo-db commented 2 months ago

These are unavoidable. Previously, in Hive, we built "shim layers" and used reflection to deal with breaking API changes. I will look into at least getting it working, and then we can see what the change set is.
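A rough sketch of that idea, assuming a reflection-based shim around the encoder call (all names here are illustrative, not the actual elasticsearch-hadoop fix):

import org.apache.spark.sql.Row
import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder
import org.apache.spark.sql.types.StructType

// Hypothetical shim: resolve a Row encoder reflectively so one binary can
// run on both pre-3.5 Spark (RowEncoder.apply) and 3.5+ (RowEncoder.encoderFor).
object RowEncoderShim {
  private val rowEncoderCls = Class.forName("org.apache.spark.sql.catalyst.encoders.RowEncoder$")
  private val rowEncoderObj = rowEncoderCls.getField("MODULE$").get(null)

  def apply(schema: StructType): ExpressionEncoder[Row] =
    try {
      // Spark <= 3.4: RowEncoder.apply(schema) returns an ExpressionEncoder[Row].
      rowEncoderCls.getMethod("apply", classOf[StructType])
        .invoke(rowEncoderObj, schema)
        .asInstanceOf[ExpressionEncoder[Row]]
    } catch {
      case _: NoSuchMethodException =>
        // Spark 3.5+: encoderFor(schema) returns an AgnosticEncoder[Row],
        // which still has to be wrapped in an ExpressionEncoder.
        val agnostic = rowEncoderCls.getMethod("encoderFor", classOf[StructType])
          .invoke(rowEncoderObj, schema)
        val agnosticCls = Class.forName("org.apache.spark.sql.catalyst.encoders.AgnosticEncoder")
        val exprEncoderCls = Class.forName("org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$")
        val exprEncoderObj = exprEncoderCls.getField("MODULE$").get(null)
        exprEncoderCls.getMethod("apply", agnosticCls)
          .invoke(exprEncoderObj, agnostic)
          .asInstanceOf[ExpressionEncoder[Row]]
    }
}

A call site like EsStreamQueryWriter could then go through RowEncoderShim(schema) instead of linking against RowEncoder.apply directly, deferring the choice to runtime.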

ps-rterzman commented 2 weeks ago

Are there any updates on this?

masseyke commented 2 weeks ago

We recently added support for 3.4.3, but we have not dealt with the big changes in 3.5 yet.