Kotlin / kotlin-spark-api

This project gives Kotlin bindings and several extensions for Apache Spark. We are looking to have this as a part of Apache Spark 3.x

Questions regarding `java.lang.NoSuchMethodError` #184

Open hawkaa opened 1 year ago

hawkaa commented 1 year ago

Hi @Jolanrensen ,

I'm sorry to toss these issues/questions at you without being able to provide much assistance myself. I really appreciate you taking the time to help out.

We're in the process of updating our code to Spark 3.3.0, but we're seeing some errors. We're not experiencing them locally in our unit tests, only in Databricks on their runtime 11.2. I can confirm that that runtime uses Spark 3.3.0 and Scala 2.12, which is also what we use, together with org.jetbrains.kotlinx.spark:kotlin-spark-api_3.3.0_2.12:1.2.1. My theory is that Databricks doesn't use the open-source version of Spark, hence we get the following error message:

java.lang.NoSuchMethodError: 'org.apache.spark.sql.catalyst.expressions.Expression org.apache.spark.sql.catalyst.DeserializerBuildHelper$.deserializerForWithNullSafetyAndUpcast(org.apache.spark.sql.catalyst.expressions.Expression, org.apache.spark.sql.types.DataType, boolean, org.apache.spark.sql.catalyst.WalkedTypePath, scala.Function2)'
    at org.apache.spark.sql.KotlinReflection$.deserializerFor(KotlinReflection.scala:671)
    at org.apache.spark.sql.KotlinReflection.deserializerFor(KotlinReflection.scala)
    at org.jetbrains.kotlinx.spark.api.EncodingKt.kotlinClassEncoder(Encoding.kt:169)
    at org.jetbrains.kotlinx.spark.api.EncodingKt.generateEncoder(Encoding.kt:144)
    at xyz.dune.arrakis.spark.AutoBackfillingJoin.streamJoin(AutoBackfillingJoin.kt:533)
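
For reference, our dependency setup looks roughly like this (simplified Gradle Kotlin DSL sketch; the `compileOnly` line is an assumption about how Spark is provided by the runtime):

```kotlin
// build.gradle.kts (sketch) — the artifact name encodes the Spark and Scala
// versions it was built against: kotlin-spark-api_<spark>_<scala>.
dependencies {
    implementation("org.jetbrains.kotlinx.spark:kotlin-spark-api_3.3.0_2.12:1.2.1")
    // Spark itself comes from the Databricks runtime, so it is compile-only here.
    compileOnly("org.apache.spark:spark-sql_2.12:3.3.0")
}
```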

Any thoughts on whether that theory might be correct? As far as I know, these errors are normally a consequence of mismatched versions. Could it be that the Databricks runtime has some custom Spark code where deserializerForWithNullSafetyAndUpcast doesn't exist or has a different signature?
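
I guess one way for us to test that directly on the cluster would be a reflective lookup (a rough diagnostic sketch, using only the class and method names from the stack trace):

```kotlin
// Diagnostic sketch: list the signatures the cluster actually has for the
// method the error says is missing, then compare with the error message.
fun main() {
    val helper = Class.forName("org.apache.spark.sql.catalyst.DeserializerBuildHelper\$")
    helper.methods
        .filter { it.name == "deserializerForWithNullSafetyAndUpcast" }
        .forEach { println(it) } // no output would mean the method is gone or renamed
}
```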

Thanks again. Much appreciated!

Jolanrensen commented 1 year ago

That theory might definitely be correct. I've heard about Databricks storing Datasets differently as well (stricter than Spark itself in some cases). Since we replace some code inside the Spark library itself, we need an exact match to be able to call that code. I don't know much about Databricks, though. Do they have a published package of their Spark version? Or can you somehow specify which Spark package to use?
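
For context, even a simple Dataset creation goes through that replaced code. Something like this (a minimal sketch; `Person` is just an example class) is enough to hit the encoder path from your stack trace:

```kotlin
import org.jetbrains.kotlinx.spark.api.*

data class Person(val name: String, val age: Int)

fun main() = withSpark {
    // dsOf() builds a Kotlin encoder, which runs through our patched
    // KotlinReflection and from there into Spark internals such as
    // DeserializerBuildHelper — hence the need for an exact signature match.
    dsOf(Person("Alice", 30), Person("Bob", 31)).show()
}
```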

"Databricks continues to develop and release features to Apache Spark. The Databricks Runtime includes additional optimizations and proprietary features that build upon and extend Apache Spark, including [Photon], an optimized version of Apache Spark rewritten in C++." right... Omg, they indeed have a LOT of changes done to Spark: https://docs.databricks.com/release-notes/runtime/11.3.html#apache-spark no wonder it doesn't work...

Also, I see they use Scala 2.12.14, while we use 2.12.16. It might be worth trying a `./gradlew -Pspark=3.3.0 -Pscala=2.12.14 clean publishMavenPublicationToMavenLocal` from the Kotlin API sources, just to be sure it's not that.
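
Your project should then resolve the locally published build before the released one, roughly like this (sketch; adjust the version if the local build publishes a different one):

```kotlin
// build.gradle.kts (sketch)
repositories {
    mavenLocal()   // where publishMavenPublicationToMavenLocal puts the artifact
    mavenCentral()
}

dependencies {
    // Adjust if your local build publishes a different version string.
    implementation("org.jetbrains.kotlinx.spark:kotlin-spark-api_3.3.0_2.12:1.2.1")
}
```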

Jolanrensen commented 1 year ago

So I've sent an email to Databricks with this issue. Not sure if they can or want to help, but it can't hurt to try, can it? :)

hawkaa commented 1 year ago

> So I've sent an email to Databricks with this issue. Not sure if they can or want to help, but it can't hurt to try, can it? :)

Thanks a ton! I'm also in touch with them. Really appreciate it!

Jolanrensen commented 1 year ago

> > So I've sent an email to Databricks with this issue. Not sure if they can or want to help, but it can't hurt to try, can it? :)
>
> Thanks a ton! I'm also in touch with them. Really appreciate it!

The more people who ask for it, the better :)