Kotlin / kotlin-spark-api

This project provides Kotlin bindings and several extensions for Apache Spark. We are looking to have this as a part of Apache Spark 3.x
Apache License 2.0

Default value in SparkHelper overrides config for spark.master #120

Closed christopherfrieler closed 2 years ago

christopherfrieler commented 2 years ago

Currently, org.jetbrains.kotlinx.spark.api.withSpark(props, master, appName, logLevel, func) provides a default value for the parameter master. This overrides the value from the external Spark config provided with the spark-submit command.
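
For reference, here is a Spark-free sketch of the problem. The defaults and the exact parameter list shown here are assumptions based on the issue text, not the library source:

fun withSpark(
    props: Map<String, Any> = emptyMap(),
    master: String = "local[*]",  // hard-coded default; shadows --master from spark-submit
    appName: String = "Kotlin Spark API",
    func: () -> Unit,
) {
    // stand-in for SparkSession.builder().master(master).appName(appName)...
    println("building session with master = \"$master\", appName = \"$appName\"")
    func()
}

fun main() {
    // The caller omits `master`, expecting spark-submit's --master yarn to apply,
    // but the parameter default "local[*]" is used instead:
    withSpark(props = mapOf("some_key" to "some_value")) { }
}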

In my case, I'm running Spark on YARN, so my submit command looks like ./spark-submit --master yarn .... But when I start the SparkSession in my code like this:

withSpark(
    props = mapOf("some_key" to "some_value"),
) {
    //... my Spark code
}

the default "local[*]" for the parameter master is used. This leads to a local Spark session on the application master (which kind of works, but does not make use of the executors provided by YARN), and to YARN complaining after my job finishes that I never started a Spark session, because it does not recognize the local one.

I think the value for master should, by default, be taken from the external config loaded by SparkConf(). As a workaround, I load the value myself:

withSpark(
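    // read spark.master from the external config loaded by SparkConf(); fall back to local[*]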
    master = SparkConf().get("spark.master", "local[*]"),
    props = mapOf("some_key" to "some_value"),
) {
    //... my Spark code
}

This works, but is not nice and duplicates the default value "local[*]".
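
For context on why the workaround works: SparkConf with its default loadDefaults = true reads any spark.* entries from the JVM system properties, and spark-submit sets spark.master there. A minimal check, assuming Spark is on the classpath:

import org.apache.spark.SparkConf

fun main() {
    // spark-submit --master yarn sets the JVM system property spark.master;
    // SparkConf() (loadDefaults = true) picks it up automatically.
    val master = SparkConf().get("spark.master", "local[*]")
    println("effective master: $master")  // "yarn" under spark-submit, "local[*]" otherwise
}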

Jolanrensen commented 2 years ago

Thanks for letting us know! How do you suggest we set the default value? Also, maybe @asm0dey has an idea?

asm0dey commented 2 years ago

@christopherfrieler thanks for the idea! @Jolanrensen it looks like Christopher has provided us with a complete implementation; we just need to change the default value :)

Jolanrensen commented 2 years ago

Sure! I just wanted to check whether this is the best way to go, since they did call it a "workaround" and "not nice", haha. But creating a new SparkConf() does seem to be the best way to access the system/JVM properties.
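
A minimal sketch of what the changed default might look like; the surrounding signature is assumed, and this is not necessarily the exact patch that landed:

import org.apache.spark.SparkConf

// Hypothetical revised signature: the default for `master` now consults the
// external config (populated by spark-submit) and only then falls back to local[*].
fun withSpark(
    props: Map<String, Any> = emptyMap(),
    master: String = SparkConf().get("spark.master", "local[*]"),
    func: () -> Unit,
) {
    println("master = $master")  // session construction elided in this sketch
    func()
}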

Jolanrensen commented 2 years ago

Added!