Closed dannymeijer closed 3 weeks ago
@dannymeijer Can you provide more context, why do we need to provide a session which is different from the active one ? Based on current implementation koheesio will take current Active Session. Building session is responsibility of user ( where they can create local/ remote session)
I don't intend to change the behavior, just want users to be able to explicitly be able to pass the SparkSession if they have configured one. This is to have compatibility with delta API for example.
I don't intend to change the behavior, just want users to be able to explicitly be able to pass the SparkSession if they have configured one. This is to have compatibility with delta API for example.
As for me I prefer not to have it separately and only allow active spark session, without possibility to pass anything.
If this is method of a class:
def clean_up_spark_streaming_checkpoint(spark: SparkSession) -> SparkSession:
# add desired config to the given spark session and return it
return spark
then it should be as:
def clean_up_spark_streaming_checkpoint(self) -> SparkSession:
# add desired config to the given spark session and return it
# self.spark.config.set()
return self.spark
Is your feature request related to a problem? Please describe.
A feature that was previously requested (aligned to this) was:
Currently, Koheesio has no specific interaction with or management of the underlying SparkSession (it simply expects the session to just 'be there').
Describe the solution you'd like
I would like to see us extend the SparkStep to be able to manage the spark session, and also be able to pass the SparkSession to the Koheesio Step as a field. At the moment, spark is only set up as a property. I propose we keep the property, but add a Field with a default. The property would then either pass whatever value is assigned to
self._spark
or the getActiveSession (as it does currently).By doing this we can introduce functionality like this (inspired by Delta library):
Suggestion on new SparkStep code:
This opens the door to future optimizations also.
Describe alternatives you've considered
Open for suggestions :)
Additional context
For reference, a user would call the functionality like this: