G-Research / spark-extension

A library that provides useful extensions to Apache Spark and PySpark.

Support Spark Connect server #248

Open EnricoMi opened 1 month ago

EnricoMi commented 1 month ago

With the introduction of the Spark Connect server, a Spark application can run against a remote server via the Spark Connect protocol: https://semyonsinchenko.github.io/ssinchenko/post/how-databricks-14x-breaks-3dparty-compatibility/

This new feature removes PySpark's direct access to the JVM via py4j. Almost all features of the spark-extension PySpark package rely on py4j.

The Spark Connect protocol supports plugins for Relations (DataFrames), Commands (actions with side effects that return no data) and Expressions. These can be used to gain access to JVM-side classes and instances: https://semyonsinchenko.github.io/ssinchenko/post/extending-spark-connect/

Alternatively, any logic based on the Scala Dataset API can be rewritten purely in the PySpark DataFrame API. However, this duplicates code, and the two implementations then need to be tested against each other (A/B testing); a simplified sketch of such a rewrite follows below.
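
For illustration only (this is not the library's actual implementation; the function name, column naming and diff semantics are made up), a Dataset-style two-DataFrame diff could be expressed with nothing but the public DataFrame API, so it works the same on Classic and Connect sessions:

```python
import pyspark.sql.functions as F
from pyspark.sql import DataFrame


def simple_diff(left: DataFrame, right: DataFrame, id_column: str) -> DataFrame:
    """Classify rows of two DataFrames with the same schema as inserted (I),
    deleted (D), changed (C) or unchanged (N), keyed by id_column."""
    value_cols = [c for c in left.columns if c != id_column]

    # mark on which side a row exists before the full outer join
    lhs = left.withColumn("_l_exists", F.lit(True)).alias("l")
    rhs = right.withColumn("_r_exists", F.lit(True)).alias("r")
    joined = lhs.join(rhs, on=id_column, how="full_outer")

    # a row counts as changed if any value column differs (null-safe comparison)
    changed = F.lit(False)
    for c in value_cols:
        changed = changed | ~F.col(f"l.{c}").eqNullSafe(F.col(f"r.{c}"))

    diff = (
        F.when(F.col("_r_exists").isNull(), "D")   # only in the left DataFrame
         .when(F.col("_l_exists").isNull(), "I")   # only in the right DataFrame
         .when(changed, "C")
         .otherwise("N")
    )

    columns = [diff.alias("diff"), F.col(id_column)]
    for c in value_cols:
        columns += [F.col(f"l.{c}").alias(f"left_{c}"),
                    F.col(f"r.{c}").alias(f"right_{c}")]
    return joined.select(*columns)
```

Since this only uses public DataFrame operations, a call like simple_diff(left, right, "id") needs no py4j and no server-side plugin.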

Making Scala classes available through Spark Connect plugins also requires duplicating classes in Python and Protobuf. Additionally, such plugins require extra configuration on the Spark Connect server to work.

SemyonSinchenko commented 1 month ago

I briefly checked the code, and it looks like it contains mostly pure functions. I did not see any hard-to-migrate cases such as a JVM object with mutable state. It does not seem very hard to define the main structures (like parameter case classes) as proto messages and write a Connect plugin.

But the problem is that the Python code generated by protoc won't fit the current PySpark API, so for backward compatibility one would have to re-create the existing Python API on top of the proto messages. At the same time, to support PySpark Classic, the old py4j-based API would have to be maintained too. Alternatively, the whole API could be rewritten around proto messages with a kind of dispatch mechanism (sketched below): if we are in a Connect environment we send the proto as bytes to the server, and if we are in a Classic environment we send the proto as bytes directly to _jvm.
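
To make the dispatch idea concrete, here is a rough sketch; the helper names, the proto message and the JVM entry point are all hypothetical, only the Connect-vs-Classic detection is meant as written:

```python
from pyspark.sql import DataFrame, SparkSession


def is_connect_session(spark: SparkSession) -> bool:
    # Heuristic: a Classic session exposes the py4j JVM gateway, a Connect
    # session does not (PySpark >= 3.4 also ships pyspark.sql.utils.is_remote()).
    return getattr(spark, "_jvm", None) is None


def call_via_connect_plugin(spark: SparkSession, payload: bytes) -> DataFrame:
    # Hypothetical: wrap the bytes into a custom Relation proto and let a
    # RelationPlugin registered on the Connect server turn it into a plan.
    raise NotImplementedError("requires a Spark Connect server plugin")


def call_via_py4j(spark: SparkSession, payload: bytes) -> DataFrame:
    # Hypothetical: hand the same serialized proto directly to a JVM-side
    # entry point via py4j (the 'bonus' approach from the blog post).
    jdf = spark._jvm.com.example.ExtensionEntryPoint.transform(payload)
    return DataFrame(jdf, spark)


def dispatch(spark: SparkSession, proto_message) -> DataFrame:
    # one Python API, one proto message, two transports
    payload = proto_message.SerializeToString()
    if is_connect_session(spark):
        return call_via_connect_plugin(spark, payload)
    return call_via_py4j(spark, payload)
```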

But it seems to me that it would be easier to rewrite the existing Python API in pure PySpark, simply because building such a dispatch mechanism would require rewriting the whole API anyway.

I can help with both options anyway; it looks like an interesting task.

EnricoMi commented 1 month ago

I think I'll go with rewriting everything in PySpark where possible. This keeps the tech stack tidy and is backward compatible. While going down this route, I might find some pieces where calling into the JVM via a Spark Connect plugin turns out to be worth the effort, and start that endeavour then.

I like the fact that you can use the protobuf messages with the classic Spark session, as described in the bonus section of your blog post. However, this only works from Spark 3.4 and above.

EnricoMi commented 1 month ago

@SemyonSinchenko I think my biggest concern is that, in order to have this package work as a Connect extension, users have to add the spark.connect.extensions.*.classes config to their environment. It would be great if adding --packages ... were sufficient to also add the conf (see the illustration below).
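
For illustration, this is roughly what a user currently has to spell out on top of pulling in the package; the package coordinates and the plugin class name are placeholders, and a standalone Connect server would need the same entries passed as --conf at start-up:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    # "local[*]" starts an in-process Spark Connect server (PySpark 3.4+ with
    # the connect dependencies installed), like pyspark --remote "local[*]".
    .remote("local[*]")
    # pulling in the jar is not enough ...
    .config("spark.jars.packages", "com.example:my-spark-extension_2.12:1.0.0")
    # ... the extension classes also have to be named explicitly
    .config("spark.connect.extensions.relation.classes",
            "com.example.connect.MyRelationPlugin")
    .getOrCreate()
)
```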

Do you know if there is some way to embed the Spark conf into the package jar?

SemyonSinchenko commented 1 month ago

I do not know of such a way, and I'm thinking about the same. I'm going to test whether it is possible to change CONNECT_EXTENSIONS_RELATION_CLASSES at runtime from the user's code (like spark.conf.set)... If it works, the solution would be to just give the user an example of what has to be added to the conf, or even to set it implicitly from the Python library code.
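
For reference, the experiment could look something like this (run against a session that is connected to a Spark Connect server; the plugin class name is a placeholder):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

try:
    spark.conf.set(
        "spark.connect.extensions.relation.classes",
        "com.example.connect.MyRelationPlugin",  # placeholder class name
    )
    print("accepted at runtime:",
          spark.conf.get("spark.connect.extensions.relation.classes"))
except Exception as e:
    # In Classic this surfaces as an AnalysisException along the lines of
    # "Cannot modify the value of a static config" if the entry is static;
    # a Connect session may report the server-side error differently.
    print("cannot be changed at runtime:", e)
```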