Sedona python API on databricks unity / spark connect

sebbegg commented 1 month ago

Hi,

we're currently using sedona mostly on Azure Databricks. With databricks strongly promoting its Unity Catalog, we've tested/investigated this and found some issues with running sedona on Databricks clusters using the shared access mode. In this context spark (apparently) runs via spark-connect.

Actual behavior

Calling SedonaContext.create(spark) on such a cluster produces the following output:

After doing a bit of research, it seems that the python API in general relies heavily on the spark._jvm attribute, which doesn't exist when running via spark-connect. Are there any plans/possibilities to make the python apis spark-connect compatible?

Steps to reproduce the problem

Setup databricks cluster:

Databricks Runtime 14.3 LTS
%pip install apache-sedona==1.6.0 (with appropriate java deps installed)

Settings

Sedona version = 1.6.0

Apache Spark version = 3.5.0

API type = Python

Scala version = 2.12

JRE version = 1.8

Python version = 3.10

Environment = Azure Databricks

jiayuasu commented 1 month ago

@sebbegg Please follow our Databricks tutorial here. In short, you don't need to call SedonaContext.create() on Databricks because Sedona is registered via another config.

See https://sedona.apache.org/1.6.1/setup/databricks/#advanced-editions

jiayuasu commented 1 month ago

That said, Shared Access clusters on Databricks do not allow Spark DataSourceV2. This prevents you from using the Sedona GeoJSON reader/writer and GeoParquet reader/writer. Until Databricks fixes this limitation, you won't be able to use these data sources on a Databricks Shared Access cluster.

sebbegg commented 1 month ago

Our setup matches what your docs state and we don't have any issues with sedona on normal clusters or those with single-user mode. I think the screenshot below shows that - the SQL api is fine, what seems broken is the python api:

sebbegg commented 1 month ago

In the end I think this boils down to the fact that, with spark-connect, the _jvm attribute that's used throughout the python api:

https://github.com/apache/sedona/blob/678da0044c7eacc48db68ff96a41cf08e6a5f2a8/python/sedona/sql/dataframe_api.py#L70

Doesn't exist in a spark connect session:

https://github.com/apache/spark/blob/b056e0b12786f0b85675cdf73748bdf506e3619f/python/pyspark/sql/connect/session.py#L915-L920

jiayuasu commented 1 month ago

@sebbegg Got it. Yes, this is true unfortunately. Our Python DataFrame API relies on the JVM attribute, which is unavoidable.

sebbegg commented 1 month ago

Did a bit of digging into what spark does in a spark-connect environment, and to me it seems rather straightforward:

https://github.com/apache/spark/blob/v3.5.3/python/pyspark/sql/connect/functions.py#L3899-L3901

Applying this to sedona:

I think in this fashion it should be possible to make call_sedona_function spark-connect compatible: https://github.com/apache/sedona/blob/678da0044c7eacc48db68ff96a41cf08e6a5f2a8/python/sedona/sql/dataframe_api.py#L51-L74

call_function has only existed since spark 3.5.0, but its implementation seems simple enough to maybe integrate into sedona.
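
A minimal sketch of that idea, not the actual patch: the names `is_connect_session` and `call_sedona_function_sketch` are hypothetical, detecting spark-connect by the session class's module is an assumed heuristic, and the sketch assumes the Sedona SQL functions are already registered on the cluster so that `pyspark.sql.functions.call_function` (spark >= 3.5.0) can resolve them by name.

```python
def is_connect_session(spark) -> bool:
    """Heuristic (assumption): a spark-connect session is an instance of a
    class defined somewhere under the pyspark.sql.connect package."""
    return any(
        cls.__module__.startswith("pyspark.sql.connect")
        for cls in type(spark).__mro__
    )


def call_sedona_function_sketch(spark, function_name, *cols):
    """Hypothetical dispatcher: avoid spark._jvm on spark-connect sessions."""
    if is_connect_session(spark):
        # Imported lazily; call_function only exists since spark 3.5.0.
        from pyspark.sql.functions import call_function

        # Builds an unresolved function call by name, e.g. "ST_Point",
        # without touching the JVM gateway.
        return call_function(function_name, *cols)

    # Classic sessions would keep the existing spark._jvm based code path,
    # which is intentionally left out of this sketch.
    raise NotImplementedError("classic (non-connect) path not shown here")
```

On a classic session the existing `_jvm` based implementation would stay in place, so only the spark-connect branch is new.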

Would you be willing to accept a PR if I find the time to investigate? My expectation is that spark-connect might become more common in the future.

jiayuasu commented 1 month ago

@sebbegg 100%! We will be happy to accept this PR if you can make it work!

sebbegg commented 1 month ago

I've created https://issues.apache.org/jira/browse/SEDONA-663 and https://github.com/apache/sedona/pull/1639.
