sebbegg closed this issue 4 weeks ago
@sebbegg Please follow our Databricks tutorial here. In short, you don't need to call SedonaContext.create() on Databricks because Sedona is registered via another config.
See https://sedona.apache.org/1.6.1/setup/databricks/#advanced-editions
That said, the Shared Access cluster on Databricks does not allow Spark DataSourceV2. This will prevent you from using Sedona's GeoJSON and GeoParquet readers/writers. Until Databricks fixes this limitation, you won't be able to use these data sources on a Databricks Shared Access cluster.
Our setup matches what your docs state and we don't have any issues with sedona on normal clusters or those with single-user mode. I think the screenshot below shows that - the SQL api is fine, what seems broken is the python api:
In the end I think this boils down to the fact that the _jvm attribute, which is used throughout the python api, doesn't exist in a spark-connect session.
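For anyone who wants to check which kind of session they are in, here is a minimal sketch. The helper name has_jvm_gateway is made up for illustration and is not part of Sedona or PySpark. It assumes that a spark-connect session either has no _jvm attribute or raises when it is accessed:

```python
def has_jvm_gateway(spark) -> bool:
    """Return True if the session exposes a usable ``_jvm`` attribute.

    Hypothetical helper, not part of Sedona or PySpark. Assumption: on a
    spark-connect session, reading ``_jvm`` either fails with an
    AttributeError-like exception or yields nothing useful.
    """
    try:
        # getattr with a default swallows AttributeError subclasses.
        return getattr(spark, "_jvm", None) is not None
    except Exception:
        # Any other failure while touching _jvm is treated as "no JVM here".
        return False
```

On a single-user cluster has_jvm_gateway(spark) would be expected to return True, and on a shared access cluster False, but I have only reasoned about this and not verified it on every runtime.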
@sebbegg Got it. Yes, this is true unfortunately. Our Python DataFrame API relies on the JVM attribute, which is unavoidable.
Did a bit of digging into what spark does in a spark-connect environment and to me it seems rather straightforward:
https://github.com/apache/spark/blob/v3.5.3/python/pyspark/sql/connect/functions.py#L3899-L3901
Applying this to sedona:
I think in this fashion it should be possible to make call_sedona_function spark-connect compatible:
https://github.com/apache/sedona/blob/678da0044c7eacc48db68ff96a41cf08e6a5f2a8/python/sedona/sql/dataframe_api.py#L51-L74
The call_function only exists since spark 3.5.0, but the content seems simple enough to maybe integrate into sedona.
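A rough sketch of the dispatch idea, not the actual Sedona implementation. The names is_connect_session, call_sedona_function_sketch and jvm_fallback are hypothetical. It assumes that spark-connect session classes live under the pyspark.sql.connect package, that the Sedona functions are already registered on the server side, and that all arguments are columns or column names (literals would need extra handling):

```python
from typing import Any, Callable


def is_connect_session(spark: Any) -> bool:
    # Assumption: spark-connect sessions are defined in pyspark.sql.connect.*
    return type(spark).__module__.startswith("pyspark.sql.connect")


def call_sedona_function_sketch(
    spark: Any,
    function_name: str,
    *cols: Any,
    jvm_fallback: Callable[..., Any],
) -> Any:
    """Call a registered SQL function by name without touching ``_jvm``
    when running on spark-connect; otherwise use the existing JVM path."""
    if is_connect_session(spark):
        # Imported lazily because call_function requires pyspark >= 3.5.0.
        from pyspark.sql.functions import call_function

        # Strings are interpreted as column names by call_function.
        return call_function(function_name, *cols)
    # Classic session: delegate to whatever JVM-based code is used today.
    return jvm_fallback(function_name, *cols)
```

For Spark versions before 3.5.0 the connect branch would need its own equivalent of call_function, which is the part that might have to be integrated into sedona.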
Would you be willing to accept a PR if I find the time to investigate? My expectation is that spark-connect might become more common in the future.
@sebbegg 100%! We will be happy to accept this PR if you can make it work!
Hi,
we're currently using sedona mostly on Azure Databricks. With databricks strongly promoting its Unity Catalog we've tested/investigated this and found some issues with running sedona on Databricks clusters using the shared access mode. In this context spark (apparently) runs via spark-connect.
Actual behavior
Calling SedonaContext.create(spark) on such a cluster produces the following output:
After doing a bit of research, it seems that the python API in general relies heavily on the spark._jvm attribute, which doesn't exist when running via spark-connect. Are there any plans/possibilities to make the python APIs spark-connect compatible?
Steps to reproduce the problem
Set up a databricks cluster:
%pip install apache-sedona==1.6.0
(with appropriate java deps installed)
Settings
Sedona version = 1.6.0
Apache Spark version = 3.5.0
API type = Python
Scala version = 2.12
JRE version = 1.8
Python version = 3.10
Environment = Azure Databricks