apache / sedona

A cluster computing framework for processing large-scale geospatial data
https://sedona.apache.org/
Apache License 2.0

apache-sedona failure #1688

Open tony189 opened 1 day ago

tony189 commented 1 day ago

Installing JAR libraries from an init script and then apache-sedona 1.6.0 or 1.6.1 makes it impossible to execute any notebook, throwing the error:

Failure starting repl. Try detaching and re-attaching the notebook.

at com.databricks.spark.chauffeur.ExecContextState.processInternalMessage(ExecContextState.scala:347)
at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:1034)

It doesn't matter if I reattach, create a new notebook, or restart... it always fails. Without apache-sedona everything works.

Settings

Sedona version = sedona-spark-shaded-3.4_2.12-1.6.1.jar, geotools-wrapper-1.6.1-28.2.jar

Apache Spark version = 3.5.0

Environment = Azure, Databricks

github-actions[bot] commented 1 day ago

Thank you for your interest in Apache Sedona! We appreciate you opening your first issue. Contributions like yours help make Apache Sedona better.

Kontinuation commented 1 day ago

Do you have additional Python libraries installed (including the apache-sedona Python library)? I've seen similar issues before and resolved them by adding two more dependencies to pin the versions of numpy and pandas.

According to the linked issue, installing rasterio<1.4.0 before installing sedona also resolves the problem. If it doesn't, head to the "Driver logs" of your Databricks cluster to gather more information.
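
For illustration only, the cluster-scoped PyPI library list would then look something like the following; the numpy and pandas pins are placeholders to be filled with versions matching your Databricks Runtime, and apache-sedona==1.6.1 is assumed from the jar versions above.

    numpy==<pinned-version>      (placeholder pin)
    pandas==<pinned-version>     (placeholder pin)
    rasterio<1.4.0               (installed before apache-sedona, per the linked issue)
    apache-sedona==1.6.1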

jiayuasu commented 21 hours ago

In addition, if you use Spark 3.5.0, the Sedona jar should be sedona-spark-shaded-3.5_2.12-1.6.1.jar, not the 3.4 build.
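
In other words, for a Spark 3.5.0 / Scala 2.12 cluster, the pair of jars from the settings above would be:

    sedona-spark-shaded-3.5_2.12-1.6.1.jar
    geotools-wrapper-1.6.1-28.2.jar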

tony189 commented 3 hours ago

I didn't notice what you pointed out, @jiayuasu. I changed it, but it still didn't work.

Installing numpy and pandas as @Kontinuation said fixes the problem... even though both libraries are already included in every Databricks cluster as standard config, with those exact versions...

However, this way the cluster takes ages to start... I think I'm going to stick with SQL and forget about Python.
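
For the SQL-only route, a minimal sketch, assuming the shaded jar is already on the cluster and Sedona's SQL functions are registered (for example via spark.sql.extensions org.apache.sedona.sql.SedonaSqlExtensions in the cluster's Spark config; that setup detail is an assumption, not something stated in this thread):

    -- Sanity check that the Sedona SQL functions resolve on the cluster.
    SELECT ST_Distance(ST_Point(0.0, 0.0), ST_Point(3.0, 4.0)) AS dist;
    -- expected result: 5.0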