apache / sedona

A cluster computing framework for processing large-scale geospatial data
https://sedona.apache.org/
Apache License 2.0

Unable to Load Geojson File using Sedona Context in Databricks #1617

Open Kunal-Mishra10 opened 1 week ago

Kunal-Mishra10 commented 1 week ago

Expected behavior

I am trying to run the following code in Databricks, taken from the official Sedona documentation:

from pyspark.sql import functions as f

df = (
    sedona.read.format("geojson")
    .option("multiLine", "true")
    .load("PATH/TO/MYFILE.json")
    .selectExpr("explode(features) as features")  # Explode the envelope to get one feature per row.
    .select("features.*")  # Unpack the features struct.
    .withColumn("prop0", f.expr("properties['prop0']"))
    .drop("properties")
    .drop("type")
)

df.show()
df.printSchema()

Ref: https://sedona.apache.org/latest-snapshot/tutorial/sql/#__tabbed_14_3

I am getting the following error:

Caused by: java.lang.NoSuchMethodError: org.apache.spark.sql.execution.datasources.json.JsonDataSource.readFile(Lorg/apache/hadoop/conf/Configuration;Lorg/apache/spark/sql/execution/datasources/PartitionedFile;Lorg/apache/spark/sql/catalyst/json/JacksonParser;Lorg/apache/spark/sql/types/StructType;)Lscala/collection/Iterator;
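A `NoSuchMethodError` means the JVM could not find a method with exactly this signature at run time, which usually happens when a jar was compiled against a different build of a class than the one actually loaded. The descriptor in the message is in the JVM's internal format, so the sketch below decodes it into a readable signature that can be compared against the Spark build on the cluster. The helper names (`decode_missing_method`, `_parse_type`) are hypothetical and are not part of Sedona or Spark.

```python
import re

# Single-character JVM primitive type codes.
_PRIMITIVES = {
    "B": "byte", "C": "char", "D": "double", "F": "float",
    "I": "int", "J": "long", "S": "short", "Z": "boolean", "V": "void",
}


def _parse_type(desc, pos):
    """Parse one JVM type starting at desc[pos]; return (name, next_pos)."""
    dims = 0
    while desc[pos] == "[":  # each '[' adds one array dimension
        dims += 1
        pos += 1
    if desc[pos] == "L":  # object type, written as L<class/path>;
        end = desc.index(";", pos)
        name = desc[pos + 1:end].replace("/", ".")
        pos = end + 1
    else:  # primitive type, a single character
        name = _PRIMITIVES[desc[pos]]
        pos += 1
    return name + "[]" * dims, pos


def decode_missing_method(message):
    """Split 'pkg.Class.method(<args>)<return>' into readable parts."""
    match = re.fullmatch(r"(.+)\.([^.(]+)\((.*)\)(.+)", message.strip())
    if match is None:
        raise ValueError("not a JVM method descriptor: " + message)
    owner, method, args, ret = match.groups()
    params, pos = [], 0
    while pos < len(args):
        name, pos = _parse_type(args, pos)
        params.append(name)
    return_type, _ = _parse_type(ret, 0)
    return {"class": owner, "method": method,
            "params": params, "returns": return_type}


if __name__ == "__main__":
    # The text that follows "java.lang.NoSuchMethodError: " in the trace above.
    msg = (
        "org.apache.spark.sql.execution.datasources.json.JsonDataSource.readFile("
        "Lorg/apache/hadoop/conf/Configuration;"
        "Lorg/apache/spark/sql/execution/datasources/PartitionedFile;"
        "Lorg/apache/spark/sql/catalyst/json/JacksonParser;"
        "Lorg/apache/spark/sql/types/StructType;)Lscala/collection/Iterator;"
    )
    info = decode_missing_method(msg)
    print(info["class"] + "." + info["method"])
    for p in info["params"]:
        print("  param:", p)
    print("  returns:", info["returns"])
```

Read this way, the caller expects `JsonDataSource.readFile` to take a `Configuration`, a `PartitionedFile`, a `JacksonParser` and a `StructType`, and the Spark classes loaded on the cluster do not expose that exact overload.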

Actual behavior

The GeoJSON file should be loaded into the DataFrame.

Steps to reproduce the problem

I have installed the following jar files

I have installed the following libraries

Settings

Sedona version = 1.6.1

Apache Spark version = 3.5.0 (also fails with Spark 3.4)

Apache Flink version = NA

API type = Python

Scala version = 2.12

JRE version = 1.8

Python version = ?

Environment = Databricks
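Since the error is a binary mismatch, one quick sanity check is whether the installed Sedona jar was built for the Spark and Scala versions listed above. The sketch below assumes the jar follows the `sedona-spark-shaded-<spark major.minor>_<scala>-<sedona version>.jar` naming pattern. The jar name in the usage lines is a hypothetical example, because the issue does not list the jars that were actually installed, and `check_sedona_jar` is an illustrative helper, not a Sedona API.

```python
import re

# Assumed naming pattern:
#   sedona-spark-shaded-<spark major.minor>_<scala>-<sedona version>.jar
JAR_PATTERN = re.compile(
    r"sedona-spark(?:-shaded)?-(?P<spark>\d+\.\d+)_(?P<scala>\d+\.\d+)"
    r"-(?P<sedona>[\w.\-]+)\.jar$"
)


def check_sedona_jar(jar_name, spark_version, scala_version):
    """Return a list of mismatch descriptions; an empty list means no mismatch found."""
    match = JAR_PATTERN.search(jar_name)
    if match is None:
        return ["unrecognised jar name: " + jar_name]
    problems = []
    # Compare only major.minor, e.g. "3.5.0" -> "3.5".
    # This is a strict comparison; a jar built for a different minor
    # version is always reported, even if it would happen to work.
    runtime_spark = ".".join(spark_version.split(".")[:2])
    if match["spark"] != runtime_spark:
        problems.append(
            "jar targets Spark " + match["spark"] + ", cluster runs " + runtime_spark
        )
    if match["scala"] != scala_version:
        problems.append(
            "jar targets Scala " + match["scala"] + ", cluster runs " + scala_version
        )
    return problems


if __name__ == "__main__":
    # Hypothetical jar names, checked against the settings reported above.
    print(check_sedona_jar("sedona-spark-shaded-3.5_2.12-1.6.1.jar", "3.5.0", "2.12"))
    print(check_sedona_jar("sedona-spark-shaded-3.4_2.12-1.6.1.jar", "3.5.0", "2.12"))
```

A clean result only rules out an obvious version mismatch. Databricks runtimes ship their own Spark build, so the error can still occur when the jar name matches.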

james-willis commented 5 days ago

Are you using a Shared Access cluster in Databricks?

Copying something Jia said in another thread:

the Shared Access cluster on Databricks does not allow Spark DataSourceV2. This will prevent you from using Sedona GeoJSON reader/writer, GeoParquet reader/writer. Until Databricks fixes this limitation, you won't be able to use these data sources on Databricks Shared access cluster.
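If the GeoJSON data source is unavailable, one possible fallback for files small enough to read on the driver is to flatten the FeatureCollection with Python's standard `json` module and build the DataFrame from plain rows. This is only a sketch of that idea and has not been verified on a Shared Access cluster. `flatten_feature_collection` is a hypothetical helper, and each geometry is kept as a GeoJSON string.

```python
import json


def flatten_feature_collection(text):
    """Turn a GeoJSON FeatureCollection string into one dict per feature."""
    doc = json.loads(text)
    if doc.get("type") != "FeatureCollection":
        raise ValueError("expected a GeoJSON FeatureCollection")
    rows = []
    for feature in doc.get("features", []):
        # Copy the properties so the parsed document is left untouched.
        row = dict(feature.get("properties") or {})
        # Keep the geometry as a GeoJSON string; note that this overwrites
        # any property that happens to be named "geometry".
        row["geometry"] = json.dumps(feature.get("geometry"))
        rows.append(row)
    return rows


if __name__ == "__main__":
    sample = """
    {"type": "FeatureCollection",
     "features": [
       {"type": "Feature",
        "geometry": {"type": "Point", "coordinates": [102.0, 0.5]},
        "properties": {"prop0": "value0"}}
     ]}
    """
    for row in flatten_feature_collection(sample):
        print(row)
```

The resulting rows could then be passed to `spark.createDataFrame(rows)`, and the `geometry` string converted with Sedona's `ST_GeomFromGeoJSON` SQL function, assuming Sedona's SQL functions can be registered on the cluster at all.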