apache / sedona

A cluster computing framework for processing large-scale geospatial data
https://sedona.apache.org/
Apache License 2.0

No geo metadata found with geoparquet #1059

Closed FengJiang2018 closed 1 year ago

FengJiang2018 commented 1 year ago

Expected behavior

The GeoParquet file should have geo metadata generated, and reading it should not raise an error:

 df = sedona.read.format("geoparquet").load(path)

Here are the error details:

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 22.0 failed 4 times, most recent failure: Lost task 0.3 in stage 22.0 (TID 115) (10.139.64.12 executor 0): java.lang.IllegalArgumentException: GeoParquet file does not contain valid geo metadata

Actual behavior

The GeoParquet file was created without geo metadata, and reading it raises an error:

 df = sedona.read.format("geoparquet").load(path)

Steps to reproduce the problem

It seems the issue is that when I use df.write to write a GeoParquet file, the geo metadata is not created for the Sedona geometry column. I am not sure if I missed anything.

1. I am using the public Overture dataset as input for the DataFrame, with a Sedona geometry column, as follows:

df_building = sedona.read.option("inferschema",True).parquet(inputpath) \
        .withColumn("geometry2",expr("ST_GeomFromWKB(geometry)"))
df_building.createOrReplaceTempView("rawdf")
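For context, `ST_GeomFromWKB` decodes the Well-Known Binary bytes stored in the `geometry` column into a geometry value. A minimal pure-Python sketch of the WKB layout for a 2D point (an illustration of the byte format only, not Sedona's implementation):

```python
import struct

def parse_wkb_point(wkb: bytes):
    """Decode a WKB-encoded 2D Point (minimal sketch: no SRID, Z, or M support)."""
    byte_order = "<" if wkb[0] == 1 else ">"  # first byte: 1 = little-endian, 0 = big-endian
    (geom_type,) = struct.unpack(byte_order + "I", wkb[1:5])  # 4-byte geometry type code
    assert geom_type == 1, "only Point (type 1) supported in this sketch"
    x, y = struct.unpack(byte_order + "dd", wkb[5:21])  # two 8-byte doubles
    return x, y

# round-trip demo: encode a little-endian WKB point, then parse it back
wkb = struct.pack("<bIdd", 1, 1, -122.3, 47.6)
print(parse_wkb_point(wkb))  # (-122.3, 47.6)
```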

2. Yes, I am using a DataFrame to write a GeoParquet file with a Sedona geometry-type column on Databricks:

new_df = spark.sql("select *, ST_GeoHash(geometry2, 5) as geohash  from rawdf order by geohash").drop("geometry").withColumnRenamed("geometry2", "geometry")
new_df.write.mode("overwrite").format("geoparquet") \
        .save(path+"/final.parquet")

Here is what I see from printSchema: the column shows as geometry type, and nullable is true, which seems expected. Correct me if this is wrong.

root
 |-- geometry: geometry (nullable = true)
 |-- geohash: string (nullable = true)

3. I get an error when I read the GeoParquet file from step 2 as follows:

df = sedona.read.format("geoparquet").load(newpath)

Here are the error details:

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 22.0 failed 4 times, most recent failure: Lost task 0.3 in stage 22.0 (TID 115) (10.139.64.12 executor 0): java.lang.IllegalArgumentException: GeoParquet file does not contain valid geo metadata

However, there is no read error if I use the following code (the .parquet() call sets the format to plain Parquet, overriding the earlier .format("geoparquet")), but no geo metadata can be found in the df schema:

df = sedona.read.format("geoparquet").parquet(newpath)

Settings

Sedona version = 1.5.0

Apache Spark version = 3.4.0

Apache Flink version = N/A

API type = Python

Scala version = 2.12

JRE version = 1.8

Python version = 3.10

Environment = Azure Databricks, notebook

jiayuasu commented 1 year ago

@FengJiang2018 I think this is probably because Databricks Spark uses different internal APIs for data sources compared to open-source Apache Spark. I just tested the Sedona GeoParquet reader using our Docker image (https://hub.docker.com/r/apache/sedona) and it works fine. Could you let me know which Databricks runtime version and which Sedona version you are using?

Do you mind contacting my email (jiayu@apache.org)?

FengJiang2018 commented 1 year ago

Disabling the Photon acceleration option on the cluster solved the read/write problem. Looking forward to seeing Photon supported in the future, as it brings significant performance gains.