Closed: safegraph-clarissa closed this 2 months ago
Can you please try the updated instructions from the Sedona section: https://github.com/OvertureMaps/data#4-apache-sedona-python--spatial-sql. Use `sedona.read.format("geoparquet")` to read the GeoParquet, and you should no longer need `ST_GeomFromWKB()`; you can apply the spatial functions directly to the `geometry` column, for example:
```python
import pyspark.sql.functions as F
from sedona.spark import *

config = (
    SedonaContext.builder()
    .config(
        "fs.s3a.aws.credentials.provider",
        "org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider",
    )
    .getOrCreate()
)
sedona = SedonaContext.create(config)

november = (
    sedona.read.format("geoparquet")
    .load("s3://overturemaps-us-west-2/release/2023-11-14-alpha.0/theme=places/*/")
    .withColumn("location_name", F.col("names")["common"][0]["value"])
    .select("location_name", "geometry")
    .cache()
)

november.printSchema()
november.filter(
    "ST_Contains(ST_GeomFromWKT('POLYGON((-122.48 47.43,-122.20 47.75,"
    "-121.92 47.37,-122.48 47.43))'), geometry) = true"
).show(5)
```
```
root
 |-- location_name: string (nullable = true)
 |-- geometry: geometry (nullable = true)

+--------------------+--------------------+
|       location_name|            geometry|
+--------------------+--------------------+
|Core Centric Pers...|POINT (-122.46137...|
|Vashon Island Her...|POINT (-122.46331...|
|    The Brown Agency|POINT (-122.46137...|
|        Raven's Nest|POINT (-122.46028...|
|      Vashon Library|POINT (-122.45953...|
+--------------------+--------------------+
only showing top 5 rows
```
@safegraph-clarissa Can you let us know if this fixed the issue for you?
@ibnt1 @FengJiang2018 @jiayuasu
To what extent does the Spark metadata play a role here? I noticed that additional metadata gets written, and it looks like it changed from October to November. Here's something that stuck out to me — you can see what changed for the geometry column between the last two releases:
```python
import json

import pyarrow.parquet as pq
from pyarrow.fs import S3FileSystem

fs = S3FileSystem(anonymous=True, region="us-west-2")

def get_geometry_metadata(s3_path):
    ds = pq.ParquetDataset(s3_path, filesystem=fs)
    # Just grab the first file
    pqf = pq.ParquetFile(ds.files[0], filesystem=fs)
    spark_metadata = json.loads(
        pqf.metadata.metadata[b"org.apache.spark.sql.parquet.row.metadata"]
    )
    geom = [f for f in spark_metadata["fields"] if f["name"] == "geometry"]
    return geom[0]

october_geom = get_geometry_metadata("overturemaps-us-west-2/release/2023-10-19-alpha.0")
november_geom = get_geometry_metadata("overturemaps-us-west-2/release/2023-11-14-alpha.0")
```
```
In [3]: october_geom
Out[3]: {'name': 'geometry', 'type': 'binary', 'nullable': True, 'metadata': {}}

In [4]: november_geom
Out[4]:
{'name': 'geometry',
 'type': {'type': 'udt',
  'class': 'org.apache.spark.sql.sedona_sql.UDT.GeometryUDT',
  'pyClass': 'sedona.sql.types.GeometryType',
  'sqlType': 'binary'},
 'nullable': True,
 'metadata': {}}
```
Spark primarily uses `org.apache.spark.sql.parquet.row.metadata` to infer the schema of Parquet files. It falls back to the native Parquet schema only when `org.apache.spark.sql.parquet.row.metadata` is absent.
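As a concrete, dependency-free illustration of that schema difference, a small helper can tell the two encodings apart. This is only a sketch: the function name is mine, and the two metadata shapes are copied from the October and November outputs shown in this thread.

```python
import json

def geometry_is_sedona_udt(spark_row_metadata_json: str) -> bool:
    """Return True if the 'geometry' field is typed as a Sedona GeometryUDT
    rather than plain 'binary' in Spark's row metadata."""
    schema = json.loads(spark_row_metadata_json)
    for field in schema["fields"]:
        if field["name"] == "geometry":
            t = field["type"]
            # In the UDT case "type" is a dict carrying the Java class name;
            # in the plain case it is just the string "binary".
            return isinstance(t, dict) and t.get("class", "").endswith("GeometryUDT")
    return False

# Metadata shapes as observed in the October and November releases:
october = json.dumps({"fields": [
    {"name": "geometry", "type": "binary", "nullable": True, "metadata": {}}]})
november = json.dumps({"fields": [
    {"name": "geometry",
     "type": {"type": "udt",
              "class": "org.apache.spark.sql.sedona_sql.UDT.GeometryUDT",
              "pyClass": "sedona.sql.types.GeometryType",
              "sqlType": "binary"},
     "nullable": True, "metadata": {}}]})

print(geometry_is_sedona_udt(october))   # False
print(geometry_is_sedona_udt(november))  # True
```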
A breaking change was made to the `sqlType` of geometry in Sedona 1.4.0 (changed from `array[byte]` to `binary`), which introduced an incompatibility in the Spark metadata stored in Parquet files. This means earlier versions of Sedona have problems reading Parquet files written by recent versions of Sedona. For the GeoParquet writers in Sedona, we can change the metadata of geometry columns to the `binary` type instead of `geometry` to avoid compatibility issues. @jiayuasu
@jwass @ibnt1 @FengJiang2018 In short, this issue has nothing to do with the released Parquet or GeoParquet files from OMF; you are on the safe side. Any Sedona 1.4.0+ user should be able to read the data using `sedona.read.format("geoparquet")` OR `spark.read.parquet("XXX")` + `ST_GeomFromWKB`. Any Spark user (no Sedona installed) will be able to read the data using `spark.read.parquet("XXX")`.
SafeGraph (@safegraph-clarissa) has been using Sedona for a very long time. Based on my previous conversation with them, they were using an old version of Sedona (likely <= Sedona 1.2.1-incubating). Before Sedona 1.4.0, there was a minor issue when people directly wrote the Sedona geometry type to Parquet files without using `ST_AsWKB`. We fixed that in Sedona 1.4.0 by introducing a breaking change.
If SafeGraph upgrades their Sedona version to 1.4.0+, the issue of reading Overture 2023-11-14-alpha.0 will be gone (both `sedona.read.format("geoparquet")` AND `spark.read.parquet("XXX")` + `ST_GeomFromWKB` will work). However, they might need to rewrite their old internal data using the new Sedona version. We will provide a compatibility mode in the new Sedona release to help old Sedona users migrate to the new version.
@Kontinuation @jiayuasu I think I get it now. Thank you for the explanation.
I still wonder if Overture should remove all the Spark metadata from our published files but that can be a separate discussion.
I am running into similar issues when reading in the latest release. I am running Spark 3.3 and Sedona 1.4.1.
I followed the instructions, running:

```python
from sedona.spark import *

config = (
    SedonaContext.builder()
    .config(
        "fs.s3a.aws.credentials.provider",
        "org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider",
    )
    .getOrCreate()
)
sedona = SedonaContext.create(config)
df = sedona.read.format("geoparquet").load(
    "s3a://overturemaps-us-west-2/release/2023-11-14-alpha.0/theme=places/type=place"
)
```

When I run `display(df)` or try to write the file, I get this error:

```
java.lang.NoSuchMethodError: org.apache.spark.sql.internal.SQLConf.parquetFilterPushDownStringStartWith()Z
```

Has anyone encountered the above error?
I've seen this issue before. It happens when using Spark 3.3.0 on Databricks due to a binary incompatibility with Apache Spark; Spark 3.3.1 and 3.3.2 on Databricks do not have this problem. We can address this incompatibility with DBR Spark 3.3.0 in future releases.
Another thing to mention: the GeoParquet reader of Apache Sedona does not work with Photon; it only works on DBR clusters with Photon disabled.
@jwass This issue has been fixed. If OMF upgrades Sedona version to 1.5.1, then the files written by Sedona 1.5.1 can be read by any older version of Sedona.
Thanks! I'll forward this on to see if it can be upgraded by the next release.
Closed as fixed
It seems that the reader I'm using can't interpret the geometry column like it has in the past. This is what I originally used to pull the October release, and it worked perfectly:
When I try using the above with the November release, I get an error message:
I tried running this instead:
but I'm still getting an error message:
It could be something on my end, but were there changes to the geometry column, and/or is there a way to confirm the column is encoded correctly?