OvertureMaps / data

Overture Maps Data
https://docs.overturemaps.org

unable to pull Overture 2023-11-14-alpha.0 (places) #89

Closed safegraph-clarissa closed 2 months ago

safegraph-clarissa commented 10 months ago

It seems that the reader I'm using can't interpret the geometry column like it has in the past. This is what I originally used to pull the October release, and it worked perfectly:

import pyspark.sql.functions as F

october = (
    spark.read.parquet("s3://overturemaps-us-west-2/release/2023-10-19-alpha.0/theme=places")
    .withColumn("location_name", F.col("names")["common"][0]["value"])
    .cache()
)

When I try using the above with the November release, I get an error message (screenshot attached).

I tried running this instead:

import pyspark.sql.functions as F

from sedona.register import SedonaRegistrator
from sedona.utils import SedonaKryoRegistrator, KryoSerializer

SedonaRegistrator.registerAll(spark)

november = (
    spark.read.parquet("s3://overturemaps-us-west-2/release/2023-11-14-alpha.0/theme=places/*/")
    .withColumn("location_name", F.col("names")["common"][0]["value"])
    .withColumn("geometry", F.expr("ST_GeomFromWKB(geometry)"))
    .cache()
)

but I'm still getting an error message:

FileReadException: Error while reading file s3://overturemaps-us-west-2/release/2023-11-14-alpha.0/theme=places/type=place/part-00010-6cb89013-4ec2-4b94-8e4b-8e27c7d30865.c000.zstd.parquet. Parquet column cannot be converted. Column: [geometry], Expected: ArrayType(ByteType,false), Found: BINARY

It could be something on my end, but were there changes to the geometry column, and/or is there a way to confirm the column is encoded correctly?
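For context, the `Found: BINARY` in the error above means the geometry column holds raw WKB (well-known binary) bytes, which is why `ST_GeomFromWKB` is involved at all. A minimal stdlib-only sketch of what those bytes look like for a point (the helper name is illustrative, not part of any library):

```python
import struct

def decode_wkb_point(wkb: bytes):
    """Decode a WKB-encoded Point into (x, y) coordinates."""
    # Byte 0 is the byte-order flag: 1 = little-endian, 0 = big-endian.
    (byte_order,) = struct.unpack_from("B", wkb, 0)
    endian = "<" if byte_order == 1 else ">"
    # Bytes 1-4 hold the geometry type code; 1 means Point.
    (geom_type,) = struct.unpack_from(endian + "I", wkb, 1)
    if geom_type != 1:
        raise ValueError(f"not a WKB Point: type {geom_type}")
    # Bytes 5-20 hold the two float64 coordinates.
    return struct.unpack_from(endian + "dd", wkb, 5)

# Round-trip a sample little-endian WKB point.
sample = struct.pack("<BIdd", 1, 1, -122.46137, 47.43)
print(decode_wkb_point(sample))  # → (-122.46137, 47.43)
```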

ibnt1 commented 10 months ago

Can you please try the updated instructions from the Sedona section (https://github.com/OvertureMaps/data#4-apache-sedona-python--spatial-sql)? Use sedona.read.format("geoparquet") to read the GeoParquet files; you should no longer need ST_GeomFromWKB(), and you can apply spatial functions directly to the geometry column. For example:

import pyspark.sql.functions as F
from sedona.spark import *

config = SedonaContext.builder().config("fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider").getOrCreate()
sedona = SedonaContext.create(config)

november = (
    sedona.read.format("geoparquet")
    .load("s3://overturemaps-us-west-2/release/2023-11-14-alpha.0/theme=places/*/")
    .withColumn("location_name", F.col("names")["common"][0]["value"])
    .select("location_name", "geometry")
    .cache()
)
november.printSchema()
november.filter("ST_Contains(ST_GeomFromWKT('POLYGON((-122.48 47.43,-122.20 47.75,-121.92 47.37,-122.48 47.43))'), geometry) = true").show(5)
root
 |-- location_name: string (nullable = true)
 |-- geometry: geometry (nullable = true)

+--------------------+--------------------+
|       location_name|            geometry|
+--------------------+--------------------+
|Core Centric Pers...|POINT (-122.46137...|
|Vashon Island Her...|POINT (-122.46331...|
|    The Brown Agency|POINT (-122.46137...|
|        Raven's Nest|POINT (-122.46028...|
|      Vashon Library|POINT (-122.45953...|
+--------------------+--------------------+
only showing top 5 rows

jwass commented 10 months ago

@safegraph-clarissa Can you let us know if this fixed the issue for you?

jwass commented 9 months ago

@ibnt1 @FengJiang2018 @jiayuasu

To what extent does the Spark metadata play a role here? I noticed that there's additional metadata written to the files that looks like it changed from October to November. Here's something that stuck out to me: you can see what changed for the geometry column between the last two releases.

import json

import pyarrow.parquet as pq
from pyarrow.fs import S3FileSystem

fs = S3FileSystem(anonymous=True, region="us-west-2")

def get_geometry_metadata(s3_path):
    ds = pq.ParquetDataset(s3_path, filesystem=fs)

    # Just grab the first file
    pqf = pq.ParquetFile(ds.files[0], filesystem=fs)
    spark_metadata = json.loads(pqf.metadata.metadata[b'org.apache.spark.sql.parquet.row.metadata'])

    geom = [f for f in spark_metadata["fields"] if f["name"] == "geometry"]
    return geom[0]

october_geom = get_geometry_metadata("overturemaps-us-west-2/release/2023-10-19-alpha.0")
november_geom = get_geometry_metadata("overturemaps-us-west-2/release/2023-11-14-alpha.0")

In [3]: october_geom
Out[3]: {'name': 'geometry', 'type': 'binary', 'nullable': True, 'metadata': {}}

In [4]: november_geom
Out[4]:
{'name': 'geometry',
 'type': {'type': 'udt',
  'class': 'org.apache.spark.sql.sedona_sql.UDT.GeometryUDT',
  'pyClass': 'sedona.sql.types.GeometryType',
  'sqlType': 'binary'},
 'nullable': True,
 'metadata': {}}

Kontinuation commented 9 months ago

Spark primarily uses org.apache.spark.sql.parquet.row.metadata to infer the schema of parquet files. It will fall back to using the native parquet schema only when org.apache.spark.sql.parquet.row.metadata is absent.

A breaking change was made to the sqlType of geometry in Sedona 1.4.0 (it changed from array[byte] to binary), which makes the Spark metadata stored in Parquet files incompatible: earlier versions of Sedona have problems reading Parquet files written by recent versions. For the GeoParquet writers in Sedona, we could change the metadata of geometry columns to use the binary type instead of the geometry UDT to avoid these compatibility issues. @jiayuasu

jiayuasu commented 9 months ago

@jwass @ibnt1 @FengJiang2018 In short, this issue has nothing to do with the Parquet or GeoParquet files released by OMF. You are on the safe side. Any Sedona 1.4.0+ user can read the data using sedona.read.format("geoparquet") OR spark.read.parquet("XXX") + ST_GeomFromWKB. Any Spark user (without Sedona installed) can read the data using spark.read.parquet("XXX").

SafeGraph (@safegraph-clarissa) has been using Sedona for a very long time. Based on my previous conversations with them, they were using an old version of Sedona (likely <= Sedona 1.2.1-incubating). Before Sedona 1.4.0, there was a minor issue when people wrote the Sedona Geometry type directly to Parquet files without using ST_AsWKB. We fixed that in Sedona 1.4.0 by introducing a breaking change.

If SafeGraph upgrades their Sedona version to 1.4.0+, the issue of reading Overture 2023-11-14-alpha.0 will be gone (both sedona.read.format("geoparquet") AND spark.read.parquet("XXX") + ST_GeomFromWKB will work). However, they might need to rewrite their old internal data using the new Sedona version. We will provide a compatibility mode in the new Sedona release to help old Sedona users migrate to the new version.

jwass commented 9 months ago

@Kontinuation @jiayuasu I think I get it now. Thank you for the explanation.

I still wonder whether Overture should remove all the Spark metadata from our published files, but that can be a separate discussion.

ilanrich commented 9 months ago

I am running into similar issues when reading in the latest release. I am running Spark 3.3 and Sedona 1.4.1.

I followed the instructions running

from sedona.spark import *

config = SedonaContext.builder().config("fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider").getOrCreate()
sedona = SedonaContext.create(config)

df = sedona.read.format("geoparquet").load("s3a://overturemaps-us-west-2/release/2023-11-14-alpha.0/theme=places/type=place")

When I run display(df) or try to write the file, I get the error: java.lang.NoSuchMethodError: org.apache.spark.sql.internal.SQLConf.parquetFilterPushDownStringStartWith()Z

Has anyone encountered the above error?

Kontinuation commented 9 months ago

> I am running into similar issues when reading in the latest release. I am running Spark 3.3 and Sedona 1.4.1.
>
> I followed the instructions running
>
> from sedona.spark import *
>
> config = SedonaContext.builder().config("fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider").getOrCreate()
> sedona = SedonaContext.create(config)
>
> df = sedona.read.format("geoparquet").load("s3a://overturemaps-us-west-2/release/2023-11-14-alpha.0/theme=places/type=place")
>
> when I run display(df) or try to write the file I am getting error: java.lang.NoSuchMethodError: org.apache.spark.sql.internal.SQLConf.parquetFilterPushDownStringStartWith()Z
>
> Has anyone encountered the above error?

I've seen this issue before. It happens when using Spark 3.3.0 on Databricks due to a binary incompatibility with Apache Spark. Spark 3.3.1 and 3.3.2 on Databricks do not have this problem. We can address this incompatibility with DBR Spark 3.3.0 in future releases.

Another thing to mention is that the GeoParquet reader of Apache Sedona does not work with Photon; it only works on DBR clusters with Photon disabled.

jiayuasu commented 7 months ago

@jwass This issue has been fixed. If OMF upgrades Sedona version to 1.5.1, then the files written by Sedona 1.5.1 can be read by any older version of Sedona.

jwass commented 7 months ago

> @jwass This issue has been fixed. If OMF upgrades Sedona version to 1.5.1, then the files written by Sedona 1.5.1 can be read by any older version of Sedona.

Thanks! I'll forward this on to see if it can be upgraded by the next release.

jwass commented 2 months ago

Closed as fixed