apache / incubator-baremaps

Create custom vector tiles from OpenStreetMap and other data sources with PostGIS and Java.
baremaps.apache.org
Apache License 2.0

Add support for Overturemap parquet files #849

Closed bchapuis closed 3 weeks ago

bchapuis commented 2 months ago

https://github.com/OvertureMaps/data

bchapuis commented 2 months ago

@sebr72 as discussed, I'm not really satisfied with my current experiment in the overturemap branch. The GeoParquet format contains semi-structured data, which requires some changes in the DataTable abstraction. It also requires a deep understanding of the GeoParquet format.

One avenue (probably the best) could be to use the parser available in Sedona (the project is written in Scala): https://sedona.apache.org/latest-snapshot/tutorial/sql/#__tabbed_9_2

Another avenue could be to build upon my throw-away overturemaps branch, but I'm not sure about the effort needed to have something robust.

In both cases, adding Parquet or Sedona will result in a lot of new dependencies (Hadoop, Spark).

bchapuis commented 2 months ago

@sebr72 There may also be a third option, which is to rely on Parquet support in PostgreSQL. I have no experience with this extension: https://github.com/adjust/parquet_fdw

sebr72 commented 1 month ago

@bchapuis I had a look at Sedona and would highlight the following:

  1. It is a large project that mainly relies on Spark or Flink (large projects themselves)
  2. The Java examples are built around Flink, which is a lot faster than Spark but is not directly linked to GeoParquet
  3. The GeoParquet implementation is in Scala and geared towards Spark
  4. Combining Spark, Sedona, and Scala with Baremaps would make for an "expensive" GeoParquet integration.

I am going to switch to looking at the Parquet reader in Drill: https://github.com/apache/drill/tree/master/exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet

bchapuis commented 1 month ago

Yes, I think the suggestion of @Drabble to look into Drill is a good idea. We can probably either use it or get inspiration from it for our own implementation.

bchapuis commented 1 month ago

@sebr72 @Drabble I will merge the current PR and organize the git history to have three separate commits with our individual contributions. For the following tasks, I suggest we make individual PRs and split the work more clearly.

bchapuis commented 1 month ago

@sebr72 @Drabble I merged the changes and we can now continue with individual PRs.

Drabble commented 1 month ago

@bchapuis Great job on the pull request! I will look at your new one for nested groups.

I would be really interested in making an example to go from Overture data on S3 to serving MVT to a Maputnik frontend.

I think this would mean:

  1. Fix the code to be able to use an S3 URL directly, e.g. s3a://overturemaps-us-west-2/release/2024-05-16-beta.0/theme=admins/type=/
  2. Use the GeoParquetDataTable to write Overture data into PostgreSQL, using a ProjectionTransformer to go from EPSG:4326 to EPSG:3857
  3. Create a geospatial index for the geometry column
  4. Create a materialised view to group the columns into a TAGS jsonb field, and maybe add simplifications for different zoom levels
  5. Make a simple style.json and tileset.json to serve the data
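
For readers unfamiliar with the EPSG:4326 to EPSG:3857 reprojection mentioned in the list, here is a minimal sketch of the spherical Web Mercator forward formulas. It is not the Baremaps ProjectionTransformer; the class name `WebMercator` and its methods are hypothetical and only illustrate the arithmetic, assuming the usual sphere radius of 6378137 m and the conventional latitude clamp near ±85.05°.

```java
/** Hypothetical helper: spherical Web Mercator forward projection (EPSG:4326 -> EPSG:3857). */
final class WebMercator {

    /** Sphere radius in metres used by EPSG:3857 (the WGS 84 semi-major axis). */
    static final double EARTH_RADIUS = 6378137.0;

    /** Latitude beyond which the projection is conventionally cut off. */
    static final double MAX_LATITUDE = 85.0511287798066;

    /** Converts a longitude in degrees to an easting in metres. */
    static double x(double lonDeg) {
        return EARTH_RADIUS * Math.toRadians(lonDeg);
    }

    /** Converts a latitude in degrees to a northing in metres, clamping near the poles. */
    static double y(double latDeg) {
        double clamped = Math.max(-MAX_LATITUDE, Math.min(MAX_LATITUDE, latDeg));
        double phi = Math.toRadians(clamped);
        return EARTH_RADIUS * Math.log(Math.tan(Math.PI / 4.0 + phi / 2.0));
    }

    public static void main(String[] args) {
        // Arbitrary sample coordinate (longitude, latitude) in degrees.
        System.out.printf("x=%.2f y=%.2f%n", x(6.63), y(46.52));
    }
}
```

In practice a geometry library would apply this to every coordinate of a geometry; the sketch only covers a single point.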

What do you think?

bchapuis commented 1 month ago

Yes, the plan sounds good and can probably be addressed with multiple PRs. Maybe we can skip step 4 or use views instead of materialized views. As the daylight distribution will soon be deprecated and replaced by overturemaps, an idea could be to copy the daylight directory and use it as a basis.
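
To make the view versus materialized view choice concrete, here is a rough sketch of the SQL such a step might generate, written as a small Java helper. Everything named here is an assumption for illustration: the class `OvertureViewSql`, the table and column names passed in, and the presence of an `id` column are hypothetical and not taken from the Baremaps code base. Identifiers are concatenated without quoting, so this is only suitable for trusted, hard-coded names.

```java
import java.util.List;
import java.util.stream.Collectors;

/** Hypothetical helper that builds PostgreSQL/PostGIS statements for an imported table. */
final class OvertureViewSql {

    /** Builds a GiST index statement for the geometry column. */
    static String spatialIndex(String table, String geomColumn) {
        return "CREATE INDEX IF NOT EXISTS " + table + "_" + geomColumn + "_idx"
            + " ON " + table + " USING GIST (" + geomColumn + ")";
    }

    /**
     * Builds a view that groups the given columns into a jsonb "tags" field.
     * Assumes the source table has an "id" column (an assumption of this sketch).
     */
    static String tagsView(String view, String table, String geomColumn,
                           List<String> tagColumns, boolean materialized) {
        // Produces pairs such as: 'names', names, 'admin_level', admin_level
        String pairs = tagColumns.stream()
            .map(column -> "'" + column + "', " + column)
            .collect(Collectors.joining(", "));
        return "CREATE " + (materialized ? "MATERIALIZED VIEW " : "VIEW ") + view
            + " AS SELECT id, jsonb_build_object(" + pairs + ") AS tags, "
            + geomColumn + " AS geom FROM " + table;
    }

    public static void main(String[] args) {
        // Table and column names below are placeholders, not the real Overture schema.
        System.out.println(spatialIndex("overture_admins", "geom"));
        System.out.println(tagsView("overture_admins_view", "overture_admins", "geom",
            List.of("names", "admin_level"), false));
    }
}
```

A plain view is evaluated on every query and always reflects the underlying table, whereas a materialized view stores its result and has to be refreshed with `REFRESH MATERIALIZED VIEW` after each import.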

Drabble commented 3 weeks ago

We have basic support for Overture Maps now. Should we consider this issue closed and open new issues for further improvements to the Overture Maps support?

fgravin commented 3 weeks ago

Congratz guys :tada: