azavea / osmesa

OSMesa is an OpenStreetMap processing stack based on GeoTrellis and Apache Spark
Apache License 2.0

Initial vector tile generation #32

Closed by jpolchlo 6 years ago

jpolchlo commented 6 years ago

This PR captures the tool's current progress toward producing a catalog of vector tiles. We have produced some vector tiles using this feature branch, but cannot guarantee correctness of the results. Larger-scale runs will be necessary to validate the output and extend what we can do here.

jpolchlo commented 6 years ago

This most recent commit adds clipping to the 3x3 spatial key neighborhood. The visualized results are much clearer. Performance isn't crazily different compared to the unclipped version, at least on a small input. I've also added a 500 millisecond timeout on the intersection process to ensure that any wacky geometries that cause problems don't gum up the works completely. These failing geometries will trigger a log message giving both the geometry and the extent we were trying to clip to (this will be the 3x3 extent).
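For readers who want a concrete picture of the clip-with-timeout idea, here is a minimal Python sketch. It is not the code in this PR (which is Scala on GeoTrellis); `neighborhood_extent`, `clip_with_timeout`, and the Web Mercator tile layout are assumptions made for illustration, and the actual clipping routine is passed in as a hypothetical `clip` callable.

```python
import concurrent.futures
import logging

log = logging.getLogger("clip")

# Shared worker pool; the size is arbitrary for this sketch.
_pool = concurrent.futures.ThreadPoolExecutor(max_workers=4)

# Half the width of the Web Mercator world in metres (assumed layout).
HALF = 20037508.342789244


def neighborhood_extent(col, row, zoom):
    """Return (xmin, ymin, xmax, ymax) covering the 3x3 block of tiles
    centred on tile (col, row), with row 0 at the top of the world.
    Extents for edge tiles are not clamped to the world bounds."""
    size = 2 * HALF / (2 ** zoom)  # width of one tile in metres
    xmin = -HALF + (col - 1) * size
    xmax = -HALF + (col + 2) * size
    ymax = HALF - (row - 1) * size
    ymin = HALF - (row + 2) * size
    return (xmin, ymin, xmax, ymax)


def clip_with_timeout(clip, geom, extent, timeout_s=0.5):
    """Run clip(geom, extent) and give up after timeout_s seconds.

    On timeout, log the geometry and the extent and return None.
    Python cannot kill the worker thread, so a pathological clip keeps
    running in the background; only the caller stops waiting for it."""
    future = _pool.submit(clip, geom, extent)
    try:
        return future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        log.warning("clip timed out after %.3fs: geometry=%r extent=%r",
                    timeout_s, geom, extent)
        return None
```

A caller would compute the extent once per spatial key and skip any geometry for which `clip_with_timeout` returns `None`.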

mojodna commented 6 years ago

Fantastic! I'm looking forward to having time to give this a shot.

jpolchlo commented 6 years ago

Added some basic pyramiding functionality. I don't recommend running this. I had some stack overflow issues during some runs on a small data set. There are only slight benefits to running this code, since the size of the data generally does not shrink as we pyramid. Would it be better to allow caching of the initial processing of the OSM results to ORC files? Or something?

mojodna commented 6 years ago

I don't think it's in this PR directly, but vectorpipe@0.2.2 hasn't been published yet, so it needs to be publishLocal'd prior to successful building of the ingest assembly.

mojodna commented 6 years ago

When producing tiles for analysis purposes (i.e. full geometries), it seems to me that generating multiple zoom levels isn't all that useful. In practice, we'd want to generate the lowest zoom for which individual tile sizes are reasonable (for generating or consuming) and screen coordinates aren't lossy (e.g. the default tileWidth (4096) is good for <zoom> + 3 @ retina resolution (512 * 2**3)). To produce output at higher zooms, over-zooming (reading subsets of individual tiles) works just as well as reading higher zoom tiles (with the precision caveat).
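The arithmetic behind the "<zoom> + 3" figure and the over-zoom lookup can be sketched in a few lines of Python. The function names are hypothetical; the only inputs taken from the comment above are the 4096-unit tile width and the 512-pixel retina display size.

```python
import math


def lossless_overzoom_levels(tile_extent=4096, display_px=512):
    """Number of zoom levels a tile can be over-zoomed before one
    display pixel becomes smaller than one tile coordinate unit."""
    return int(math.log2(tile_extent / display_px))


def overzoom_window(z, x, y, source_zoom, tile_extent=4096):
    """For a requested tile (z, x, y), return the stored tile at
    source_zoom that contains it, plus the window of that tile's
    coordinate space (x0, y0, x1, y1) that the request covers."""
    dz = z - source_zoom
    if dz < 0:
        raise ValueError("requested zoom is below the stored zoom")
    scale = 1 << dz                      # child tiles per parent tile side
    parent = (source_zoom, x >> dz, y >> dz)
    size = tile_extent / scale           # window side in tile units
    x0 = (x % scale) * size
    y0 = (y % scale) * size
    return parent, (x0, y0, x0 + size, y0 + size)
```

With the defaults, `lossless_overzoom_levels()` gives 3, matching 4096 = 512 * 2**3. Past that depth the window is narrower than the display, which is the precision caveat mentioned above.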

Generating lower zoom tiles will be necessarily lossy, as we'll want to simplify lines + polygons and selectively drop points in order to keep the tiles to a reasonable file size.

(I'm on the fence about how geometries should be clipped for analysis tiles (and can't remember off-hand how the Mapbox QA tiles are clipped).)

When generating vector tiles for cartographic purposes (i.e. w/ the OpenMapTiles or Mapzen schemas), pyramiding definitely makes sense, although the rules for simplifying + dropping features should be explicitly curated (i.e. @ zoom 10, only show points w/ place=*, lines w/ highway={motorway,primary}, buildings (building=*) larger than x m², ...). (Rules for clipping buffers too, I think).
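As an illustration of what "explicitly curated" rules could look like, here is a small Python sketch of the zoom 10 example. The feature representation (a dict with `geom_type`, `tags`, and `area_m2`) and the function names are assumptions for this sketch, not an existing schema, and the building area threshold is left as a parameter because the comment does not give a value.

```python
def make_zoom10_rules(min_building_area_m2):
    """Return the predicates a feature may satisfy to be kept at zoom 10.
    The caller supplies the building area threshold."""
    return [
        # points carrying any place=* tag
        lambda f: f["geom_type"] == "point" and "place" in f["tags"],
        # lines tagged highway=motorway or highway=primary
        lambda f: f["geom_type"] == "line"
        and f["tags"].get("highway") in {"motorway", "primary"},
        # buildings (building=*) above the area threshold
        lambda f: f["geom_type"] == "polygon"
        and "building" in f["tags"]
        and f.get("area_m2", 0.0) > min_building_area_m2,
    ]


def select_features(features, rules):
    """Keep a feature if at least one rule accepts it."""
    return [f for f in features if any(rule(f) for rule in rules)]
```

A full pyramid would hold one such rule list per zoom level, along with the simplification tolerance and clipping buffer for that zoom.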

"Would be better to allow caching of the initial processing of the OSM results to orc files?"

I think so; I've been storing the node + way geoms on S3 separately and running processing jobs against them (e.g. region stats + changeset stats).

jpolchlo commented 6 years ago

I'm going to merge these changes in a while if there are no objections.

jpolchlo commented 6 years ago

Oh, and in response to your comments, @mojodna, it appears that any pyramiding work should take place in another PR since it's surely a nontrivial feature. But I copy; over and out.

ARolek commented 6 years ago

This PR was brought to my attention to see if any of the work we have done on the tegola project could be of assistance. My understanding is that the goal is to generate vector tiles at full fidelity for analysis purposes. I agree with @mojodna that maybe generating multiple zooms is not necessary for this use case and simplification should probably be avoided. With tegola we have dealt with a lot of simplification issues which result in invalid polygons / multi polygons which are difficult to fix.

The main advantages I see to using a tiling approach for analysis are filtering features by zoom and then "querying" an area to stitch back together for comparison against another dataset of the same area. The algos for clipping, scaling, re-projection and fixing invalid polygons vary for different implementations so the same processing should be applied to both datasets. For tegola, we have a seeding tool and you can turn off simplification if that would be of any help for cross referencing some of this work.

Also worth noting is that MVT tiles have a buffer that needs to be considered when stitching tiles back together. Speaking of which, what's the strategy for tile stitching prior to comparison?