geotrellis / vectorpipe

Convert Vector data to VectorTiles with GeoTrellis.
https://geotrellis.github.io/vectorpipe/
Other
74 stars 20 forks source link

OSM => VectorTiles :: (4.5) Geometry Clipping #11

Closed fosskers closed 7 years ago

fosskers commented 7 years ago

The new toGrid method injected into RDD[OSMFeature] asks for a clip function which performs some "clipping strategy" on each Geometry. Strategies which immediately come to mind are:

  1. Clip directly on the bounding box
  2. Clip just outside the bounding box
  3. Keep the nearest Point outside the bounding box, wherever it is
  4. Custom clipping for each OSM Element type (building, etc)
  5. Don't clip

Of which (1), (2) and (5) are trivial, (4) could be adopted into any other, and (3) requires a lot of consideration.

While a user who uses VectorPipe as a tool is free to use any strategy they wish, (3) has been elected as the priority for Azavea's own use of VectorPipe. This comes with an up-front engineering cost, the particulars of which will be discussed in this issue.

Images to come. Strat 1 and 2 will be treated as the same, as their implementations (and results) are nearly identical.

fosskers commented 7 years ago

@echeipesh @lossyrob

fosskers commented 7 years ago

All five strategies reduce to identity in the case of a single Point. A Point outside of a given SpatialKey's bounding box never would have been associated with that key in the first place.

point

JTS clipping behaves as expected when the Point lies directly on the box's border or corner.

fosskers commented 7 years ago

Simple line. The left result is from applying Strat 1 (or 2, called "cut"). The right, from Strat 3 (called "Spider points"). In either case, the representation as a GeoTrellis Line or VectorTile LineString is unambiguous.

lines-01

Strat 5 should be avoided here for lines that are known to span many tiles (country borders? highways?).

fosskers commented 7 years ago

Lines which leave the tile and come back. We can imagine roads and rivers doing this all the time in the wild. Representation as either GT or VT types is still unambiguous.

lines-02

Of note is the bottom left, where two inner Points share an outer.

fosskers commented 7 years ago

Lines which pass through, but have no Points within the bounding box. These will have been associated with the given SpatialKey properly. Representation is still unambiguous.

lines-03

fosskers commented 7 years ago

Simple polygon. Here, representations finally diverge. Cut yields a Polygon, while Spider yields a Line.

poly-01

Were the polygon a building, it would still be labelled as such in its metadata. The Line here would be relegated to a buildings layer within a VT, and a user doing analysis on such a layer would have to keep in mind that there might be "liar" Lines who started as Polygons and actually represent a part of a building.

fosskers commented 7 years ago

Except in this case, we'd expect Spider to retain the whole Polygon.

poly-03

fosskers commented 7 years ago

Two Polygons, one holed, one solid, take the same shape via Spider. I'm not yet sure how much of a problem case this is, because it seems like the same assumptions made two comments up about user overhead would hold. Except now you'd have two separate LineStrings with the same metadata who would have to be considered the same by the user application.

poly-02

EDIT: The Lines here would be made part of the same VT Feature, so reassociation wouldn't be hard.

fosskers commented 7 years ago

This time, Spider's poly would be lying while thinking it was telling the truth. Its metadata would report "I'm a building!", and the poly itself would be stored in the buildings layer as a VT POLYGON. An analysis query saying "give me the surface area of all buildings" would return the wrong result.

Encoding

With VectorTiles, POLYGON exteriors are encoded with clockwise winding order. Interiors are encoded with counterclockwise order. Unfortunately, in the actual encoding it's illegal to have a "stray" interior that wasn't preceded by an exterior, so we couldn't cheat and leave the original interior here "as is".

This opens the question: For any given Polygon produced by Spider, how do you know whether it was an original Exterior or Interior? I am extremely hesitant to abuse VT metadata to make this distinction. I'm getting the feeling that we should almost never clip Polygons, except large bodies of water and country borders.

poly-04

fosskers commented 7 years ago

Recall: type OSMFeature = Feature[Geometry, Tree[ElementData]]

It seems that the type signature for any clipping function should actually be:

clip: (SpatialKey, OSMFeature) => (SpatialKey, Seq[Geometry], Tree[ElementData])

since a clip may (as seen above) produce multiple child Geoms that we'd wish to store in the same VT Feature. This would preserve the metadata across the split geoms, reducing redundancy. Note that the actual types of the Seq[Geometry] would be:

and not their Multi* counterparts, as we cannot represent those in VTs.

fosskers commented 7 years ago

Given that every Geom coming from OSM has metadata, we'll have to do one VT Feature per Geom (or clipped geom collection). This is good, as it separates the metadata nicely. So for instance, a VT with buildings layer would have a Feature for each building polygon or clipped set of linestrings, etc.

mojodna commented 7 years ago

Given that every Geom coming from OSM has metadata, we'll have to do one VT Feature per Geom (or clipped geom collection). This is good, as it separates the metadata nicely.

In addition to the metadata from the source element, it's probably worth marking what OSM data type it was generated from (in part because IDs are only unique per type, not globally). (Thinking about the ramifications of relations in particular here, with the assumption that they're mapped to Multi* types.)

fosskers commented 7 years ago

The advantage of Spider is that "new data" isn't created. Every Point in every Geometry originated from some OSM Node. Consider the following output from Spider (the top result):

poly-05

Analysis Query: Give me the number of buildings. Easy, just count the number of Features in the buildings layer, regardless of Geom type. This is O(1).

Analysis Query: Give me the surface area of all buildings. Options:

  1. Somehow reconstruct the closest approximation to the original building, as Cut would have done (bottom result). This may be hard, if the original relationship of the lines is unknown, as in a few comments up.
  2. Attempt to reconstruct the original Geometry via stitching with other Features from neighbouring Tiles. This feels like an O(n^2) problem, as it's unknown ahead of time how complex the rest of the Polygon is outside this current bounding box, and into how many Tiles it stretches.

Both of these feel unsatisfactory, making me further believes that buildings shouldn't be clipped.

fosskers commented 7 years ago

it's probably worth marking what OSM data type it was generated from

That's reasonable, yeah. Especially given that the following Way would also produce the same output via Spider:

poly-06

fosskers commented 7 years ago

Note to self: VTs pulled from Mapbox are all gzipped.

fosskers commented 7 years ago

Mapbox clips their polys in a 3x3 extent area around the current tile.

mapbox-tiles

The labels follow Slippy Map format, so these are from zoom level 16.

fosskers commented 7 years ago

For most Polygons, clipping in a 3x3-extent square around the current Tile will suffice to capture the entire Poly. Both Tilezen and Mapbox do this for (at least) their display tiles.

The corner case that remains the thorn in our side is the "holed island" situation, where a giant multipolygon spanning many many tiles (Lake Superior, etc) has an island inside of it, and is represented by OSM as an inner ring. There is no way to encode a "naked" inner ring into a VectorTile, so we chose to give holes like this a shell, following the same 3x3 clipping rule as before. The intention was then to design an algorithm that could know which "fake" Points to remove when restitching, before performing some analytics. Consider this island, found in Pelican Lake in Manitoba:

tiny-island

At zoom level 16, the north edge of the lake is 41 tiles to the north of the Feature shown here.

The issue here is that unless you have all the data in scope, there isn't a way to completely reconstruct the largest polygons without leaving in some "fake" Points. In general, the task "I have a Tile with a clipped Polygon. what other tiles do I need to query to be able to fully reconstruct it?" should be considered "hard".

But! What if every VT Feature also stored its full original bounding envelope? Then you should easily be able to get a KeyBounds for it, and fetch all the Tiles within that grid. Then Polygon reconstruction would be a bit easier.

lossyrob commented 7 years ago

What if every VT Feature also stored its full original bounding envelope?

Sounds reasonable to me.

fosskers commented 7 years ago

Should be easy enough to do with VT Feature metadata. xmin, xmax, etc. Small to store, since it's only four extra points.

fosskers commented 7 years ago

We desire a function: toNearestPoint(extent: Extent, line: Line): MultiLine. I implemented it three different ways, to both see which approach was best in general, and to prod at Scala's capabilities. The three implementations were:

Some benchmarks:

 Running benchmarks for toNearestPoint - ALL IN - two-point line...
 Name                                    us  linear runtime
 Tail Recursion                        2.74  ====
 foldLeftM                            11.44  =================
 Java Style                            5.20  =======

 Running benchmarks for toNearestPoint - ALL IN - short line...
 Name                                    us  linear runtime
 Tail Recursion                        6.94  ====
 foldLeftM                            28.90  ====================
 Java Style                            7.98  =====

 Running benchmarks for toNearestPoint - ALL IN - medium line...
 Name                                    us  linear runtime
 Tail Recursion                       85.07  ==========
 foldLeftM                           152.91  ==================
 Java Style                           33.31  ====

 Running benchmarks for toNearestPoint - MOST OUT - short line...
 Name                                    us  linear runtime
 Tail Recursion                        9.22  ====
 foldLeftM                            39.66  =================
 Java Style                            7.07  ===

 Running benchmarks for toNearestPoint - MOST OUT - medium line...
 Name                                    us  linear runtime
 Tail Recursion                       89.45  ============
 foldLeftM                           136.81  ===================
 Java Style                           22.07  ===

We see TailRec generally outperform Java, until the 200-Point case, where most of the points are outside of the Extent. It's unclear exactly why TailRec gets so bad.

If we knew ahead of time what the average OSM Way length was, we might be able to make a more nuanced decision here about TailRec/Java. Clearly foldLeftM is out of the question. Unfortunately the Foldable typeclass from Cats doesn't expose foldLeftM yet (it's in cats-free), so it's unclear if their solution would perform better.

fosskers commented 7 years ago

None of the above three implementations yet account for the Corner Case of Doom™:

lines-04

Since the line intersects this Tile, it would be associated to this Tile's SpatialKey by the gridding algorithm. The question is, how should we check for this case?

One method would be:

  1. Detected two consecutive Points (last and curr) which lie outside the Extent
  2. Form Line(last, curr)
  3. Does that Line intersect the Extent? If so, save last to the current Point buffer and continue.

This does what's expected for this immediate case plus a number of similar sister edge cases. I've implemented this for the Java variant of toNearestPoint (since it was the fastest). The test suite now passes, but the benchmark results aren't promising:

 Running benchmarks for toNearestPoint - ALL IN - two-point line...
 Name                                    us  linear runtime
 Java Style                            5.58  ======
 Java Style - Robust                   5.49  ======

 Running benchmarks for toNearestPoint - ALL IN - short line...
 Name                                    us  linear runtime
 Java Style                           10.07  =====
 Java Style - Robust                   8.79  ====  <-- likely a fluke

 Running benchmarks for toNearestPoint - ALL IN - medium line...
 Name                                    us  linear runtime
 Java Style                           48.41  ======
 Java Style - Robust                  45.40  ======

 Running benchmarks for toNearestPoint - MOST OUT - short line...
 Name                                    us  linear runtime
 Java Style                            9.92  =
 Java Style - Robust                  29.74  =====

 Running benchmarks for toNearestPoint - MOST OUT - medium line...
 Name                                    us  linear runtime
 Java Style                           19.50  =
 Java Style - Robust                 270.75  =======================

I claim it's a very common case for most of the points of a Way to be outside the current Extent. Therefore the MOST OUT timings should be paid attention to most. Unless we can think of another way to check for this edge case, I recommend we avoid this check altogether, and don't attempt to save "pass through" line segments like this.

lines-06

Here, the Line could fly off the north west "forever", and we'd be wasting processing time checking each pair of Points. Is there some heuristic we could apply to cut out sections of the Line that obviously won't have candidate Points?

lossyrob commented 7 years ago

I think there's a way to check the intersection of the segment rather than build a Line and do an intersection operation. Remember, there's a lot of overhead for creating geometries and the Scala wrappers we have in geotrellis.vector, so whenever that can be avoided there is a huge win. So my guess is if we profiled most of the time would be spent in

  1. Form Line(last, curr)

So how do we form a quick segment intersection here, in the case both points lie outside? I'm imagining some horrific if/else block that determines which of the 9 sections the last point lives in:

screen shot 2017-03-09 at 7 52 38 pm

it starts with,

if(`last point is in 1`) {
   if(`curr point is in 7, 5, or 8`) { `it cuts` }
} else if(`last point is in 2`) {
  if(`curr poit is in 4, 5, 6, 7, or 8`) { `it cuts` }
} else if { `...` }

all done by checking x and y of the points with xmin, ymin, xmax, and ymax of the extent. Not a fun method to write, but I image will improve the performance you're seeing.

fosskers commented 7 years ago

Thoughts on drilling down to the JTS level and testing intersects there instead?

lossyrob commented 7 years ago

Intersects should be on our Extent, which avoids the JTS polygon intersects for quick method: https://github.com/locationtech/geotrellis/blob/master/vector/src/main/scala/geotrellis/vector/Extent.scala#L132

fosskers commented 7 years ago

For the point-based intersects I am doing that already, luckily. I didn't know it avoids JTS there, so that's a bonus.

fosskers commented 7 years ago

I got around the problem with this: https://github.com/geotrellis/vectorpipe/blob/c5e44182a5722f5e957fd0060d87c26f6fc98576/src/main/scala/vectorpipe/geom/Clip.scala#L49

There was only a minor performance hit:

 Running benchmarks for toNearestPoint - MOST OUT - medium line...
 Name                                    us  linear runtime
 Java Style                           21.97  =
 Java Style - Robust                 240.92  ===================
 Java Style - Robuster                25.67  ==