Closed fosskers closed 7 years ago
@echeipesh @lossyrob
All five strategies reduce to identity
in the case of a single Point
. A Point
outside of a given SpatialKey
's bounding box never would have been associated with that key in the first place.
JTS clipping behaves as expected when the Point
lies directly on the box's border or corner.
Simple line. The left result is from applying Strat 1 (or 2, called "cut"). The right, from Strat 3 (called "Spider points"). In either case, the representation as a GeoTrellis Line
or VectorTile LineString
is unambiguous.
Strat 5 should be avoided here for lines that are known to span many tiles (country borders? highways?).
Lines which leave the tile and come back. We can imagine roads and rivers doing this all the time in the wild. Representation as either GT or VT types is still unambiguous.
Of note is the bottom left, where two inner Point
s share an outer.
Lines which pass through, but have no Point
s within the bounding box. These will have been associated with the given SpatialKey
properly. Representation is still unambiguous.
Simple polygon. Here, representations finally diverge. Cut yields a Polygon
, while Spider yields a Line
.
Were the polygon a building, it would still be labelled as such in its metadata. The Line
here would be relegated to a buildings
layer within a VT, and a user doing analysis on such a layer would have to keep in mind that there might be "liar" Line
s who started as Polygon
s and actually represent a part of a building.
Except in this case, we'd expect Spider to retain the whole Polygon
.
Two Polygon
s, one holed, one solid, take the same shape via Spider. I'm not yet sure how much of a problem case this is, because it seems like the same assumptions made two comments up about user overhead would hold. Except now you'd have two separate LineStrings with the same metadata who would have to be considered the same by the user application.
EDIT: The Line
s here would be made part of the same VT Feature
, so reassociation wouldn't be hard.
This time, Spider's poly would be lying while thinking it was telling the truth. Its metadata would report "I'm a building!", and the poly itself would be stored in the buildings
layer as a VT POLYGON
. An analysis query saying "give me the surface area of all buildings" would return the wrong result.
With VectorTiles, POLYGON
exteriors are encoded with clockwise winding order. Interiors are encoded with counterclockwise order. Unfortunately, in the actual encoding it's illegal to have a "stray" interior that wasn't preceded by an exterior, so we couldn't cheat and leave the original interior here "as is".
This opens the question: For any given Polygon produced by Spider, how do you know whether it was an original Exterior or Interior? I am extremely hesitant to abuse VT metadata to make this distinction. I'm getting the feeling that we should almost never clip Polygons, except large bodies of water and country borders.
Recall: type OSMFeature = Feature[Geometry, Tree[ElementData]]
It seems that the type signature for any clipping function should actually be:
clip: (SpatialKey, OSMFeature) => (SpatialKey, Seq[Geometry], Tree[ElementData])
since a clip may (as seen above) produce multiple child Geoms that we'd wish to store in the same VT Feature
. This would preserve the metadata across the split geoms, reducing redundancy. Note that the actual types of the Seq[Geometry]
would be:
Seq[Point]
(length 1)Seq[Line]
Seq[Polygon]
and not their Multi*
counterparts, as we cannot represent those in VTs.
Given that every Geom coming from OSM has metadata, we'll have to do one VT Feature per Geom (or clipped geom collection). This is good, as it separates the metadata nicely. So for instance, a VT with buildings
layer would have a Feature for each building polygon or clipped set of linestrings, etc.
Given that every Geom coming from OSM has metadata, we'll have to do one VT Feature per Geom (or clipped geom collection). This is good, as it separates the metadata nicely.
In addition to the metadata from the source element, it's probably worth marking what OSM data type it was generated from (in part because IDs are only unique per type, not globally). (Thinking about the ramifications of relations in particular here, with the assumption that they're mapped to Multi*
types.)
The advantage of Spider is that "new data" isn't created. Every Point
in every Geometry originated from some OSM Node
. Consider the following output from Spider (the top result):
Analysis Query: Give me the number of buildings. Easy, just count the number of Feature
s in the buildings
layer, regardless of Geom type. This is O(1)
.
Analysis Query: Give me the surface area of all buildings. Options:
Geometry
via stitching with other Features from neighbouring Tiles. This feels like an O(n^2)
problem, as it's unknown ahead of time how complex the rest of the Polygon is outside this current bounding box, and into how many Tiles it stretches.Both of these feel unsatisfactory, making me further believes that buildings shouldn't be clipped.
it's probably worth marking what OSM data type it was generated from
That's reasonable, yeah. Especially given that the following Way
would also produce the same output via Spider:
Note to self: VTs pulled from Mapbox are all gzipped.
Mapbox clips their polys in a 3x3 extent area around the current tile.
The labels follow Slippy Map format, so these are from zoom level 16.
For most Polygons, clipping in a 3x3-extent square around the current Tile will suffice to capture the entire Poly. Both Tilezen and Mapbox do this for (at least) their display tiles.
The corner case that remains the thorn in our side is the "holed island"
situation, where a giant multipolygon spanning many many tiles (Lake
Superior, etc) has an island inside of it, and is represented by OSM as an
inner ring. There is no way to encode a "naked" inner ring into a
VectorTile, so we chose to give holes like this a shell, following the same
3x3 clipping rule as before. The intention was then to design an algorithm
that could know which "fake" Point
s to remove when restitching, before
performing some analytics. Consider this island, found in Pelican Lake in
Manitoba:
At zoom level 16, the north edge of the lake is 41 tiles to the north of the Feature shown here.
The issue here is that unless you have all the data in scope, there isn't a way to completely reconstruct the largest polygons without leaving in some "fake" Points. In general, the task "I have a Tile with a clipped Polygon. what other tiles do I need to query to be able to fully reconstruct it?" should be considered "hard".
But! What if every VT Feature also stored its full original bounding envelope?
Then you should easily be able to get a KeyBounds
for it, and fetch all
the Tiles within that grid. Then Polygon reconstruction would be a bit
easier.
What if every VT Feature also stored its full original bounding envelope?
Sounds reasonable to me.
Should be easy enough to do with VT Feature metadata. xmin, xmax
, etc. Small to store, since it's only four extra points.
We desire a function: toNearestPoint(extent: Extent, line: Line): MultiLine
. I implemented it three different ways, to both see which approach was best in general, and to prod at Scala's capabilities. The three implementations were:
@tailrec
, collect accumulated Points/Lines w/ immutable collections.foldLeftM
from ScalaZ: the "right" (and shortest) way to do it. We fold over the Array
of Point
s while storing the accumulated results via a State
Monad.ListBuffer
s.Some benchmarks:
Running benchmarks for toNearestPoint - ALL IN - two-point line...
Name us linear runtime
Tail Recursion 2.74 ====
foldLeftM 11.44 =================
Java Style 5.20 =======
Running benchmarks for toNearestPoint - ALL IN - short line...
Name us linear runtime
Tail Recursion 6.94 ====
foldLeftM 28.90 ====================
Java Style 7.98 =====
Running benchmarks for toNearestPoint - ALL IN - medium line...
Name us linear runtime
Tail Recursion 85.07 ==========
foldLeftM 152.91 ==================
Java Style 33.31 ====
Running benchmarks for toNearestPoint - MOST OUT - short line...
Name us linear runtime
Tail Recursion 9.22 ====
foldLeftM 39.66 =================
Java Style 7.07 ===
Running benchmarks for toNearestPoint - MOST OUT - medium line...
Name us linear runtime
Tail Recursion 89.45 ============
foldLeftM 136.81 ===================
Java Style 22.07 ===
We see TailRec generally outperform Java, until the 200-Point case, where most of the points are outside of the Extent. It's unclear exactly why TailRec gets so bad.
If we knew ahead of time what the average OSM Way
length was, we might be able to make a more nuanced decision here about TailRec/Java. Clearly foldLeftM
is out of the question. Unfortunately the Foldable
typeclass from Cats
doesn't expose foldLeftM
yet (it's in cats-free
), so it's unclear if their solution would perform better.
None of the above three implementations yet account for the Corner Case of Doom™:
Since the line intersects this Tile, it would be associated to this Tile's SpatialKey
by the gridding algorithm. The question is, how should we check for this case?
One method would be:
last
and curr
) which lie outside the ExtentLine(last, curr)
last
to the current Point buffer and continue.This does what's expected for this immediate case plus a number of similar sister edge cases. I've implemented this for the Java variant of toNearestPoint
(since it was the fastest). The test suite now passes, but the benchmark results aren't promising:
Running benchmarks for toNearestPoint - ALL IN - two-point line...
Name us linear runtime
Java Style 5.58 ======
Java Style - Robust 5.49 ======
Running benchmarks for toNearestPoint - ALL IN - short line...
Name us linear runtime
Java Style 10.07 =====
Java Style - Robust 8.79 ==== <-- likely a fluke
Running benchmarks for toNearestPoint - ALL IN - medium line...
Name us linear runtime
Java Style 48.41 ======
Java Style - Robust 45.40 ======
Running benchmarks for toNearestPoint - MOST OUT - short line...
Name us linear runtime
Java Style 9.92 =
Java Style - Robust 29.74 =====
Running benchmarks for toNearestPoint - MOST OUT - medium line...
Name us linear runtime
Java Style 19.50 =
Java Style - Robust 270.75 =======================
I claim it's a very common case for most of the points of a Way
to be outside the current Extent. Therefore the MOST OUT
timings should be paid attention to most. Unless we can think of another way to check for this edge case, I recommend we avoid this check altogether, and don't attempt to save "pass through" line segments like this.
Here, the Line could fly off the north west "forever", and we'd be wasting processing time checking each pair of Points. Is there some heuristic we could apply to cut out sections of the Line that obviously won't have candidate Points?
I think there's a way to check the intersection of the segment rather than build a Line
and do an intersection operation. Remember, there's a lot of overhead for creating geometries and the Scala wrappers we have in geotrellis.vector, so whenever that can be avoided there is a huge win. So my guess is if we profiled most of the time would be spent in
- Form Line(last, curr)
So how do we form a quick segment intersection here, in the case both points lie outside? I'm imagining some horrific if/else block that determines which of the 9 sections the last
point lives in:
it starts with,
if(`last point is in 1`) {
if(`curr point is in 7, 5, or 8`) { `it cuts` }
} else if(`last point is in 2`) {
if(`curr poit is in 4, 5, 6, 7, or 8`) { `it cuts` }
} else if { `...` }
all done by checking x and y of the points with xmin, ymin, xmax, and ymax of the extent. Not a fun method to write, but I image will improve the performance you're seeing.
Thoughts on drilling down to the JTS level and testing intersects
there instead?
Intersects should be on our Extent
, which avoids the JTS polygon intersects for quick method: https://github.com/locationtech/geotrellis/blob/master/vector/src/main/scala/geotrellis/vector/Extent.scala#L132
For the point-based intersects I am doing that already, luckily. I didn't know it avoids JTS there, so that's a bonus.
I got around the problem with this: https://github.com/geotrellis/vectorpipe/blob/c5e44182a5722f5e957fd0060d87c26f6fc98576/src/main/scala/vectorpipe/geom/Clip.scala#L49
There was only a minor performance hit:
Running benchmarks for toNearestPoint - MOST OUT - medium line...
Name us linear runtime
Java Style 21.97 =
Java Style - Robust 240.92 ===================
Java Style - Robuster 25.67 ==
The new
toGrid
method injected intoRDD[OSMFeature]
asks for aclip
function which performs some "clipping strategy" on eachGeometry
. Strategies which immediately come to mind are:Of which (1), (2) and (5) are trivial, (4) could be adopted into any other, and (3) requires a lot of consideration.
While a user who uses VectorPipe as a tool is free to use any strategy they wish, (3) has been elected as the priority for Azavea's own use of VectorPipe. This comes with an up-front engineering cost, the particulars of which will be discussed in this issue.
Images to come. Strat 1 and 2 will be treated as the same, as their implementations (and results) are nearly identical.