WikiWatershed / mmw-geoprocessing

A Spark Job Server job for Model My Watershed geoprocessing.
Apache License 2.0
6 stars 6 forks source link

Fetch tiles intersecting shape, not bounding box #44

Closed rajadain closed 7 years ago

rajadain commented 7 years ago

Overview

Previously we would fetch all tiles from S3 that intersected with the bounding box of the incoming shape. This was wasteful, especially in cases where the shape was a long diagonal.

Now, by cropping to the multiPolygon resulting from unioning all incoming shapes, rather than calculating an envelope, we ensure that only relevant tiles are fetched.

Since fetching from S3 is a sizable portion of our latency, this change improves performance 10-20%.

Connects #41

Demo

Here's a table of runtimes in seconds with nlcd-soils-request-huc8.json which is a Soils + NLCD RasterGroupedCount operation (used for TR-55) with the Schuylkill HUC-08, and nlcd-soils-request.json which is the same request but with the Little Neshaminy HUC-12 instead:

nlcd-soils-request-huc8 develop tt/waste-less-fetch-less
1 40.34 36.52
2 33.26 25.58
3 28.28 20.19
4 30.39 21.08
5 33.2 24.36

image

nlcd-soils-request develop tt/waste-less-fetch-less
1 5.22 3.77
2 4.45 4.1
3 4.05 3.58
4 4.26 3.38
5 4.72 3.59

image

Testing Instructions

Tagging @echeipesh for quick code review.

kellyi commented 7 years ago

Looks like I'm getting a few slightly different values for the huc8 json:

diff --git a/nlcd-soils-response-huc8.json b/nlcd-soils-response-huc8-new.json
index 70f8bc4..3ab2fde 100644
--- a/nlcd-soils-response-huc8.json
+++ b/nlcd-soils-response-huc8-new.json
@@ -101,7 +101,7 @@
     "List(11, 2, 28)": 166,
     "List(95, 7, 40)": 181,
     "List(23, 6, 41)": 663,
-    "List(23, 7, -2147483648)": 362,
+    "List(23, 7, -2147483648)": 345,
     "List(22, 4, 40)": 3155,
     "List(52, 2, 41)": 33832,
     "List(11, 6, -2147483648)": 59,
@@ -125,7 +125,7 @@
     "List(95, 2, 34)": 1,
     "List(43, 3, 2)": 33,
     "List(52, 6, 38)": 148,
-    "List(22, -2147483648, -2147483648)": 29987,
+    "List(22, -2147483648, -2147483648)": 29627,
     "List(21, -2147483648, 28)": 2752,
     "List(22, -2147483648, 34)": 543,
     "List(52, 2, 38)": 3976,
@@ -145,7 +145,6 @@
     "List(21, 2, 45)": 320,
     "List(52, 4, 2)": 5,
     "List(21, 6, -2147483648)": 639,
-    "List(81, -2147483648, 16)": 12,
     "List(82, 2, -2147483648)": 6291,
     "List(43, 4, 38)": 383,
     "List(52, -2147483648, 41)": 4454,
@@ -195,7 +194,7 @@
     "List(95, 7, 16)": 4,
     "List(82, 1, 41)": 38098,
     "List(23, 6, 28)": 45,
-    "List(81, -2147483648, -2147483648)": 1012,
+    "List(81, -2147483648, -2147483648)": 962,
     "List(52, 7, 30)": 2,
     "List(71, 6, 41)": 150,
     "List(24, 6, 41)": 158,
@@ -221,8 +220,8 @@
     "List(90, 4, 40)": 2509,
     "List(41, 7, 44)": 20,
     "List(95, -2147483648, 40)": 137,
-    "List(24, 7, -2147483648)": 61,
-    "List(24, -2147483648, -2147483648)": 42048,
+    "List(24, 7, -2147483648)": 56,
+    "List(24, -2147483648, -2147483648)": 41672,
     "List(82, 6, 28)": 33,
     "List(81, 2, 40)": 185423,
     "List(11, 3, 2)": 10,
@@ -242,8 +241,8 @@
     "List(82, 3, 44)": 1,
     "List(11, 6, 28)": 94,
     "List(95, 6, 40)": 503,
-    "List(21, -2147483648, 16)": 167,
-    "List(23, -2147483648, 16)": 385,
+    "List(21, -2147483648, 16)": 141,
+    "List(23, -2147483648, 16)": 298,
     "List(11, 2, 38)": 416,
     "List(41, -2147483648, 2)": 98,
     "List(82, 2, 34)": 3933,
@@ -255,9 +254,9 @@
     "List(11, -2147483648, 16)": 72,
     "List(42, 2, 28)": 1217,
     "List(23, 1, 2)": 5,
-    "List(41, -2147483648, -2147483648)": 7270,
+    "List(41, -2147483648, -2147483648)": 7249,
     "List(0, -2147483648, -2147483648)": 21,
-    "List(90, -2147483648, 41)": 3873,
+    "List(90, -2147483648, 41)": 3872,
     "List(42, 1, 41)": 3859,
     "List(11, 6, 2)": 5,
     "List(23, 2, 38)": 5058,
@@ -285,12 +284,12 @@
     "List(43, 2, 41)": 8036,
     "List(42, 3, 34)": 3,
     "List(42, 3, 38)": 158,
-    "List(21, 7, 41)": 30994,
+    "List(21, 7, 41)": 30985,
     "List(21, 2, 44)": 3,
     "List(23, -2147483648, 30)": 236,
     "List(41, 2, 38)": 25936,
     "List(11, 4, 28)": 208,
-    "List(22, 7, 41)": 7220,
+    "List(22, 7, 41)": 7187,
     "List(95, 7, 38)": 29,
     "List(24, 3, 38)": 325,
     "List(82, -2147483648, 38)": 1083,
@@ -385,7 +384,7 @@
     "List(0, 6, 41)": 3,
     "List(71, 3, 38)": 230,
     "List(90, 6, 34)": 2,
-    "List(21, 7, -2147483648)": 1543,
+    "List(21, 7, -2147483648)": 1530,
     "List(82, 3, 28)": 373,
     "List(95, 2, 40)": 234,
     "List(71, 7, 28)": 51,
@@ -421,8 +420,8 @@
     "List(21, 2, 41)": 123837,
     "List(21, 2, 34)": 791,
     "List(11, 6, 38)": 20,
-    "List(82, -2147483648, -2147483648)": 719,
-    "List(22, -2147483648, 41)": 70960,
+    "List(82, -2147483648, -2147483648)": 717,
+    "List(22, -2147483648, 41)": 70947,
     "List(82, 2, 30)": 4,
     "List(90, 2, 34)": 31,
     "List(90, 2, 38)": 1076,
@@ -445,7 +444,7 @@
     "List(81, 1, 2)": 87,
     "List(42, 3, 40)": 807,
     "List(42, 6, 40)": 693,
-    "List(31, 7, -2147483648)": 16,
+    "List(31, 7, -2147483648)": 8,
     "List(90, 1, 28)": 19,
     "List(11, 1, 28)": 1362,
     "List(22, 1, 44)": 19,
@@ -475,7 +474,6 @@
     "List(22, 2, 38)": 10925,
     "List(21, 2, 2)": 35,
     "List(22, 4, 28)": 1111,
-    "List(71, -2147483648, 16)": 56,
     "List(52, 6, 40)": 576,
     "List(52, 1, 2)": 16,
     "List(90, 1, 44)": 6,
@@ -483,7 +481,7 @@
     "List(22, 1, 28)": 2997,
     "List(21, 4, 30)": 5,
     "List(81, 2, 30)": 5,
-    "List(23, 7, 41)": 1685,
+    "List(23, 7, 41)": 1681,
     "List(11, 1, -2147483648)": 644,
     "List(23, 6, 30)": 2,
     "List(90, -2147483648, 30)": 5,
@@ -492,7 +490,7 @@
     "List(31, 4, 41)": 735,
     "List(21, 7, 28)": 917,
     "List(52, -2147483648, 45)": 5,
-    "List(71, -2147483648, -2147483648)": 239,
+    "List(71, -2147483648, -2147483648)": 183,
     "List(24, 2, 30)": 1,
     "List(21, 1, 28)": 11665,
     "List(71, 4, 40)": 378,
@@ -518,8 +516,7 @@
     "List(82, -2147483648, 41)": 3902,
     "List(23, -2147483648, 2)": 1,
     "List(41, 1, 44)": 477,
-    "List(52, -2147483648, 16)": 12,
-    "List(52, -2147483648, -2147483648)": 509,
+    "List(52, -2147483648, -2147483648)": 494,
     "List(43, 2, -2147483648)": 559,
     "List(82, 4, -2147483648)": 1147,
     "List(42, -2147483648, 34)": 9,
@@ -599,7 +596,7 @@
     "List(42, 7, 2)": 36,
     "List(81, 3, 44)": 10,
     "List(90, -2147483648, 45)": 13,
-    "List(22, -2147483648, 16)": 255,
+    "List(22, -2147483648, 16)": 206,
     "List(21, 1, 38)": 4884,
     "List(11, 4, -2147483648)": 109,
     "List(42, 7, 38)": 64,
@@ -647,7 +644,7 @@
     "List(22, 1, 45)": 2,
     "List(21, 2, 40)": 73631,
     "List(82, 3, -2147483648)": 1234,
-    "List(21, -2147483648, -2147483648)": 24285,
+    "List(21, -2147483648, -2147483648)": 24079,
     "List(0, -2147483648, 38)": 114,
     "List(23, 4, 28)": 451,
     "List(24, 6, 28)": 4,
@@ -661,7 +658,7 @@
     "List(42, 2, 34)": 13,
     "List(21, 3, 28)": 1371,
     "List(81, 1, 40)": 62844,
-    "List(23, -2147483648, 41)": 39571,
+    "List(23, -2147483648, 41)": 39561,
     "List(24, -2147483648, 38)": 24365,
     "List(0, -2147483648, 45)": 1,
     "List(31, 1, 28)": 8346,
@@ -715,7 +712,7 @@
     "List(52, 7, 28)": 74,
     "List(90, 7, 44)": 10,
     "List(95, 2, -2147483648)": 29,
-    "List(22, 7, -2147483648)": 538,
+    "List(22, 7, -2147483648)": 494,
     "List(43, 7, 28)": 1074,
     "List(43, -2147483648, 41)": 1121,
     "List(81, 1, 44)": 97,
@@ -744,7 +741,7 @@
     "List(81, 2, 2)": 122,
     "List(21, -2147483648, 45)": 304,
     "List(22, 2, 40)": 21162,
-    "List(21, -2147483648, 41)": 72830,
+    "List(21, -2147483648, 41)": 72825,
     "List(90, -2147483648, 16)": 33,
     "List(21, 3, 45)": 40,
     "List(21, 7, 16)": 48,
@@ -761,7 +758,7 @@
     "List(82, 4, 28)": 274,
     "List(41, -2147483648, 24)": 2,
     "List(81, 6, 41)": 12230,
-    "List(90, -2147483648, -2147483648)": 1160,
+    "List(90, -2147483648, -2147483648)": 1070,
     "List(41, 7, 40)": 35307,
     "List(42, 3, -2147483648)": 233,
     "List(90, 3, 44)": 12,
@@ -804,7 +801,7 @@
     "List(41, 4, 45)": 6,
     "List(11, 1, 38)": 18,
     "List(21, 6, 45)": 13,
-    "List(23, -2147483648, -2147483648)": 45630,
+    "List(23, -2147483648, -2147483648)": 45106,
     "List(81, 7, -2147483648)": 558,
     "List(71, -2147483648, 2)": 8,
     "List(41, 2, 28)": 18416,
@@ -812,7 +809,7 @@
     "List(71, 1, 2)": 70,
     "List(23, 2, 34)": 261,
     "List(11, 6, 41)": 945,
-    "List(24, -2147483648, 16)": 183,
+    "List(24, -2147483648, 16)": 141,
     "List(42, -2147483648, 28)": 395,
     "List(22, 2, 45)": 96,
     "List(31, -2147483648, 34)": 49,
kellyi commented 7 years ago

The responses from the other JSON were the same, btw.

rajadain commented 7 years ago

Hmm, okay, taking a look.

kellyi commented 7 years ago

Try interacting with the main app. Notice if long, thin, diagonal shapes take less time than staging

Also having some trouble getting this to work in the main app. I put the new geoprocessing service in place, but the app now returns 500s for the analysis endpoints:

screen shot 2017-05-23 at 1 54 39 pm
rajadain commented 7 years ago

I split the request into three individual RasterGroupedCount calls, each requesting one layer, and found the values of NLCD and Soil Texture to be identical, but of Soil Group different between the two:

nlcd-request-huc8.json.txt nlcd-response-huc8.json.txt nlcd-response-huc8-new.json.txt

soil-request-huc8.json.txt soil-response-huc8.json.txt soil-response-huc8-new.json.txt

soil-texture-request-huc8.json.txt soil-texture-response-huc8.json.txt soil-texture-response-huc8-new.json.txt

$ diff -u nlcd-response-huc8.json nlcd-response-huc8-new.json
$ diff -u soil-response-huc8.json soil-response-huc8-new.json

--- soil-response-huc8.json 2017-05-23 13:44:53.000000000 -0400
+++ soil-response-huc8-new.json 2017-05-23 13:47:09.000000000 -0400
@@ -1,11 +1,11 @@
 {
   "result": {
     "List(2)": 1900300,
-    "List(-2147483648)": 834465,
+    "List(-2147483648)": 832452,
     "List(4)": 654295,
     "List(3)": 490209,
     "List(6)": 120754,
-    "List(7)": 348163,
+    "List(7)": 348030,
     "List(1)": 1154272
   }
 }
\ No newline at end of file
$ diff -u soil-texture-response-huc8.json soil-texture-response-huc8-new.json

Investigating why this is.

kellyi commented 7 years ago

Also having some trouble getting this to work in the main app. I put the new geoprocessing service in place, but the app now returns 500s for the analysis endpoints:

This turned out to be an error on my local: neglected to rebundle after reviewing a PR opened prior to merging some updates to how the client & API handle AOIs.

rajadain commented 7 years ago

Okay after doing some more tests, I'm finding that there are consistent differences across all layers (I may have made a mistake with the above results). Comparing with values on staging, I find that there's around ~0.1% difference in the values, with values on this branch being less than the previous values. See calculations here: https://codepen.io/anon/pen/zweBPG?editors=0012:

Average Count Diff Percent: 0.0872089680750769
Average Area Diff Percent: 0.09302289928008084
Average Coverage Diff Percent: 0.09542434609139239

@echeipesh Do you have any idea why this is happening?

@mmcfarland & @ajrobbins Would a 0.1% difference in results be worth the 10-20% improvement in performance?

mmcfarland commented 7 years ago

Area and Coverage are already a little murky to begin with given that we're doing fuzzy logic on latitude and pixel size to come up with that, I think it's well within the tolerance range. The total number of cells that are counted are different, which seems a little more concerning (and likely explain why the rest of the results also differ):

We're now accounting for 2,000 fewer cells @ 30m2 each. My guess is it's a boundary thing. Would be interested in speculation on why the results differ, so we at least have some justification for the change. For all we know, it may be more accurate, but we've probably got to account for that 60k m2 before we commit to it.

You might be able to do a simple test by supplying a very small AoI with just a handful of cells in it and seeing the number of total cells counted.

For reference, I summed with:

let sumOld = 0;
let sumNew = 0;

for (let x in oldValues.result) {
  sumOld += oldValues.result[x];
}
for (let x in newValues.result) {
  sumNew += newValues.result[x];
}
rajadain commented 7 years ago

Using smaller AoIs, like the 71 km² HUC-12 Middle Conestoga River near Leacock, PA, the results are identical. Going to try with a HUC-10 now.

rajadain commented 7 years ago

Just tried the 698 km² HUC-10 Conestoga River near Leacock PA, and staging counts 774661, while develop only counts 748228. The difference of 26433 cells is almost ~3.5%, so more significant than we've seen so far.

Perhaps we should hold off on this until we can explain this properly.

The main difference here is between clipping using GeometryCollection(polygons).envelope and polygons.unionGeometries.asMultiPolygon.get.

kellyi commented 7 years ago

Wonder if there's some quirky difference between the intersection methods for Extent and MultiPolygon types?

rajadain commented 7 years ago

@kellyi seems to have hit the nail on the head. Intersection for extent vs multiPolygon

rajadain commented 7 years ago

Investigated this a bit more with @echeipesh today, and will continue tomorrow. Using the Conestoga River HUC-10 which had a difference of 3.5%, and running both types of intersections, we get the following tiles:

huc10-conestoga-envelope-intersect.geojson.txt

image

huc10-conestoga-multipolygon-intersect.geojson.txt

image

And for reference, here is the Scala Worksheet we're working with: fetch-less.sc.txt

Once we figure out the reprojection errors in the tiles (which hopefully isn't a symptom of something worse) we may be able to pinpoint the missing tile, sourcing the missing values.

rajadain commented 7 years ago

Here are the corresponding images for the Schuylkill HUC-08, which had a difference of 0.1%:

image

image

echeipesh commented 7 years ago

This is visualization of the query in Albers projection: image

echeipesh commented 7 years ago

Bug found and fixed: https://github.com/locationtech/geotrellis/pull/2213

rajadain commented 7 years ago

As the above fix will be in GeoTrellis 1.1, and likely will not be back-ported to 0.10, we cannot use the multipolygon intersection optimization in the current implementation. Thus I am closing this pull request.

As we are considering upgrading to GeoTrellis 1.1 for the Collections API as part of https://github.com/WikiWatershed/model-my-watershed/pull/1919, it is possible we may still get to use this functionality in the future.