WikiWatershed / mmw-geoprocessing

A Spark Job Server job for Model My Watershed geoprocessing.
Apache License 2.0
6 stars 6 forks source link

Crop incoming lines to Area of Interest #77

Closed rajadain closed 6 years ago

rajadain commented 6 years ago

Overview

Previously we were cropping this in PostGIS which was taking forever (see WikiWatershed/model-my-watershed#2426 and WikiWatershed/model-my-watershed#2502). Since doing it in Scala is negligibly slower than not, we shift the cropping to the geoprocessing service from PostGIS.

Connects WikiWatershed/model-my-watershed#2502 :zap:

Testing Instructions

kellyi commented 6 years ago

Q: instead of using the scripts above will the existing ./scripts/benchmark work just as well for testing purposes?

kellyi commented 6 years ago

Nevermind, looks like we don't have a benchmarking test on RasterLinesJoin? Wonder if it's worth adding one.

rajadain commented 6 years ago

I added the latest RasterLinesJoin to the benchmark script with the old behavior, then added my fix, then updated the benchmark script with the new behavior, if you'd like to use that instead.

rajadain commented 6 years ago

The order of commits in GitHub is incorrect (because it uses Author date, not Commit date). The true order is this:

* 118eeea - (HEAD -> tt/crop-lines-to-aoi, origin/tt/crop-lines-to-aoi) Update RasterLinesJoin benchmark (66 seconds ago) <Terence Tuhinanshu>
* e9e0aa6 - Crop incoming lines to area of interest (83 seconds ago) <Terence Tuhinanshu>
* 1818cb8 - Add RasterLinesJoin benchmark (83 seconds ago) <Terence Tuhinanshu>
*   de66670 - (origin/develop, origin/HEAD, develop) Merge pull request #75 from WikiWatershed/ki/improve-error-reporting (8 weeks ago) <Kelly Innes>
kellyi commented 6 years ago

Great! Thanks for adding that.

kellyi commented 6 years ago

RLJ timings on develop:

Timing HUC8 RasterLinesJoin ->

HUC8 RasterLinesJoin, run 1 -> 6.944408 s
HUC8 RasterLinesJoin, run 2 -> 1.918769 s
HUC8 RasterLinesJoin, run 3 -> 2.046422 s
HUC8 RasterLinesJoin, run 4 -> 1.342596 s
HUC8 RasterLinesJoin, run 5 -> 1.817221 s

HUC8 RasterLinesJoin average -> 2.8138832 s
kellyi commented 6 years ago

This is probably a more accurate RLJ run from develop since it leaves off the initial 7 second warmup run:

Timing HUC8 RasterLinesJoin ->

HUC8 RasterLinesJoin, run 1 -> 2.000698 s
HUC8 RasterLinesJoin, run 2 -> 1.803359 s
HUC8 RasterLinesJoin, run 3 -> 1.714622 s
HUC8 RasterLinesJoin, run 4 -> 1.753615 s
HUC8 RasterLinesJoin, run 5 -> 1.450551 s

HUC8 RasterLinesJoin average -> 1.744569 s
kellyi commented 6 years ago

Here's the benchmark timing from this branch after a few runs:

Timing HUC8 RasterLinesJoin ->

HUC8 RasterLinesJoin, run 1 -> 1.379907 s
HUC8 RasterLinesJoin, run 2 -> 1.380688 s
HUC8 RasterLinesJoin, run 3 -> 1.442743 s
HUC8 RasterLinesJoin, run 4 -> 1.693359 s
HUC8 RasterLinesJoin, run 5 -> 1.430437 s

HUC8 RasterLinesJoin average -> 1.4654268 s
kellyi commented 6 years ago

Ensure the two output files are different.

Output differs:

$ diff develop-output.json develop-clipped-streams.json
3,5c3,5
<         "List(42)": 2149,
<         "List(22)": 3252,
<         "List(43)": 2387,
---
>         "List(42)": 2215,
>         "List(22)": 3253,
>         "List(43)": 2393,
7,8c7,8
<         "List(41)": 33341,
<         "List(21)": 11553,
---
>         "List(41)": 33404,
>         "List(21)": 11558,
12,14c12,14
<         "List(90)": 14005,
<         "List(52)": 2874,
<         "List(11)": 8095,
---
>         "List(90)": 14014,
>         "List(52)": 2876,
>         "List(11)": 8128,
16,17c16,17
<         "List(82)": 4736,
<         "List(81)": 10301,
---
>         "List(82)": 4745,
>         "List(81)": 10314,
kellyi commented 6 years ago

New output from this branch matches the develop-output.json above:

screen shot 2017-11-15 at 5 50 36 pm

kellyi commented 6 years ago

This seems like it's good to go in: it works as described by the testing instructions, and the benchmarking scripts don't show much much difference in the timings. Makes me wonder whether there are other operations we're doing Python-side that could be sped up by moving them here.

mmcfarland commented 6 years ago

I apologize that this comment comes from a place of limited context, but how does this relate to the existing work done in https://github.com/WikiWatershed/model-my-watershed/pull/2506? My concern about the comparison between PostGIS vs this branch is that it may have been done in the context of VM resources in the development environment. This may indeed be an improvement, but I'd like to see a more detailed description of the decision points and observations that led to this changeset, and how it relates to the previous work proposed.

rajadain commented 6 years ago

Spoke with @mmcfarland in person about the motivations and effects, and we have a green light. I'm going to merge this and make a new release, then include it in WikiWatershed/model-my-watershed#2502.