WikiWatershed / mmw-geoprocessing

A Spark Job Server job for Model My Watershed geoprocessing.
Apache License 2.0

Implement Raster Grouped Average operation #65

Closed kellyi closed 6 years ago

kellyi commented 6 years ago

Overview

This PR implements the "RasterGroupedAverage" operation. Behind the API there are two ops: `rasterAverage` for requests that don't include a list of rasters, and `rasterGroupedAverage` for those that do. Some implementation details are in the "Notes" section below.

Connects #52

Demo

https://github.com/WikiWatershed/mmw-geoprocessing/issues/52 has two sample inputs: one for the rasterGroupedAverage and another for rasterAverage.

Current output for the `rasterGroupedAverage` input:

```
{
  "result": {
    "List(42)": 0.23512719654500847,
    "List(22)": 0.21115289390947675,
    "List(43)": 0.2440073603213253,
    "List(71)": 0.26802137526366693,
    "List(41)": 0.268620081816594,
    "List(21)": 0.23515533595797652,
    "List(24)": 0.1859889512692598,
    "List(31)": 0.27985624785802954,
    "List(90)": 0.25720720833308114,
    "List(52)": 0.27157785493664266,
    "List(11)": 0.2281515911560167,
    "List(23)": 0.20484685031909086,
    "List(82)": 0.27604142433856804,
    "List(81)": 0.2714132968948771,
    "List(95)": 0.31307519406197576
  }
}
```

Current output for the `rasterAverage` op:

```
{
  "result": {
    "List(0)": 9.937211446569115
  }
}
```

Notes

As noted in the comments on #52, I initially had trouble getting rasterGroupedAverage to return the same results on every run: the values were very close to the desired output, but they were off by a tenth or so each time and were not deterministic.

It turns out this was because I was storing the values as a list of doubles and then computing list.sum / list.length to get the final value. These ops use .par to parallelize, and I was updating the values by taking the old list and concatenating the new target value onto it. I think that, when run in parallel, some of these intermediate updates were overwritten by later ones.
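The overwriting described above is a lost update. Here is a minimal sketch of it, written in Java rather than the project's Scala, that replays the problematic interleaving by hand so the outcome is reproducible. The class, method, and variable names are hypothetical and this is not the code from the PR:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical illustration, not the PR's code: replays sequentially the
// interleaving that two parallel workers can produce when each one
// "updates" a shared list by copying an old snapshot and appending to it.
class LostUpdateSketch {
    // Shared accumulator, replaced wholesale on every update.
    static List<Double> shared = Collections.emptyList();

    // Copy the given snapshot and append one value to the copy.
    static List<Double> appended(List<Double> snapshot, double value) {
        List<Double> next = new ArrayList<>(snapshot);
        next.add(value);
        return next;
    }

    // Equivalent of list.sum / list.length.
    static double average(List<Double> values) {
        double sum = 0.0;
        for (double v : values) {
            sum += v;
        }
        return sum / values.size();
    }

    static List<Double> replayRace() {
        shared = Collections.emptyList();
        List<Double> seenByA = shared;   // worker A reads the empty list
        List<Double> seenByB = shared;   // worker B reads the same empty list
        shared = appended(seenByA, 0.2); // A writes back a list holding 0.2
        shared = appended(seenByB, 0.4); // B writes back a list holding 0.4; A's update is gone
        return shared;
    }
}
```

In this replay the final list holds only one of the two values, so the average comes from an incomplete sample. With real threads, which updates get dropped depends on timing, which would be consistent with results that are close to the expected value but change from run to run.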

To fix it while keeping the .par call, I ended up using a tuple of a DoubleAdder and a LongAdder to store the accumulated target values and the count of values, respectively, since the ...Adder classes are designed to be thread-safe:

https://docs.oracle.com/javase/8/docs/api/java/util/concurrent/atomic/DoubleAdder.html
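As a rough illustration of the adder-based approach, here is a minimal Java sketch that keeps one `DoubleAdder` (sum) and one `LongAdder` (count) per group and fills them from a parallel stream. It makes several assumptions: the parallel stream stands in for Scala's `.par`, the integer group keys simplify the `List(...)` keys in the output above, and `GroupedAverageSketch`, `SumAndCount`, and `groupedAverage` are hypothetical names, not the PR's actual code:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.DoubleAdder;
import java.util.concurrent.atomic.LongAdder;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

// Hypothetical sketch: per-group averages accumulated from parallel workers.
class GroupedAverageSketch {
    // Pair of thread-safe accumulators: running sum and count for one group.
    static final class SumAndCount {
        final DoubleAdder sum = new DoubleAdder();
        final LongAdder count = new LongAdder();
    }

    // groups[i] is the group key of cell i; values[i] is the value to average.
    static Map<Integer, Double> groupedAverage(int[] groups, double[] values) {
        Map<Integer, SumAndCount> acc = new ConcurrentHashMap<>();

        // Parallel traversal: every worker adds into the shared adders, so no
        // update is dropped even when two workers hit the same group at once.
        IntStream.range(0, groups.length).parallel().forEach(i -> {
            SumAndCount cell = acc.computeIfAbsent(groups[i], k -> new SumAndCount());
            cell.sum.add(values[i]);
            cell.count.increment();
        });

        // Read the totals only after the parallel pass has finished.
        return acc.entrySet().stream().collect(Collectors.toMap(
            Map.Entry::getKey,
            e -> e.getValue().sum.sum() / e.getValue().count.sum()));
    }
}
```

For example, calling `groupedAverage` with groups `{11, 21, 11, 21, 11}` and values `{0.25, 0.5, 0.75, 1.0, 0.5}` gives an average of 0.5 for group 11 and 0.75 for group 21. One caveat from the `DoubleAdder` documentation: because floating-point addition is not strictly associative, the sum need not be bit-for-bit identical to a sequential sum, although every value is counted.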

Testing

kellyi commented 6 years ago

Thanks for your help with this! I made the changes suggested above and squashed them into one commit, then tested everything again to ensure it still works. It still does, so I'm going to merge.