WikiWatershed / mmw-geoprocessing

A Spark Job Server job for Model My Watershed geoprocessing.
Apache License 2.0
6 stars 6 forks source link

Add RasterSummary Operation #79

Closed rajadain closed 6 years ago

rajadain commented 6 years ago

Overview

Adds RasterSummary operation, needed to generate summary statistics (min, average, and max) for a list of layers within a given shape.

In accordance with the LongAdder and DoubleAdder used for counting and averaging in RasterGroupedAverage, we use a DoubleAccumulator for both min and max values, with custom MinWithoutNoData and MaxWithoutNoData implementations that ignore NoData values and only count the rest.

This will only work with Double type layers, not with Int layers.

Connects #78 Connects https://github.com/WikiWatershed/model-my-watershed/issues/2346

Demo

$ http :8090/run < examples/MapshedJob_RasterSummary.json

HTTP/1.1 200 OK
Content-Length: 124
Content-Type: application/json
Date: Fri, 29 Dec 2017 23:08:53 GMT
Server: akka-http/10.0.9

{
    "result": [
        {
            "avg": 799.9116994958924,
            "max": 3430.0,
            "min": -38.0
        },
        {
            "avg": 1.2253132788318364,
            "max": 28.67453956604004,
            "min": 0.0
        }
    ]
}

Thanks to @arottersman who confirmed the first set of values above by running GDAL using the same shape on the original TIFF and getting:

Band 1 Block=475x4 Type=Int32, ColorInterp=Gray
  Description = elev_cm
  Minimum=-38.000, Maximum=3430.000, Mean=799.912, StdDev=630.396
  NoData Value=-2147483647
  Metadata:
    LAYER_TYPE=athematic
    STATISTICS_MAXIMUM=3430
    STATISTICS_MEAN=799.91169949589
    STATISTICS_MINIMUM=-38
    STATISTICS_STDDEV=630.39608483669
$ ./scripts/benchmark --rs

Timing HUC8 RasterSummary ->

HUC8 RasterSummary, run 1 -> 5.618318 s
HUC8 RasterSummary, run 2 -> 4.926943 s
HUC8 RasterSummary, run 3 -> 6.105659 s
HUC8 RasterSummary, run 4 -> 4.520898 s
HUC8 RasterSummary, run 5 -> 4.682257 s

HUC8 RasterSummary average -> 5.170815 s

Notes

This is the only operation that returns a result not in this format:

{
  "List(1,2)": 3,
  "List(4,5)": 6
}

thus will need a custom parser in Django / Celery.

Testing Instructions

kellyi commented 6 years ago

Setting this up to take a look!

kellyi commented 6 years ago
$ ./scripts/benchmark --rs
Timing HUC8 RasterSummary ->

HUC8 RasterSummary, run 1 -> 9.404321 s
HUC8 RasterSummary, run 2 -> 5.998905 s
HUC8 RasterSummary, run 3 -> 6.975526 s
HUC8 RasterSummary, run 4 -> 5.116356 s
HUC8 RasterSummary, run 5 -> 4.945487 s

HUC8 RasterSummary average -> 6.488119 s
kellyi commented 6 years ago
$ http :8090/run < examples/MapshedJob_RasterSummary.json
HTTP/1.1 200 OK
Content-Length: 124
Content-Type: application/json
Date: Tue, 02 Jan 2018 16:12:40 GMT
Server: akka-http/10.0.9

{
    "result": [
        {
            "avg": 799.9116994958924,
            "max": 3430.0,
            "min": -38.0
        },
        {
            "avg": 1.2253132788318364,
            "max": 28.67453956604004,
            "min": 0.0
        }
    ]
}
kellyi commented 6 years ago

Working well! Going to take a look at the code now.

I'd be interested in very quickly doing https://github.com/WikiWatershed/mmw-geoprocessing/issues/56 since the current README doesn't have accurate instructions wrt setting up this project for development. I believe we can dump the Usage section altogether and replace it with:

...each of which should be pretty small, mostly just enumerating things

rajadain commented 6 years ago

I'd prefer to do #56 on its own since this is already sharing points with https://github.com/WikiWatershed/model-my-watershed/issues/2346, but I agree its been left for quite some time and should be done soon.

rajadain commented 6 years ago

Since this comment was clobbered by a code change, I'm rewriting it here for easy searching

All these operations seem to be working well and I don't think we should rewrite them, but when I looked up isNoData to check what it did I encountered some built-in summary ops in GeoTrellis:

https://github.com/locationtech/geotrellis/tree/master/raster/src/main/scala/geotrellis/raster/summary

Didn't delve too far into figuring out how to use them but the names, at least, suggest that they may be able to do similar operations. That said, this is working as expected so unless there are major performance gains to be had from using the built-in GeoTrellis versions I don't think we should rewrite. May be worth asking in #geotrellis

I paired with @jamesmcclain for a bit this afternoon and this is what we came up with:

private def rasterSummary2(
  rasterLayers: Seq[TileLayerCollection[SpatialKey]],
  multiPolygon: MultiPolygon,
  opts: Rasterizer.Options
): Seq[Map[String, Double]] = {

  val p = multiPolygon.polygons.head

  val mins = rasterLayers.map { rl =>
    MinDoubleSummary.combineResults(rl.map { case (key, tile) =>
      val extent = rl.metadata.mapTransform(key)
      val rt = Raster(tile, extent)
      MinDoubleSummary.handlePartialTile(rt, p)
    })
  }
  val means = rasterLayers.map { rl =>
    MeanSummary.combineResults(rl.map { case (key, tile) =>
        val extent = rl.metadata.mapTransform(key)
        val rt = Raster(tile, extent)
        MeanSummary.handlePartialTile(rt, p)
    }).mean
  }
  val maxs = rasterLayers.map { rl =>
    MaxDoubleSummary.combineResults(rl.map { case (key, tile) =>
      val extent = rl.metadata.mapTransform(key)
      val rt = Raster(tile, extent)
      MaxDoubleSummary.handlePartialTile(rt, p)
    })
  }

  (mins zip means zip maxs) map { case ((min, avg), max) =>
    Map(
      "min" -> min,
      "avg" -> avg,
      "max" -> max
    )
  }
}

Here's a patch file of the entire change: rasterPolygonalSummary.patch.txt

We found that the performance was slightly slower (but probably within noise range):

$ kj benchmark --rs
Timing HUC8 RasterSummary ->

HUC8 RasterSummary, run 1 -> 6.619486 s
HUC8 RasterSummary, run 2 -> 6.111745 s
HUC8 RasterSummary, run 3 -> 5.645586 s
HUC8 RasterSummary, run 4 -> 5.995264 s
HUC8 RasterSummary, run 5 -> 5.654279 s

HUC8 RasterSummary average -> 6.005272 s

I think we should stick with the code already in this PR for the following reasons:

But this was still a very educational exercise and I feel more confident in our code thanks to it.

rajadain commented 6 years ago

Thanks for taking a look! Going to squash the fixups and merge shortly.