Closed rajadain closed 6 years ago
Setting this up to take a look!
$ ./scripts/benchmark --rs
Timing HUC8 RasterSummary ->
HUC8 RasterSummary, run 1 -> 9.404321 s
HUC8 RasterSummary, run 2 -> 5.998905 s
HUC8 RasterSummary, run 3 -> 6.975526 s
HUC8 RasterSummary, run 4 -> 5.116356 s
HUC8 RasterSummary, run 5 -> 4.945487 s
HUC8 RasterSummary average -> 6.488119 s
$ http :8090/run < examples/MapshedJob_RasterSummary.json
HTTP/1.1 200 OK
Content-Length: 124
Content-Type: application/json
Date: Tue, 02 Jan 2018 16:12:40 GMT
Server: akka-http/10.0.9
{
"result": [
{
"avg": 799.9116994958924,
"max": 3430.0,
"min": -38.0
},
{
"avg": 1.2253132788318364,
"max": 28.67453956604004,
"min": 0.0
}
]
}
Working well! Going to take a look at the code now.
I'd be interested in very quickly doing https://github.com/WikiWatershed/mmw-geoprocessing/issues/56 since the current README doesn't have accurate instructions wrt setting up this project for development. I believe we can dump the Usage
section altogether and replace it with:
...each of which should be pretty small, mostly just enumerating things
I'd prefer to do #56 on its own since this is already sharing points with https://github.com/WikiWatershed/model-my-watershed/issues/2346, but I agree its been left for quite some time and should be done soon.
Since this comment was clobbered by a code change, I'm rewriting it here for easy searching
All these operations seem to be working well and I don't think we should rewrite them, but when I looked up
isNoData
to check what it did I encountered some built-in summary ops in GeoTrellis:Didn't delve too far into figuring out how to use them but the names, at least, suggest that they may be able to do similar operations. That said, this is working as expected so unless there are major performance gains to be had from using the built-in GeoTrellis versions I don't think we should rewrite. May be worth asking in #geotrellis
I paired with @jamesmcclain for a bit this afternoon and this is what we came up with:
private def rasterSummary2(
rasterLayers: Seq[TileLayerCollection[SpatialKey]],
multiPolygon: MultiPolygon,
opts: Rasterizer.Options
): Seq[Map[String, Double]] = {
val p = multiPolygon.polygons.head
val mins = rasterLayers.map { rl =>
MinDoubleSummary.combineResults(rl.map { case (key, tile) =>
val extent = rl.metadata.mapTransform(key)
val rt = Raster(tile, extent)
MinDoubleSummary.handlePartialTile(rt, p)
})
}
val means = rasterLayers.map { rl =>
MeanSummary.combineResults(rl.map { case (key, tile) =>
val extent = rl.metadata.mapTransform(key)
val rt = Raster(tile, extent)
MeanSummary.handlePartialTile(rt, p)
}).mean
}
val maxs = rasterLayers.map { rl =>
MaxDoubleSummary.combineResults(rl.map { case (key, tile) =>
val extent = rl.metadata.mapTransform(key)
val rt = Raster(tile, extent)
MaxDoubleSummary.handlePartialTile(rt, p)
})
}
(mins zip means zip maxs) map { case ((min, avg), max) =>
Map(
"min" -> min,
"avg" -> avg,
"max" -> max
)
}
}
Here's a patch file of the entire change: rasterPolygonalSummary.patch.txt
We found that the performance was slightly slower (but probably within noise range):
$ kj benchmark --rs
Timing HUC8 RasterSummary ->
HUC8 RasterSummary, run 1 -> 6.619486 s
HUC8 RasterSummary, run 2 -> 6.111745 s
HUC8 RasterSummary, run 3 -> 5.645586 s
HUC8 RasterSummary, run 4 -> 5.995264 s
HUC8 RasterSummary, run 5 -> 5.654279 s
HUC8 RasterSummary average -> 6.005272 s
I think we should stick with the code already in this PR for the following reasons:
But this was still a very educational exercise and I feel more confident in our code thanks to it.
Thanks for taking a look! Going to squash the fixups and merge shortly.
Overview
Adds
RasterSummary
operation, needed to generate summary statistics (min, average, and max) for a list of layers within a given shape.In accordance with the
LongAdder
andDoubleAdder
used for counting and averaging inRasterGroupedAverage
, we use aDoubleAccumulator
for both min and max values, with customMinWithoutNoData
andMaxWithoutNoData
implementations that ignoreNoData
values and only count the rest.This will only work with
Double
type layers, not withInt
layers.Connects #78 Connects https://github.com/WikiWatershed/model-my-watershed/issues/2346
Demo
Thanks to @arottersman who confirmed the first set of values above by running GDAL using the same shape on the original TIFF and getting:
Notes
This is the only operation that returns a result not in this format:
thus will need a custom parser in Django / Celery.
Testing Instructions