Open rajadain opened 1 month ago
I've generated a flame graph for the run of the Lower Schuylkill example:
The slowdown relative to Python appears to be arising from the fact that you're doing two reproject operations: once in MosaicRasterSource and once in StacAssetReprojectRasterSource. I'm not clear on whether both operations are necessary, but if this can be consolidated down to a single reproject step, then you should be in much better shape, since those two operations are currently taking up 35% and 48% of the total runtime (though I can't guarantee that runtime is exactly accurate, since I'm not a Java Flight Recorder expert).
Edit: looking at this more closely, I misspoke: there's a resample operation and a reproject operation, both arising from MosaicRasterSource, that should be joinable. I know from experience that reprojection should be able to resample in a single operation.
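To illustrate what "resample in a single operation" could mean, here is a minimal sketch in plain Scala on a toy 1-D raster. It makes no GeoTrellis calls; `Grid`, `warp`, and the `toSrc` coordinate mapping are hypothetical names invented for this illustration, and nearest-neighbour sampling is assumed. The two-pass path materialises an intermediate array in the target CRS and then resamples it, while the single-pass path samples the source directly at each target cell.

```scala
object SinglePassWarp {
  // Toy 1-D grid: cell i covers [origin + i*cellSize, origin + (i+1)*cellSize).
  final case class Grid(origin: Double, cellSize: Double, cols: Int) {
    def center(i: Int): Double = origin + (i + 0.5) * cellSize
    def index(x: Double): Int = math.floor((x - origin) / cellSize).toInt
  }

  // Nearest-neighbour sample of `data` (laid out on `src`) at every cell of `dst`.
  // `toSrc` maps a coordinate in dst's CRS into src's CRS (hypothetical stand-in
  // for an inverse CRS transform). Cells falling outside `src` get `noData`.
  def warp(data: Array[Int], src: Grid, dst: Grid,
           toSrc: Double => Double, noData: Int = 0): Array[Int] =
    Array.tabulate(dst.cols) { i =>
      val j = src.index(toSrc(dst.center(i)))
      if (j >= 0 && j < src.cols) data(j) else noData
    }

  def main(args: Array[String]): Unit = {
    val src = Grid(0.0, 1.0, 4)
    val data = Array(10, 20, 30, 40)
    val toSrc: Double => Double = _ / 2.0   // target coordinates are 2x source

    // Two passes: reproject onto an intermediate target-CRS grid, then resample.
    val intermediate = Grid(0.0, 2.0, 4)
    val target = Grid(0.0, 4.0, 2)
    val reprojected = warp(data, src, intermediate, toSrc)
    val twoPass = warp(reprojected, intermediate, target, identity)

    // One pass: sample the source directly at each target cell.
    val onePass = warp(data, src, target, toSrc)

    println(twoPass.mkString(","))
    println(onePass.mkString(","))
  }
}
```

In this small case both paths produce the same values, but that is not guaranteed in general, since the two-pass path rounds to a cell twice. The single pass avoids allocating and filling the intermediate array.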
Thanks! Taking a look at this now. How did you generate this flamegraph? Would be useful to do as I iterate on this.
Looks like we're reprojecting on line 83, but also passing the targetCRS as a parameter in lines 88 and 90:
If I remove the reprojection from line 83, I get an empty output (albeit much faster, likely because it's not reading any data 😅)
diff --git a/api/src/main/scala/package.scala b/api/src/main/scala/package.scala
index 199fce0..010a792 100644
--- a/api/src/main/scala/package.scala
+++ b/api/src/main/scala/package.scala
@@ -80,14 +80,14 @@ package object geoprocessing {
       sources match {
         case head :: Nil => head.some
         case head :: tail =>
-          val reprojectedSources = NonEmptyList.of(head, tail: _*).map(_.reproject(targetCRS))
-          val attributes = reprojectedSources.toList.attributesByName
+          val sources = NonEmptyList.of(head, tail: _*)
+          val attributes = sources.toList.attributesByName
           val mosaicRasterSource =
             if (parallelMosaicEnabled)
-              MosaicRasterSourceIO.instance(reprojectedSources, targetCRS, collectionName, attributes)(IORuntime.global)
+              MosaicRasterSourceIO.instance(sources, targetCRS, collectionName, attributes)(IORuntime.global)
             else
-              MosaicRasterSource.instance(reprojectedSources, targetCRS, collectionName, attributes)
+              MosaicRasterSource.instance(sources, targetCRS, collectionName, attributes)
           mosaicRasterSource.some
         case _ => None
time ./scripts/run_geotrellis examples/LowerSchuylkill.geojson
{}
________________________________________________________
Executed in    6.81 secs      fish           external
   usr time   20.76 millis    0.10 millis   20.65 millis
   sys time   21.00 millis    1.04 millis   19.95 millis
I'll see if I can figure out how to do the resampling only once while still selecting the right data. Since the run above likely reads no data and still takes 6.8 s, that may be a timing floor below which we cannot go.
Overview
Adds a Stac Summary endpoint that takes a shape, a year, a STAC URI and a STAC Collection, and returns the histogram of pixels intersecting the AoI.
There is also a sister repository https://github.com/rajadain/mmw-io-10m-lulc-summary which has helper scripts to exercise this new endpoint, as well as to run a Python-based comparison for two sample shapes.
Currently, the Python implementation is faster:
with the GeoTrellis implementation operating at 0.5-0.75x the Python speed.
The resulting numbers are quite comparable though:
Note that "List(0)" represents NODATA values which are present in the Python output but not in the GeoTrellis one. Ultimately they are ignored so it doesn't matter.
The close percentage values are especially promising, because that is what is used in Model My Watershed.
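As a rough illustration of why the NODATA bin does not affect the comparison, here is a minimal sketch in plain Scala (no GeoTrellis calls) that turns a pixel-count histogram into per-class percentages after dropping NODATA. The helper `toPercentages`, the assumption that NODATA is encoded as 0, and the sample counts are all hypothetical and not taken from either implementation.

```scala
object HistogramPercentages {
  // Hypothetical helper: convert class -> pixel count into
  // class -> percentage of valid (non-NODATA) pixels.
  // Assumes NODATA is encoded as 0; adjust `noData` if it is not.
  def toPercentages(counts: Map[Int, Long], noData: Int = 0): Map[Int, Double] = {
    val valid = counts - noData          // drop the NODATA bin if present
    val total = valid.values.sum
    if (total == 0L) Map.empty
    else valid.map { case (cls, n) => cls -> (n * 100.0 / total) }
  }

  def main(args: Array[String]): Unit = {
    // Made-up counts for illustration only.
    val withNoData = Map(0 -> 50L, 1 -> 300L, 2 -> 100L)
    val withoutNoData = Map(1 -> 300L, 2 -> 100L)

    // Both inputs yield the same percentages once NODATA is dropped.
    println(toPercentages(withNoData))
    println(toPercentages(withoutNoData))
  }
}
```

Because the percentages are computed over valid pixels only, a histogram that reports a NODATA bin and one that omits it give the same result.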
Closes #113
Demo
Notes
I'm looking for help in three places:
Rasterizer.foreachCellByMultiPolygon() to count the pixels. Here we're doing a mask and then histogram.binCount(). Is this implementation correct?
Testing Instructions
./scripts/update && ./scripts/server
./scripts/setup
./scripts/run_python examples/LowerSchuylkill.geojson to get a baseline
./scripts/run_geotrellis examples/LowerSchuylkill.geojson to exercise this new endpoint