WikiWatershed / mmw-geoprocessing

A Spark Job Server job for Model My Watershed geoprocessing.
Apache License 2.0
6 stars 6 forks source link

Add STAC Summary Endpoint for Global 10m LULC Data #117

Open rajadain opened 1 month ago

rajadain commented 1 month ago

Overview

Adds a Stac Summary endpoint that takes a shape, a year, a STAC URI and a STAC Collection, and returns the histogram of pixels intersecting the AoI.

There is also a sister repository https://github.com/rajadain/mmw-io-10m-lulc-summary which has helper scripts to exercise this new endpoint, as well as run a Python based comparison for two sample shapes.

Currently, the Python implementation is faster:

time ./scripts/run_python examples/LowerSchuylkill.geojson

{
  "List(0)": 19620912,
  "List(1)": 113354,
  "List(11)": 400816,
  "List(2)": 831411,
  "List(4)": 369,
  "List(5)": 225703,
  "List(7)": 3708295,
  "List(8)": 4074
}

________________________________________________________
Executed in   11.88 secs    fish           external
   usr time    2.47 secs    0.08 millis    2.47 secs
   sys time    2.39 secs    1.22 millis    2.39 secs
time ./scripts/run_geotrellis examples/LowerSchuylkill.geojson

{
  "List(1)": 106351,
  "List(11)": 390430,
  "List(2)": 792084,
  "List(4)": 3060,
  "List(5)": 173543,
  "List(7)": 3529171,
  "List(8)": 3357
}

________________________________________________________
Executed in   15.82 secs    fish           external
   usr time    0.38 secs  131.00 micros    0.38 secs
   sys time    3.85 secs  994.00 micros    3.85 secs

with the GeoTrellis implementation operating at 0.5-0.75x the Python speed.

The resulting numbers are quite comparable though:

image

image

Note that "List(0)" represents NODATA values which are present in the Python output but not in the GeoTrellis one. Ultimately they are ignored so it doesn't matter.

The close percentage values are especially promising, because that is what is used in Model My Watershed.

Closes #113

Demo

image
xh --print HhBb :8090/stac < scratch/request.json
POST /stac HTTP/1.1
Accept: application/json, */*;q=0.5
Accept-Encoding: gzip, deflate, br
Connection: keep-alive
Content-Length: 149983
Content-Type: application/json
Host: localhost:8090
User-Agent: xh/0.22.0

{
    "shape": "{\"type\":\"MultiPolygon\",\"coordinates\":[[[[-75.638687083659,40.3002222895293],...]]]}",
    "year": 2019,
    "stacUri": "https://api.impactobservatory.com/stac-aws",
    "stacCollection": "io-10m-annual-lulc"
}

HTTP/1.1 200 OK
Content-Length: 118
Content-Type: application/json
Date: Thu, 30 May 2024 01:14:45 GMT
Server: akka-http/10.5.3

{
    "List(2)": 792084,
    "List(5)": 173543,
    "List(11)": 390430,
    "List(7)": 3529171,
    "List(1)": 106351,
    "List(4)": 3060,
    "List(8)": 3357
}

Notes

I'm looking for help in three places:

  1. Correctness: In the Spark RDD case, we use Rasterizer.foreachCellByMultiPolygon() to count the pixels. Here we're doing a mask and then histogram.binCount(). Is this implementation correct?
  2. Performance: In my tests using the https://github.com/rajadain/mmw-io-10m-lulc-summary repo, the Python version is twice as fast as the GeoTrellis version. How can this be improved?
  3. Idiomaticness: Is the code written in an idiomatic GeoTrellis style?

Testing Instructions

jpolchlo commented 1 week ago

I've generated a flame graph for the run of the Lower Schuylkill example: flamegraph The slowdown relative to Python appears to be arising from the fact that you're doing two reproject operations: once in MosaicRasterSource and once in StacAssetReprojectRasterSource. I'm not clear on whether both operations are necessary, but if this can be consolidated down to a single reproject step, then you should be in much better shape, since those two operations are currently taking up 35% and 48% of the total runtime (though I can't guarantee that runtime is exactly accurate, since I'm not a Java Flight Recorder expert).

Edit: looking at this more closely, I misspoke: there's a resample operation and a reproject operation, both arising from MosaicRasterSource that should be joinable. I know from experience that reprojection should be able to resample in a single operation.

rajadain commented 1 week ago

Thanks! Taking a look at this now. How did you generate this flamegraph? Would be useful to do as I iterate on this.

Looks like we're reprojecting on line 83, but also passing the targetCRS as a parameter in lines 88 and 90:

https://github.com/WikiWatershed/mmw-geoprocessing/blob/025bfacbf8b29af2376ed03ddcfb3a8052486754/api/src/main/scala/package.scala#L83-L90

If I remove the reprojection from line 83, I get an empty output (albeit much faster, likely because it's not reading any data 😅)

diff --git a/api/src/main/scala/package.scala b/api/src/main/scala/package.scala
index 199fce0..010a792 100644
--- a/api/src/main/scala/package.scala
+++ b/api/src/main/scala/package.scala
@@ -80,14 +80,14 @@ package object geoprocessing {
       sources match {
         case head :: Nil => head.some
         case head :: tail =>
-          val reprojectedSources = NonEmptyList.of(head, tail: _*).map(_.reproject(targetCRS))
-          val attributes = reprojectedSources.toList.attributesByName
+          val sources = NonEmptyList.of(head, tail: _*)
+          val attributes = sources.toList.attributesByName

           val mosaicRasterSource =
             if (parallelMosaicEnabled)
-              MosaicRasterSourceIO.instance(reprojectedSources, targetCRS, collectionName, attributes)(IORuntime.global)
+              MosaicRasterSourceIO.instance(sources, targetCRS, collectionName, attributes)(IORuntime.global)
             else
-              MosaicRasterSource.instance(reprojectedSources, targetCRS, collectionName, attributes)
+              MosaicRasterSource.instance(sources, targetCRS, collectionName, attributes)

           mosaicRasterSource.some
         case _ => None
time ./scripts/run_geotrellis examples/LowerSchuylkill.geojson
{}

________________________________________________________
Executed in    6.81 secs      fish           external
   usr time   20.76 millis    0.10 millis   20.65 millis
   sys time   21.00 millis    1.04 millis   19.95 millis

I'll see if I can figure out how to do the resampling only once while still selecting the right data. The above may imply that 6.8s is a timing floor below which we cannot go.