WikiWatershed / mmw-geoprocessing

A Spark Job Server job for Model My Watershed geoprocessing.
Apache License 2.0

Collections API: RasterLinesJoin improvements #72

Open kellyi opened 6 years ago

kellyi commented 6 years ago

Spoke with @lossyrob a bit about how we might improve the performance of RasterLinesJoin and he pointed out two optimizations we could make:

Reducing the number of stream lines looped over per tile

Currently we loop over the whole set of MultiLines for each tile here: https://github.com/WikiWatershed/mmw-geoprocessing/blob/develop/api/src/main/scala/Geoprocessing.scala#L106

However, we could loop over only the subset of lines that actually intersect the tile. Depending on how many tiles there are for an AOI, this would reduce the number of iterations of the lines loop, since for each tile we'd only process lines that can contribute values to it.

We'd need to benchmark this, since any gain could be offset by the extra pass over the lines needed to do the intersection operation first.
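A minimal sketch of the per-tile filtering idea, in plain Scala with no GeoTrellis dependency. Everything here is hypothetical: `Point`, `Extent`, `Line`, and `candidates` are stand-ins invented for illustration, not the project's or the library's real classes. It also assumes a bounding-box overlap test is an acceptable cheap proxy for a true geometric intersection, which may let a few non-intersecting lines through.

```scala
object TileFilterSketch {
  // Hypothetical stand-ins for the real geometry types.
  final case class Point(x: Double, y: Double)

  final case class Extent(xmin: Double, ymin: Double, xmax: Double, ymax: Double) {
    // True when the two bounding boxes overlap or touch.
    def intersects(other: Extent): Boolean =
      xmin <= other.xmax && other.xmin <= xmax &&
        ymin <= other.ymax && other.ymin <= ymax
  }

  // Assumes a line always has at least one point.
  final case class Line(points: Vector[Point]) {
    // Bounding box of the line, computed once and reused for every tile.
    lazy val envelope: Extent = Extent(
      points.map(_.x).min, points.map(_.y).min,
      points.map(_.x).max, points.map(_.y).max
    )
  }

  // Keep only the lines whose envelope overlaps the tile extent.
  // This can keep a line that narrowly misses the tile, but it never
  // drops a line that actually crosses it.
  def candidates(tileExtent: Extent, lines: Seq[Line]): Seq[Line] =
    lines.filter(_.envelope.intersects(tileExtent))
}
```

The inner loop would then run over `candidates(tileExtent, lines)` instead of the full set. This is still one envelope check per line per tile, so whether it wins overall depends on how expensive the existing per-line work is; a spatial index would be the next step if the filter itself becomes the bottleneck.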

Using Lines rather than MultiLines

Currently we do some processing on the input to transform the input stream vectors into MultiLines: https://github.com/WikiWatershed/mmw-geoprocessing/blob/develop/api/src/main/scala/Utils.scala#L120

However, apparently GT unspools the MultiLines into Lines anyway, so we could flatMap the stream vectors into a Seq[Line] up front and then try using something like a forEachByLineString method in the loop.
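The flattening step could look roughly like the following. This is only a sketch in plain Scala: `Line`, `MultiLine`, `toLines`, and `foreachLine` are hypothetical stand-ins defined here for illustration, and the callback simply marks where a per-line rasterization call would go.

```scala
object FlattenSketch {
  // Hypothetical stand-ins for the real geometry types.
  final case class Line(points: Vector[(Double, Double)])
  final case class MultiLine(lines: Vector[Line])

  // Flatten once, up front, so the per-tile loop sees plain Lines.
  def toLines(multiLines: Seq[MultiLine]): Seq[Line] =
    multiLines.flatMap(_.lines)

  // Placeholder for the per-tile loop: apply `f` to each individual line.
  // In the real job, `f` would be the per-line rasterization callback.
  def foreachLine(lines: Seq[Line])(f: Line => Unit): Unit =
    lines.foreach(f)
}
```

Flattening outside the tile loop means the unspooling happens once per request rather than once per tile.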