WikiWatershed / mmw-geoprocessing

A Spark Job Server job for Model My Watershed geoprocessing.

Only download tiles once #9

Closed · jamesmcclain closed this 9 years ago

jamesmcclain commented 9 years ago

Instead of downloading the tiles once for every (multi)polygon in the input list, the tiles are now only downloaded once.

This was achieved by reorganizing the code so that the loop over the RDDs of tiles is the outermost loop.
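
A minimal sketch of the restructuring (hypothetical names; plain sequences stand in for the real GeoTrellis layers and Spark RDDs):

```scala
object LoopOrderSketch {
  type Tile = Int
  type Polygon = String

  // Before: the tile fetch sits inside the loop over polygons, so the
  // (expensive) download is repeated once per (multi)polygon.
  def before(polygons: Seq[Polygon], fetchTiles: () => Seq[Tile]): Map[Polygon, Seq[Tile]] =
    polygons.map { polygon =>
      val tiles = fetchTiles() // repeated download
      polygon -> tiles
    }.toMap

  // After: the fetch is hoisted out of the loop, so the tiles are
  // downloaded once and every polygon is processed against them.
  def after(polygons: Seq[Polygon], fetchTiles: () => Seq[Tile]): Map[Polygon, Seq[Tile]] = {
    val tiles = fetchTiles() // single download
    polygons.map(polygon => polygon -> tiles).toMap
  }
}
```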

Connects #7

To test

  1. Set up and run spark-jobserver (see the "old instructions" section of the PR description of https://github.com/WikiWatershed/model-my-watershed/pull/713 for instructions).
  2. Repeat the following steps for this branch, as well as for dbbbc18f4de (which is the current head of develop at the time of writing).
    1. Compile the code by typing sbt compile assembly from within the project directory.
    2. Submit the jar to SJS:
      • curl --data "" 'http://localhost:8090/contexts/summary-context'
      • curl --data-binary @summary/target/scala-2.10/mmw-geoprocessing-assembly-0.1.1.jar 'http://localhost:8090/jars/summary'
    3. Run the following command five or six times to warm up your JVM: curl --data-binary @examples/request.json 'http://localhost:8090/jobs?sync=true&context=summary-context&appName=summary&classPath=org.wikiwatershed.mmw.geoprocessing.SummaryJob'
    4. Now run the same command a few times, this time taking timings: time curl --data-binary @examples/request.json 'http://localhost:8090/jobs?sync=true&context=summary-context&appName=summary&classPath=org.wikiwatershed.mmw.geoprocessing.SummaryJob'. On real hardware, this typically takes a little under one second for me.
    5. Now attempt a query that asks for the histograms of seven identical polygons: time curl --data-binary @examples/request-7x.json 'http://localhost:8090/jobs?sync=true&context=summary-context&appName=summary&classPath=org.wikiwatershed.mmw.geoprocessing.SummaryJob'. The old version of the code will take more than 10 seconds, so the synchronous request will time out. On real hardware, the new version of the code will complete in about one second.

Another way to test: overwrite the old geoprocessing jar in the /opt/geoprocessing directory of the worker VM with the jar generated by this code, then reload the worker and try the site.

mmcfarland commented 9 years ago

My testing approach so far:

What I see is that 2 additional tile requests are made for each BMP added, on top of some baseline number of requests for the AoI in general. I expected the number of requests to be constant across any number of additional polygons.

I'm not sure if this testing approach is accurate.

I'm mostly concerned that I may be testing the old jar. Any ideas on how to force a particular jar, or otherwise verify that the new code is in there?

mmcfarland commented 9 years ago

Another strange item from the logs: all of the "tile" requests I'm seeing in the log are the same (the baseline requests and the additional 2 per modification polygon).

Sending Request: GET https://com.azavea.datahub.s3.amazonaws.com /catalog/nlcd-wm-ext-tms/11/922000

Not sure if that's just referencing a bucket and not an object, or what. My AoI is very small, so it shouldn't be downloading very many tiles.

jamesmcclain commented 9 years ago

Hmm ... I think one thing to do is to delete SJS's cache directory (where it keeps downloaded jars and cached results).

mmcfarland commented 9 years ago

Realized that the 2 requests per modification are 1 for fake-soil (read: NLCD) and 1 for nlcd (read: NLCD).
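
If that is right, the counts reported above (a fixed baseline plus 2 per modification) fit a simple model in which the old code read every layer once per polygon. A toy Scala sketch, purely illustrative and not the actual MMW code:

```scala
// Illustrative toy model of tile-request counts, not the real geoprocessing code.
object RequestCountModel {
  val layers = Seq("nlcd", "fake-soil") // both ultimately read NLCD tiles

  // Old behavior: each modification polygon triggers one read per layer,
  // on top of a fixed baseline of requests for the AoI itself.
  def oldCount(baseline: Int, modifications: Int): Int =
    baseline + layers.size * modifications

  // New behavior: each layer is read once, so the count is independent
  // of how many modification polygons are in the request.
  def newCount(baseline: Int): Int = baseline
}
```

With layers.size == 2, oldCount grows by 2 per added BMP, which matches the observed behavior; after the fix, the count should stay at the baseline.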

mmcfarland commented 9 years ago

@jamesmcclain do you know the path for that directory offhand?

jamesmcclain commented 9 years ago

I just emptied (but did not delete) /opt/spark-jobserver/filedao and /opt/spark-jobserver/jars. I haven't tested it yet.

jamesmcclain commented 9 years ago

> Realized that the 2 requests per modification are 1 for fake-soil (read: NLCD) and 1 for nlcd (read: NLCD).

If I had a better understanding of how to translate the filenames to locations, it would be clearer what is happening. It is possible that all of the results came from the old jar because (as Hector mentioned) I think SJS's cached version of the jar needs to be replaced or removed in addition to the one in /opt/geoprocessing.

Tomorrow I am going to set up the server on my host and point my worker VM at it, as I did when first working on this, to make sure that I am communicating with the correct jar.

I am certain that the timing results I gave above are accurate, though, so that is at least something.

jamesmcclain commented 9 years ago

> Bump the version of mmw-geoprocessing to 0.1.2 and compile

I overlooked this before. In order for this to work, you must also edit this line: https://github.com/WikiWatershed/model-my-watershed/blob/develop/src/mmw/mmw/settings/base.py#L362 (and possibly others).

mmcfarland commented 9 years ago

Thanks @jamesmcclain and @hectcastro for helping me understand how SJS job versioning works. Once I got that sorted out, I was able to confirm that the number of tile requests remains constant regardless of the number of modifications added to the AoI, and that the previous behavior was to request more tiles.

:+1: Nice improvement in a concise bit of code.

I think we should cut a 0.1.2 release for this, and I am interested in the mechanism for doing so (I made local changes to that effect, but I'm not sure whether there are additional publishing steps beyond that).

jamesmcclain commented 9 years ago

For me, the easiest way to test is to overwrite these two files:

/opt/geoprocessing/mmw-geoprocessing-0.1.1.jar
/opt/spark-jobserver/filedao/data/geoprocessing-0.1.1-2015-09-30T22_11_19.044Z.jar

on the worker VM and reload it. (The second file could well have a different name.)

When this is combined with https://github.com/WikiWatershed/model-my-watershed/issues/870, we should start to see a real difference.