eldang / elevation_lookups

Takes an input file of paths described as series of points and outputs a file of data about the elevation changes along those paths.
Apache License 2.0

Parallelisation (all lookups) #16

Closed · eldang closed this 3 years ago

eldang commented 3 years ago

Updated: now parallelises both kinds of lookup

dabreegster commented 3 years ago

105s with 16 threads. Initially reading the geojson dominated that; once the threads were up, it was about 1s. Nice! Still nowhere near lidar's 11s, so I'm likely going to keep preferring that one.

eldang commented 3 years ago

Initially reading the geojson dominated that;

Yeah, that seems to be a big limitation with vector data here. Though if you were to do a much larger area, say all of Seattle, in one batch it probably wouldn't matter so much. And that in turn has me wondering if I can do a windowed read of the contours file instead of reading the whole thing and then cropping.

dabreegster commented 3 years ago

And that in turn has me wondering if I can do a windowed read of the contours file instead of reading the whole thing and then cropping.

GeoJSON doesn't have any indexing -- but I'm guessing you mean that currently the entire file is read, converted to an in-memory internal representation of the geojson, and then clipped. I don't know about Python / the GDAL ecosystem specifically, but I haven't seen many streaming parsers like this in any language I've worked with.

Although the Seattle 2ft contours are a neat data-set, I wouldn't personally choose to prioritize them: they're Seattle-specific, much slower than raster methods, and don't seem to offer a big quality difference (although I'm still not sure how to measure that).

eldang commented 3 years ago

Yeah, your guess is correct. It might save time to read just a window, but because that's not as straightforward as the raster equivalent, reading the window may itself be slow unless the whole file has been indexed first. Either way, I'll shove this into the "things I might want to explore outside the scope of the A/B Street contract" bucket, along with seeing if I can bring elevation's dependencies up to date, and finding better-than-SRTM data sources for Vancouver Island....
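
For the record, a windowed vector read could look something like this sketch, assuming the fiona library and a hypothetical contours file name; since GeoJSON has no spatial index, the driver may still scan the whole file to honour the bbox:

import fiona

# Hypothetical file name; bbox order is (minx, miny, maxx, maxy).
with fiona.open("data/seattle_contours.geojson") as src:
    # filter(bbox=...) yields only the features intersecting the box, but
    # for an unindexed format like GeoJSON the whole file may still be parsed.
    features = list(src.filter(bbox=(-122.38, 47.59, -122.30, 47.63)))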

eldang commented 3 years ago

To be honest, if I'd realised at the outset how much faster raster processing was going to be, and how high the quality of that Puget Sound LIDAR data was, I'd have left all vector processing in that category. But I naïvely thought that the contours would give us significantly better results at comparable speeds.

dabreegster commented 3 years ago

Using lidar, I re-ran for montlake with n_threads=1 and got 4.5s and 4.1s -- something must've gotten faster since the 11s I measured in #12, or I didn't repeat the runs there. For downtown, n=1, 28s.

I then ran with n=16, the number of cores on my machine, and it promptly melted. Each thread sucks memory. After a hard reboot, trying n=2... for montlake, it takes 28s. The run for downtown stalls like this:

20210326 20:00 INFO:    Found 5178 rows in input/query
20210326 20:00 INFO:    Area covered: (-122.3803598614818, 47.59760147467252, -122.30276231424337, 47.63011349397687)
20210326 20:00 INFO:    Using data source: King County 2016 LIDAR, block 1 (Seattle & Lake Washington)
20210326 20:00 INFO:    Data file already saved at data/kc_2016_lidar.tif
20210326 20:00 WARNING: Existing output/query will be overwritten
20210326 20:00 INFO:    Spawning 2 threads
Thread 1 processed 2616 lines
Thread 0 processed 2562 lines
20210326 20:00 INFO:    Writing output to output/query
20210326 20:00 INFO:    Run complete in 5 seconds.

That looks like a nice speedup! But I think the main thread has some issue detecting the work is done, since it just hangs.

I'll experiment with other maps (aka query sizes) and numbers of threads. This one scales much less obviously.

eldang commented 3 years ago

Hrm, that stall is worrying. I've only encountered similar behaviour when I've run it with significantly more threads than my laptop has cores, and I'd assumed it was from individual threads crashing. In theory, once a thread emits Thread x processed y lines it should be done, and the script shouldn't even start writing the output until all threads are done. But I do think a zombie process is the likeliest reason for a hang.

eldang commented 3 years ago

@dabreegster Today I have learned a lot about the intricacies of Python child processes. I've added a few different safeguards:

  1. Child processes are now spawned as daemons, which should in theory get cleaned up when sys.exit() is called at the end of a run.
  2. Once the parent process has finished collecting all the child processes' output, it explicitly issues SIGTERM to each, and then calls each one's close() method.
  3. The parent process explicitly checks that it's retrieved as many rows of output as it should get.

1 & 2 are theoretically redundant with each other, but I haven't noticed any detrimental effects from having both.

3 addresses a failure mode that I've only actually encountered when running ludicrous numbers of threads for testing performance under weird conditions: it is possible for the input queue to be marked as complete before all child processes have finished putting their output into the output queue. But as long as that's theoretically possible, I'd rather have an explicit check for it.
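
Putting those pieces together, here's a minimal sketch of the pattern; this is not the repo's actual code, and the worker and queue names are made up:

import multiprocessing as mp


def worker(in_q, out_q):
    # Pull rows until the sentinel arrives, pushing results to the output queue.
    while True:
        row = in_q.get()
        if row is None:
            break
        out_q.put(row)  # stand-in for the real elevation lookup


def run_parallel(rows, n_threads):
    in_q, out_q = mp.Queue(), mp.Queue()
    for row in rows:
        in_q.put(row)
    for _ in range(n_threads):
        in_q.put(None)  # one sentinel per child

    # Safeguard 1: daemon children get cleaned up when the parent exits.
    workers = [mp.Process(target=worker, args=(in_q, out_q), daemon=True)
               for _ in range(n_threads)]
    for p in workers:
        p.start()

    # Safeguard 3: insist on exactly one output row per input row.
    results = [out_q.get() for _ in range(len(rows))]

    # Safeguard 2: explicit teardown of each child.
    for p in workers:
        p.terminate()  # SIGTERM on POSIX
        p.join()       # make sure it has exited...
        p.close()      # ...before releasing its resources (Python 3.7+)
    return results


if __name__ == "__main__":
    print(run_parallel(list(range(10)), n_threads=2))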

I'm looking into the memory usage now.

eldang commented 3 years ago

Aha, I see the memory problem. Each child raster process loads a full copy of the TIFF into memory. Unlike for vector data, windowed raster loads are relatively straightforward, so I'll implement that.
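
For context, a windowed read is roughly this simple for rasters; here's a sketch assuming the rasterio library (the repo may use the GDAL bindings directly), reusing the downtown "Area covered" bounds from the log above as example values:

import rasterio
from rasterio.windows import from_bounds

# Example bounds in (left, bottom, right, top) order, taken from the
# downtown log earlier in this thread.
bounds = (-122.3803598614818, 47.59760147467252,
          -122.30276231424337, 47.63011349397687)

with rasterio.open("data/kc_2016_lidar.tif") as src:
    window = from_bounds(*bounds, transform=src.transform)
    elevations = src.read(1, window=window)  # only these pixels hit RAM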

eldang commented 3 years ago

OK! Now any input raster gets subsetted in the way I already had set up for SRTM (a rough sketch follows the list below). I should have thought to do this in the first place, because when the input file's bounds are a small subset of the data source's, like elevation_montlake against the Seattle LIDAR file, this gets us two benefits:

  1. Each child process uses far less RAM. In that specific example, each process uses < 100MB, whereas it was > 5GB before I made this change.
  2. The child processes are correspondingly faster to initialise.
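
A rough sketch of that subsetting step, again assuming rasterio and hypothetical names; the temp-file naming mirrors the data/kc_2016_lidar.tif_..._temp.tif file that shows up in the log below:

import time

import rasterio
from rasterio.windows import from_bounds


def write_subset(src_path, bounds):
    """Crop src_path to bounds and write a small temp GeoTIFF for the children."""
    with rasterio.open(src_path) as src:
        window = from_bounds(*bounds, transform=src.transform)
        data = src.read(window=window)
        profile = src.profile.copy()
        profile.update(
            height=data.shape[1],
            width=data.shape[2],
            transform=src.window_transform(window),
        )
        temp_path = f"{src_path}_{time.time()}_temp.tif"
        with rasterio.open(temp_path, "w", **profile) as dst:
            dst.write(data)
    return temp_path

Each child then opens the much smaller temp file instead of its own multi-GB copy of the source TIFF.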

I think I've now addressed all the performance and reliability issues, so please test away and let me know how it goes.

dabreegster commented 3 years ago

The raster subsetting is awesome! Now with n=1, montlake is down to 2.7s and downtown to 8.8s. With n=2, I'm getting a crash:

20210327 04:04 INFO:    Spawning 2 threads
Thread 1 processed 388 lines
Thread 0 processed 357 lines
20210327 04:04 INFO:    Removing temp data file data/kc_2016_lidar.tif_1616817891.0654712_temp.tif
Traceback (most recent call last):
  File "main.py", line 114, in <module>
    main()
  File "/usr/local/lib/python3.6/dist-packages/click/core.py", line 722, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/click/core.py", line 697, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.6/dist-packages/click/core.py", line 895, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.6/dist-packages/click/core.py", line 535, in invoke
    return callback(*args, **kwargs)
  File "main.py", line 85, in main
    infile.tag_elevations(d, outfile, n_threads)
  File "/elevation/files.py", line 99, in tag_elevations
    vals: List[ElevationStats] = d.tag_multiline(self.__paths, n_threads)
  File "/elevation/data.py", line 357, in tag_multiline
    workers[i].close()
AttributeError: 'Process' object has no attribute 'close'

eldang commented 3 years ago

How odd.... I'm a little worried that this might be some quirk of platform-specific variants of the multiprocessing interface, but before going down that rabbit hole we should see if I've made a programming error. Can you try this with a couple of other values of n, and if it consistently reproduces please run it with --log=DEBUG and post the whole console output? In no hurry - I probably won't attempt a fix before Monday.
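
One note for what it's worth: the traceback's paths show Python 3.6, and multiprocessing.Process.close() was only added in Python 3.7, which would produce exactly this AttributeError. A guard along these lines (reusing the hypothetical names from the sketch above) would sidestep it:

import sys

for p in workers:
    p.terminate()
    p.join()
    # Process.close() is new in Python 3.7; skip it on older interpreters.
    if sys.version_info >= (3, 7):
        p.close()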

dabreegster commented 3 years ago

It's consistently happening. This is with n=4: log.txt. I'll try commenting out some of the Python code to see what happens.

Also, with n=1, the biggest input, huge_seattle, only takes 109s for lidar. That's awesome -- before the raster windowing stuff, I think it was around 7 or 8 minutes.

dabreegster commented 3 years ago

Aside from the 2 comments I left, this LGTM! Results for huge_seattle lidar:

n=1: 108s
n=2: 72s
n=3: 66s
n=4: 50s
n=5: 41s
n=6: 40s
n=8: 48s, and almost out of memory

Even the n=1 time is totally acceptable for my uses of this tool, so awesome! Windowing really made a difference.

Next up for me: finally getting SRTM to work in Docker. I found https://hub.docker.com/r/osgeo/gdal/dockerfile, which seems to set up both proj and gdal using a more modern Ubuntu -- I'm going to try basing off of that.