eldang / elevation_lookups

Takes an input file of paths described as series of points, outputs a file of data about the elevation changes along those paths.
Apache License 2.0

Pruned repeated lookups for first & last points in a line #3

Closed eldang closed 3 years ago

eldang commented 3 years ago

This yields a ~20% speed improvement over the naive implementation I timed first.

dabreegster commented 3 years ago

I haven't been following along with the performance work, but two questions/ideas:

1) Do all of the queries like https://github.com/eldang/elevation/blob/f557f13031a3c935731c7d6642fad3a39e28193d/data.py#L226 internally use a quadtree or some other spatial index?
2) Assuming the contour dataset fits in memory and we're not IO-bound, could all of the road segment queries be calculated in parallel?

eldang commented 3 years ago

@dabreegster I'll give you a quick tour of the data flow as a way to answer those questions. I'll reference the main contour-lookups branch, though, because that reflects the current state of play and has some improvements over what's in this PR.

The DataSource initialiser reads the whole data source, and then stores all the contours which intersect with the input file's bounding box as the Geopandas GeoDataFrame self.gdf at https://github.com/eldang/elevation/blob/2087148b4fcbb7e3206369b02c29902aa716eb5a/data.py#L139-L142

Then after a little more housekeeping (field renaming, converting feet to metres) it stores a spatial index of that gdf as self.idx at https://github.com/eldang/elevation/blob/2087148b4fcbb7e3206369b02c29902aa716eb5a/data.py#L157

I leave it to Geopandas to decide what sort of index to use, but I think in practice it uses the GEOS dependency to build an STRtree, and accesses it through the compiled GEOS binary, so it's faster than interpreted Python would be.
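Roughly, that setup looks like this. This is a simplified sketch rather than the actual code in data.py; the inline contour data, field names and bounding box are placeholders:

```python
import geopandas as gpd
from shapely.geometry import LineString, box

# Stand-in for reading the full contour dataset (the real code reads a file).
contours = gpd.GeoDataFrame(
    {"elev_ft": [100, 200, 300]},
    geometry=[
        LineString([(-122.32, 47.64), (-122.29, 47.64)]),
        LineString([(-122.32, 47.65), (-122.29, 47.65)]),
        LineString([(-122.40, 47.70), (-122.38, 47.70)]),  # outside the bbox
    ],
    crs="EPSG:4326",
)

# Keep only the contours that intersect the input file's bounding box.
bbox = box(-122.33, 47.63, -122.28, 47.66)
gdf = contours[contours.intersects(bbox)].copy()

# Housekeeping comparable to data.py: convert feet to metres.
gdf["elev_m"] = gdf["elev_ft"] * 0.3048

# Accessing .sindex builds the spatial index lazily (an STRtree via the
# compiled GEOS bindings under the hood).
idx = gdf.sindex
```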

Queries against self.idx force index use, and return just a list of row references that self.gdf can then be subset by.
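A point lookup against that index then goes roughly like this. Again a sketch using the hypothetical gdf and idx from above, not the exact code in data.py:

```python
from shapely.geometry import Point

point = Point(-122.30, 47.64)  # hypothetical query point

# The index query only compares bounding boxes, so it cheaply returns
# candidate row positions; any exact intersection or distance test then
# only has to run against that small subset of the GeoDataFrame.
candidates = list(idx.intersection(point.bounds))
nearby = gdf.iloc[candidates]
```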

However, the line you called out above does turn out to have been a mistake. Querying the index for a whole line is a costly operation because in practice it causes each pair of coordinates to be checked in turn. I thought it would make the subsequent point lookups faster by letting them check a smaller set of contours, but I think any potential gain from that was merely duplicating the gains that we were already getting for free from the index. At least, I think that's why dropping that part of the logic in https://github.com/eldang/elevation/pull/4 gave us a much bigger speedup after the changes in this one.

I think parallelising this will be a good idea. I started with a serial implementation because it's easier to conserve row order between input and output files, but that's not actually going to be that hard to manage. For a file the size of the Montlake sample I've been working with, the script spends about 1 minute reading the input & data and subsetting & indexing it, followed by 3 minutes of lookups, so parallelising wouldn't necessarily be worth the complexity at that scale. But for larger input files I would expect the lookup time to scale linearly with file size while the initial file read time barely increases, so looking up n lines in parallel would quickly add up to a big speed improvement.

Do you have a preference between trying that first or building the SRTM lookup algorithm first?

dabreegster commented 3 years ago

Thanks for the explanation about how this works!

Do you have a preference between trying that first or building the SRTM lookup algorithm first?

SRTM -- correct/complete first, then fast. :)

conserve row order

I'm not familiar with the Python multiprocessing libs, but in case it's helpful, https://github.com/a-b-street/abstreet/blob/9f2737b2b572fc12143166256d771ef623c8ab56/abstutil/src/time.rs#L352 is how I've structured things like this before.

1) Create a multi-producer, single-consumer channel to communicate between threads.
2) Farm out the work to a thread pool. Each work item is an (index, request) pair, where the index comes from the original order in the input file.
3) When the work is done, push (index, response) onto the transfer half of the channel.
4) On the original thread, read the receiver half of the channel until it's done. For each (index, response) pair, just populate that result in a list. The requests can be calculated in any order; the index holds onto the order.

This keeps everything in memory, but that should be fine for our workload sizes. If we hit memory limits, I can just invoke the tool repeatedly with a batch size.

eldang commented 3 years ago

Cool, that's how I'll proceed then: SRTM as I was planning, and then parallelising once that works.

That Rust example looks similar to how Python's multiprocessing module is structured, possibly with more concise syntax. And yeah, indexing is all this should require. I think Python will be able to share one read-only copy of the data source and its index, in which case the memory impact of a parallel run won't be much different from a serial run. The things each worker will need its own copy of should be very small data structures, like lists of row indices for the individual contours. A raster lookup should be smaller still, though those may also be fast enough that parallelisation buys us less there.
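A minimal sketch of what that could look like with multiprocessing; the lookup function and input lines here are hypothetical stand-ins, not code from this repo:

```python
import multiprocessing as mp

def do_elevation_lookup(line):
    """Placeholder for the real per-line contour/raster lookup."""
    return {"line": line, "climb": 0.0, "descent": 0.0}

def lookup_line(job):
    """Wrap one lookup in an (index, result) pair so results can be put
    back into input order no matter which worker finishes first."""
    index, line = job
    return index, do_elevation_lookup(line)

if __name__ == "__main__":
    lines = ["line A", "line B", "line C"]  # stand-in for the input file rows

    # On fork-based platforms the workers inherit module-level globals
    # (e.g. the GeoDataFrame and its spatial index) without copying them
    # up front, so the big read-only data source isn't duplicated per job.
    with mp.Pool() as pool:
        # imap_unordered hands back whichever line finishes first;
        # sorting on the carried index restores the original row order.
        results = sorted(pool.imap_unordered(lookup_line, enumerate(lines)))

    ordered_output = [result for _, result in results]
```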