UDST / pandana

Pandas Network Analysis by UrbanSim: fast accessibility metrics and shortest paths, using contraction hierarchies :world_map:
http://udst.github.io/pandana
GNU Affero General Public License v3.0
385 stars 84 forks source link

[v0.6.dev0] Vectorized shortest-path routes #142

Closed smmaurer closed 3 years ago

smmaurer commented 4 years ago

This PR adds a function for getting the shortest-path routes between many origin–destination pairs at once. It's vectorized and multi-threaded for optimal performance.

This is analogous to the work in PR #136 that vectorized the distance calculations. Thanks to @pavyedav for helpful discussions, code review, and testing.

Performance

With a small-scale network, the vectorized function is about 150x faster than looping through the existing Python function that takes one O–D pair. See shortest_path_example.py for the benchmark code.

This is great, but it's not as fast as the vectorized distance calculations in #136, which had a 1200x improvement. My best guess is that this is because there are node id lookups that still use a Python loop (network.py#L237-L238). So we should try to vectorize these with NumPy before merging -- SEE ADDITIONAL DISCUSSION IN COMMENTS.

The C++ code is also not quite as efficient -- it doesn't allocate all the memory in advance, because we don't know how many nodes each route will involve. I'm guessing this is not as big a deal, though. accessibility.cpp#L106-L120

Discussion

Note that Pandana expresses routes in terms of a sequence of node id's, not edge id's. @pavyedav and I dug into this a bit and it seems to be baked in pretty deeply in the C++. So if you're working with data where the edge id's are important, the best approach is probably to set up a secondary index in Python that identifies the edges by their corresponding nodes, and use that to translate the Pandana route into a sequence of edges.

To do:

chourmo commented 3 years ago

Could the PR add an option to retrieve both path and distance simultaneously ?

Thanks

smmaurer commented 3 years ago

Hi @chourmo, I agree that it would be useful to retrieve both path and distance simultaneously. I'd also like to provide an option for getting the paths as edges rather than nodes, and to associate edge lengths and other metadata.

But at the moment, there's not actually any performance advantage to retrieving path and distance simultaneously rather than one after the other. The distances in particular are super fast. So I'm planning to work on the additional functionality in a separate PR after this one!

smmaurer commented 3 years ago

Following up about the performance of this vs. the vectorized distance calculations. As mentioned in the original writeup, the vectorized distances are about 8x faster than the vectorized paths.

One of the likely reasons for this is that the C++ code uses its own node ids, which need to be mapped back to the Python-side ids when the paths are returned. After digging into this a bit I can't find any easy speed-ups on the Python side, though. Flattening the list of paths and mapping all the nodes at once actually performs similarly to the list comprehension mapping.

In the big picture the performance is still very good -- on my machine the additional overhead of generating paths rather than distances works out to about 1 second per 3 million path nodes. So in the future we might want to explore cutting this down by doing the mapping on the C++ side, but it's probably not a high priority for now.

I'm going to add some unit tests and then get this merged and released.

coveralls commented 3 years ago

Coverage Status

Coverage increased (+0.2%) to 89.073% when pulling 87e79a3597755fff79daf5d392e527cf40d72fcc on vectorized-paths into 4dc265a5a343691b9660d8c29fe319f61af029e4 on dev.

coveralls commented 3 years ago

Coverage Status

Coverage increased (+0.2%) to 89.073% when pulling 87e79a3597755fff79daf5d392e527cf40d72fcc on vectorized-paths into 4dc265a5a343691b9660d8c29fe319f61af029e4 on dev.