gboeing / osmnx

OSMnx is a Python package to easily download, model, analyze, and visualize street networks and other geospatial features from OpenStreetMap.
https://osmnx.readthedocs.io
MIT License

Using parallelism to build graphs #534

Closed · zacharyDez closed this issue 4 years ago

zacharyDez commented 4 years ago

Hello!

I am working on a project for my parallel programming class and stumbled upon your library. It has been really easy to work with and expand. I thought that I should get in touch and see if you had already worked on or discussed optimization with parallel programming.

Problem related to the feature proposal

Building the graph from the OSM queries can take a significant amount of time. Optimizing with some parallel programming could reduce the time necessary to build the graph.

Proposed solution

I wanted to get in touch before starting to work on a specific implementation.

Describe alternatives you've considered

I haven't really settled on a specific solution yet. I have experience working with MPI and mpi4py.

The parallelization could be relatively trivial: we could simply divide the processing of each node and its related edges among worker processes in the _create_graph function, as sketched below.
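
A minimal sketch of the idea, using only the standard library. The `_parse_element` helper below is hypothetical: it stands in for the per-element work that `_create_graph` does and is not part of the OSMnx API.

```python
from multiprocessing import Pool

def _parse_element(element):
    """Hypothetical stand-in: convert one OSM element into nodes/edges."""
    if element["type"] == "node":
        return [(element["id"], element["lat"], element["lon"])], []
    if element["type"] == "way":
        nds = element.get("nodes", [])
        return [], list(zip(nds[:-1], nds[1:]))
    return [], []

def parse_elements_parallel(elements, processes=4):
    """Fan the per-element parsing out to workers, then merge the results."""
    with Pool(processes=processes) as pool:
        results = pool.map(_parse_element, elements, chunksize=1000)
    nodes, edges = [], []
    for n, e in results:
        nodes.extend(n)
        edges.extend(e)
    return nodes, edges
```

Note that the merge step at the end is serial, which is part of why the overall speed-up may be smaller than expected.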

Additional context

Hope this follows your conventions! Let me know if it could be of interest.

gboeing commented 4 years ago

Hi @zacharyDez, thanks for the suggestion. A few questions:

Building the graph from the OSM queries can take a significant amount of time.

What specific parts of the process would lend themselves to parallelization?

MPI and mpi4py

Why these packages instead of the standard threading or multiprocessing modules? My impression was that MPI is for systems with distributed memory. Normal multiprocessing seems like a more general solution, and it wouldn't introduce another dependency beyond the standard library.

The parallelization could be relatively trivial in the sense that we could simply divide the processing of each node and related edges among the different worker processes of function _create_graph

If you profile the code, I think you'll find that this accounts for a trivial amount of the processing time when constructing graphs. In my previous tests, the vast majority of the run time was consumed by downloading data, doing the point-in-polygon spatial operations, and simplifying the graph topology. These processes may not lend themselves well to parallelization.

In general, I'm open to the idea of parallelization. But I'd first want to see some profiling that identifies the specific time bottlenecks currently, how they'd be improved with parallelization, and if the benefits would be substantial enough to justify additional code complexity. I'm a bit skeptical of that right now.
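
For reference, here is one quick way to produce such a profile; graph_from_place is the public OSMnx entry point, and the place name is just an example:

```python
import cProfile
import pstats

import osmnx as ox

# Profile a full graph build and save the stats to disk.
cProfile.run(
    'ox.graph_from_place("Piedmont, California, USA", network_type="drive")',
    "graph_profile.stats",
)

# Show the 15 most expensive calls by cumulative time.
pstats.Stats("graph_profile.stats").sort_stats("cumulative").print_stats(15)
```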

zacharyDez commented 4 years ago

I'm currently working on some tests using the multiprocessing module from the standard library, and I will run comparative profiling between the two approaches. Do you have any particular benchmark ideas?
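
In case it helps the discussion, here is the rough benchmark shape I have in mind: time the stock builder on places of increasing size, then time the experimental parallel variant on the same places. build_graph_parallel below is hypothetical and stands in for whatever implementation is being tested.

```python
import time

import osmnx as ox

# Places of increasing size, as a simple scaling benchmark.
PLACES = [
    "Piedmont, California, USA",
    "Berkeley, California, USA",
    "San Francisco, California, USA",
]

for place in PLACES:
    start = time.perf_counter()
    ox.graph_from_place(place, network_type="drive")
    print(f"{place}: serial {time.perf_counter() - start:.1f}s")

    # Hypothetical parallel variant to compare against:
    # start = time.perf_counter()
    # build_graph_parallel(place, network_type="drive")
    # print(f"{place}: parallel {time.perf_counter() - start:.1f}s")
```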

On another subject, I also refactored the code used to create the graph, mainly extracting the logic that evaluates whether the path data is one-way and whether the data must be reversed. My changes can be seen here: https://github.com/zacharyDez/osmnx/blob/para_create_graph/osmnx/graph.py. The Travis CI pipeline is green!

I'm not sure whether you would prefer a separate PR for the refactor or would rather wait until I finish the parallelization profiling.

gboeing commented 4 years ago

For the second proposal, why don't you open a new issue to lay out its motivation and document how it improves things?

d3netxer commented 4 years ago

This is an interesting thread. I definitely think parallelization might help speed up simplifying the topology of large graphs. Looking at the simplification module, I think all of its functions could be run in parallel somehow; a rough sketch follows.
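
As a rough sketch only (this is not OSMnx's actual implementation): the per-path merging step looks embarrassingly parallel, since each path of interstitial nodes can be collapsed independently. merge_path below is a hypothetical stand-in for the attribute and geometry merging the simplification module performs for each path.

```python
from concurrent.futures import ProcessPoolExecutor

def merge_path(path):
    """Hypothetical stand-in: collapse one path of interstitial nodes."""
    return (path[0], path[-1], {"node_count": len(path)})

def simplify_paths_parallel(paths, workers=4):
    """Merge each candidate path in a separate worker process."""
    with ProcessPoolExecutor(max_workers=workers) as ex:
        return list(ex.map(merge_path, paths, chunksize=100))
```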

gboeing commented 4 years ago

I definitely think parallelization might help with speeding up simplifying the graph topology of large graphs.

@d3netxer maybe! The question is how much, relative to the additional code and complexity that I would have to maintain long-term. As I mentioned earlier, before reviewing a PR it would be important to 1) not introduce additional dependencies and 2) profile the code and verify how much speed-up parallelization offers across a range of use cases:

In general, I'm open to the idea of parallelization. But I'd first want to see some profiling that identifies the specific time bottlenecks currently, how they'd be improved with parallelization, and if the benefits would be substantial enough to justify additional code complexity.

So, I'm open to the idea but would need to see some numbers first. For example, if parallelization offers a 10% speed-up to the graph_from_whatever() functions, that's not very compelling in the grand scheme of things. Conversely, if it introduces just 20 lines of code and offers a 30%+ speed-up, that is pretty compelling.

zacharyDez commented 4 years ago

I had to write a report on my attempts (in French), and I can't spend too much time explaining the results or translating. The upshot: because the graph is a shared resource, the synchronization overhead across the multiple processes actually slows down graph creation. I tried several different methods, all using the multiprocessing module from the standard library.
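
That outcome matches what shared state typically costs in multiprocessing. A minimal illustration, assuming workers must write into one shared structure via a multiprocessing.Manager proxy: every write becomes an inter-process round trip, so the workers mostly wait on the proxy instead of computing.

```python
import time
from multiprocessing import Manager, Pool

def add_nodes(args):
    """Write a chunk of fake nodes into the shared dict (IPC per write)."""
    shared_nodes, ids = args
    for i in ids:
        shared_nodes[i] = {"x": 0.0, "y": 0.0}

if __name__ == "__main__":
    # Parallel writes into a Manager-backed dict: heavy synchronization.
    with Manager() as mgr, Pool(4) as pool:
        shared_nodes = mgr.dict()
        chunks = [(shared_nodes, range(i, i + 5000)) for i in range(0, 20000, 5000)]
        start = time.perf_counter()
        pool.map(add_nodes, chunks)
        print(f"shared dict: {time.perf_counter() - start:.2f}s")

    # Serial baseline: a plain dict with no synchronization at all.
    start = time.perf_counter()
    nodes = {i: {"x": 0.0, "y": 0.0} for i in range(20000)}
    print(f"plain dict:  {time.perf_counter() - start:.2f}s")
```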