UDST / pandana

Pandas Network Analysis by UrbanSim: fast accessibility metrics and shortest paths, using contraction hierarchies :world_map:
http://udst.github.io/pandana
GNU Affero General Public License v3.0

`precompute` memory consumption #107

Closed knaaptime closed 1 month ago

knaaptime commented 5 years ago

In the past, i've been able to create a pandana network and precompute moderately-sized queries on a laptop (e.g. the linked example precomputes 8000m on an osm network covering the MD-DC-VA MSA).

Using the current version, net.precompute() is consuming tons of memory, often eating up everything on the system. For example, with a network slightly larger than Denver county, the following will eat up all the memory on a linux box with 64gb RAM and crash the process. The same also happens on my macbook.

import osmnet
import pandana as pdna

bbox = (-105.20368772, 39.54191854, -104.50619504, 39.98674731)
net = osmnet.network_from_bbox(bbox=bbox)
network = pdna.Network(net[0]["x"],
                       net[0]["y"],
                       net[1]["from"],
                       net[1]["to"],
                       net[1][["distance"]])
network.precompute(8000)

On the same pdna.Network, calling precompute(5000) consumes 40 GB of RAM.

If I don't precompute, I'm able to perform the accessibility queries with hardly any resource consumption (albeit much more slowly, of course).

Any idea what could be happening?

Environment

json        : 2.0.9
numpy       : 1.16.2
pandana     : 0.4.1
osmnet      : 0.1.5
pandas      : 0.24.2
compiler    : GCC 7.3.0
system      : Linux
release     : 4.18.0-16-generic
machine     : x86_64
processor   : x86_64
CPU cores   : 12
interpreter : 64bit

fscottfoti commented 5 years ago

I can't imagine why precompute would be any different. I don't think that code has been touched in ages. The only thing I can think of is to set twoway=False and see how much of a difference that makes?

knaaptime commented 5 years ago

🤷‍♂️ that's what i figured, and couldn't see any reason things would be different now. but i can confirm this behavior using the code above in a new conda environment with pandana from the udst channel

unfortunately not seeing any change with twoway=False

svx3 commented 5 years ago

I'm seeing the same issue. I can run an aggregate over the same distance, and while it does take a long time (30 mins), it does complete without killing the kernel.

knaaptime commented 5 years ago

sorry for the circular references, but #104 isn't to blame for this, because i can reproduce using the pre-compiled versions from pip/anaconda

d-wasserman commented 5 years ago

To add on to this, I am also running into this issue with pre-compute.

federicofernandez commented 4 years ago

I did a careful analysis of pandana's memory consumption in the precompute step. The conclusions so far are that the memory usage is in line with the data structures that we are storing in memory.

I did my tests with a network with 685K nodes (and around 1M edges). Memory consumption (interpreting this graph as directed) is around 7 to 8 GB in the precompute phase.

On the other hand, doing some math on the data structures, we can see that the precompute method basically creates a collection of std::vectors where each value is another std::vector of pairs, each specifying a target node (as an unsigned int) and a float representing the distance. That's the dms member of the Accessibility class.
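To make the structure described above concrete, here is a toy model in Python terms (the tiny network and distances are made up for illustration): for each source node, a list of (target node id, distance) pairs covering every node reachable within the precompute horizon. In the C++ code these are nested std::vectors of (unsigned int, float).

```python
# Toy stand-in for the dms member: one row per source node, each row
# holding (target_node_id, distance) pairs reachable within the horizon.
dms = [
    [(1, 120.0), (2, 340.5)],   # nodes reachable from node 0
    [(0, 120.0)],               # nodes reachable from node 1
    [(0, 340.5)],               # nodes reachable from node 2
]

# Total stored elements = sum of reachable-node counts across all sources.
total_pairs = sum(len(row) for row in dms)
print(total_pairs)
```

The memory cost is therefore driven by the total number of (source, reachable target) pairs, not by the node count alone.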

In this example, the average number of reachable nodes per source is around 1,179, given that the data structure holds around 808 million elements. Each element is a (uint, float) pair, which on the tested architecture means 4 + 4 bytes.

In conclusion, the size of the created data structure is in theory 808M * 8 bytes = 6.4 GB.

That is very close to the observed 7 GB, and given the alignment issues and space occupied by the std::vectors themselves, it’s a very reasonable memory consumption.

So the conclusion is that I'm not seeing any memory explosion in the example that I'm following (which is a very big one). It is just using a reasonable amount of memory given the input size.
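The arithmetic above can be checked back-of-envelope; the 685K node count and the 1,179 average reachable nodes are the figures reported in this thread:

```python
# Back-of-envelope check of the memory estimate from the analysis above.
n_nodes = 685_000        # nodes in the test network
avg_reachable = 1179     # average nodes reachable within the horizon
pair_bytes = 4 + 4       # one (uint32 node id, float32 distance) pair

n_elements = n_nodes * avg_reachable      # total stored pairs (~808 million)
total_bytes = n_elements * pair_bytes     # raw payload, no std::vector overhead

print(f"{n_elements/1e6:.0f}M elements, {total_bytes/1e9:.2f} GB")
```

This yields roughly 808 million elements and about 6.5 GB of raw payload, in line with the ~7 GB observed once std::vector bookkeeping and alignment are added.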

d-wasserman commented 4 years ago

I guess from a user perspective this seems like a new issue in the library, but thinking on it, it might just be non-linear growth in memory consumption at large bandwidths (simply because of how many more nodes become reachable over larger distances).
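That non-linear growth can be sketched with a rough area argument (an assumption, not something measured in this thread): if reachable nodes scale with the area of the search disk, cost grows with the square of the horizon distance.

```python
# Rough sketch: assume reachable-node count scales with the ~r^2 area
# of the search disk, so memory scales with the square of the horizon.
def relative_cost(r, r0=5000):
    """Approximate memory multiplier going from horizon r0 to horizon r."""
    return (r / r0) ** 2

# Going from a 5000 m to an 8000 m precompute horizon:
print(relative_cost(8000))
```

Under this assumption, moving from 5000 m to 8000 m multiplies the stored pairs by about 2.56, which is at least consistent with precompute(5000) fitting in 40 GB while precompute(8000) exhausts a 64 GB machine.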

Thanks for taking the time to look at this. If I get a chance to experiment with this more I will report back.

knaaptime commented 4 years ago

agreed, thank you for digging into this