adamfranco / curvature

Find roads that are the most curvy or twisty based on Open Street Map (OSM) data.
http://roadcurvature.com/

Major memory footprint optimisation #19

Fonsan opened this issue 8 years ago

Fonsan commented 8 years ago

When looking at OSM data, most ways generally do not actually have the 'ref' or 'name' tag.

Of the 7,300,974 ways in the northeast US (http://download.geofabrik.de/north-america/us-northeast.html), only 1,494,050 had a "route connector". This means that a rewrite of the first stage of joining ways would only need to hold a subset of the ways in memory at any given time; this alone could bring memory usage down by at least 70%. One could improve further by doing two passes over the input file: a first pass that builds up the route db storing only the osm ids, and a second pass that emits joined routes as they are completed while iterating. A rough sketch of the two-pass idea follows below.
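
A minimal sketch of that two-pass idea, assuming a hypothetical `parse_ways(path)` helper that streams `(osm_id, tags, node_refs)` tuples from the PBF (not an actual curvature or imposm API):

```python
from collections import defaultdict

def first_pass(path):
    """Pass 1: group way ids by their join key, storing only the ids."""
    routes = defaultdict(list)              # join key -> [way osm_id, ...]
    for osm_id, tags, node_refs in parse_ways(path):   # hypothetical streaming helper
        key = tags.get('ref') or tags.get('name')
        if key is None:
            continue                        # most ways have neither tag
        routes[key].append(osm_id)
    return routes

def second_pass(path, routes):
    """Pass 2: re-read the ways, emitting each joined route as soon as the
    last of its member ways has been seen, so its memory is released early."""
    pending = {key: set(ids) for key, ids in routes.items()}
    buffered = defaultdict(list)            # join key -> ways seen so far
    for osm_id, tags, node_refs in parse_ways(path):
        key = tags.get('ref') or tags.get('name')
        if key not in pending:
            continue
        buffered[key].append((osm_id, tags, node_refs))
        pending[key].discard(osm_id)
        if not pending[key]:                # route complete: emit and free the memory
            del pending[key]
            yield key, buffered.pop(key)
```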

The NonSplittingWayCollector would not need to build any db while running.

When I ran the file above on OS X, curvature allocated somewhere north of 35 GB of virtual memory and got stuck in GC and swapping.

I would be happy to give it a shot, and perhaps we could follow the "step" design in the refactor.

adamfranco commented 8 years ago

Even with the new code doing only the collecting step, I ran into memory issues processing the Japan file (which incidentally doesn't have any sub-regions available) using my CurvatureBuilder pipeline:

5/25/16 12:00:01.558 AM CurvatureBuilder-Process[43863]: Collecting ways for asia/japan ...
5/25/16 4:32:35.253 AM CurvatureBuilder-Process[43863]: Ways collected for asia/japan (1.99 GB), processing curvature...
5/25/16 4:58:02.102 AM CurvatureBuilder-Process[43863]: Processed curvature for asia/japan (2.66 GB).
5/25/16 4:58:02.102 AM CurvatureBuilder-Process[43863]: Dividing data set for asia/japan...
5/25/16 5:00:08.591 AM CurvatureBuilder-Process[43863]: Sorting 10000-plus asia/japan (172.2 MB).
5/25/16 5:00:20.106 AM CurvatureBuilder-Process[43863]: Skipping sort for asia/japan 1000-10000. 976.9 MB data size is larger than our limit of 600 MB.
5/25/16 5:00:34.491 AM CurvatureBuilder-Process[43863]: Skipping sort for asia/japan 300-1000. 1.54 GB data size is larger than our limit of 600 MB.
5/25/16 5:00:57.919 AM CurvatureBuilder-Process[43863]: Sorting complete for asia/japan (2.66 GB). Generating output...
5/25/16 5:09:38.603 AM CurvatureBuilder-Process[43863]: KML output for asia/japan complete.

This was on a quad-core iMac with 8 GB of RAM. As you can see from the timing, the streamed post-processing and output takes about 35 minutes: not nothing, but chugging along. The collecting step takes about four and a half hours, mostly because the machine is heavily swapping after the first few minutes.

One thing I just realized thinking about this is that we don't actually need the coords until right before the collections are output. The joining is done on ref-id and the coords aren't actually used in the joining process. One optimization might be to collect a batch of collections (either all un-joinable ways, or up to some number), then add their coords and output them before moving on to other ways. Similarly, joining could all be done first, then the coords loaded, then the coords attached to the ways in a collection and the collection output (freeing its memory) before coords are attached to the next collection.

adamfranco commented 8 years ago

Another alternative is that the coords could be added as a second step. It would still need to load many full collections into memory, but that could make the batching simpler (see the sketch after the list):

  1. Read in collections and add the refs needed to a hash until the size of the hash grows to a configurable limit (finishing off the last collection completely).
  2. Pause loading collections and read the PBF for coords, loading the coords into the hash.
  3. Loop through the loaded collections and attach coords to the ways, popping each collection off the list and outputting it right away so that the memory used for the coords doesn't double as they are copied into the ways.
  4. Continue on to reading the next batch of collections.
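
A rough sketch of what that loop could look like; `iter_collections(path)`, `read_coords(path, node_ids)` and `output(collection)` are placeholders for whatever the real reading and writing code ends up being, not existing curvature functions:

```python
MAX_REFS = 5_000_000    # configurable limit on buffered node refs (step 1)

def process_in_batches(path):
    batch, needed = [], set()
    for collection in iter_collections(path):     # step 1: read joined collections
        batch.append(collection)
        for way in collection['ways']:
            needed.update(way['refs'])
        if len(needed) >= MAX_REFS:                # the hash hit the limit
            _flush(path, batch, needed)            # steps 2 and 3
            batch, needed = [], set()              # step 4: start the next batch
    if batch:
        _flush(path, batch, needed)

def _flush(path, batch, needed):
    coords = read_coords(path, needed)             # step 2: one PBF pass for coords
    while batch:                                   # step 3: attach and output right away
        collection = batch.pop()
        for way in collection['ways']:
            way['coords'] = [coords[ref] for ref in way['refs']]
        output(collection)                         # frees the collection's memory
```
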
Fonsan commented 8 years ago

I think we need to define a way of measuring how much data is loaded into memory before we proceed; otherwise we risk making the wrong assumptions about which parts of the data are big and have dependencies.

I think we should keep track of the memory used and the maximum number of ways and nodes held at any one point.

The sharding technique you mention above has problems when we join on name, since we do not know if we have loaded all ways for that name. Also, if we do multiple passes, we need to keep a list of previously emitted roads.

My current thoughts are:

* For all roads, store node dependencies in such a way that they are no longer referenced once all of their ways have been emitted in step 3. Also store all link info (name, tag['refs']).
* For all coords, save only the ones referenced above.
* For all ways, emit joined ways and release their nodes.
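
A tiny sketch of the node-dependency bookkeeping in the first point: count how many not-yet-emitted ways reference each node, and drop a node's coordinate once that count hits zero. All names here are illustrative only:

```python
from collections import Counter

class NodeDependencyTracker:
    def __init__(self):
        self.pending = Counter()    # node id -> number of un-emitted ways using it
        self.coords = {}            # node id -> (lon, lat), only for referenced nodes

    def register_way(self, node_refs):
        self.pending.update(node_refs)

    def store_coord(self, node_id, lon, lat):
        if self.pending[node_id]:             # save only coords referenced by some way
            self.coords[node_id] = (lon, lat)

    def release_way(self, node_refs):
        """Call after a joined way has been emitted in step 3."""
        for ref in node_refs:
            self.pending[ref] -= 1
            if self.pending[ref] <= 0:        # no remaining ways need this node
                del self.pending[ref]
                self.coords.pop(ref, None)
```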

My approach would break for larger datasets as it assumes that all nodes fit in memory.

Your approach does have the advantage that the only data structure growing with the dataset is the list of already-emitted ways, while mine grows with both nodes and ways. However, it might require an impractical number of passes: say you can only hold 1/10th of the ways in memory, then 30 reads of the Japan file would be necessary.

All collectors should have the same output, and we might choose to include more than one depending on the use case, so that the end user can pick an algorithm based on available hardware.

Fonsan commented 8 years ago

I have worked on this problem a bit more and I think it is prudent to split it up into two separate parts: attaching coordinates to ways (a coordinate joiner), and joining the coordinate-laden ways into roads.

I have started solving the first one by using the lower-level APIs in the imposm parser to interact directly with the chunks inside a PBF file. It turns out that all coordinates are stored sequentially and emitted in chunks of strictly sequential ids.

Fonsan commented 8 years ago

I have put quite a lot of work into creating a coordinate joiner that exploits the fact that coordinate chunks are sequential, essentially iterating over the way chunks and keeping an LRU cache of the recently used node chunks in memory. Multi-threading it turned out to be a pain, but in single-threaded mode it works: its memory usage stabilises and does not grow while iterating over the ways.
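
For reference, the single-threaded version boils down to roughly this, where `load_node_chunk(path, chunk_index)` stands in for the real chunk-reading code (it is not an imposm API) and the chunk size is an assumption:

```python
from collections import OrderedDict

CHUNK_SIZE = 8000           # assumed number of sequential node ids per coordinate chunk
MAX_CACHED_CHUNKS = 64      # bounds memory use regardless of input size

class ChunkedCoordLookup:
    def __init__(self, path):
        self.path = path
        self.cache = OrderedDict()   # chunk index -> {node id: (lon, lat)}

    def get(self, node_id):
        chunk_index = node_id // CHUNK_SIZE
        if chunk_index in self.cache:
            self.cache.move_to_end(chunk_index)        # mark as most recently used
        else:
            self.cache[chunk_index] = load_node_chunk(self.path, chunk_index)
            if len(self.cache) > MAX_CACHED_CHUNKS:
                self.cache.popitem(last=False)         # evict the least recently used chunk
        return self.cache[chunk_index][node_id]
```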

Then I stumbled upon the osmconvert tool (https://wiki.openstreetmap.org/wiki/Osmconvert), which implements the same strategies I chose, and then some. It is also written in C and the guy really seems to know what he is doing.

This means we could create south-japan.poly and north-japan.poly ourselves and split Japan into smaller parts the same way Geofabrik does.

I have yet to test it; I will post my findings later.

Fonsan commented 8 years ago

I have successfully cut out more than half of Japan:

```
./osmconvert japan-160524.osm.pbf -b="128,30,144,40" --complete-ways --out-pbf | pv > j3.pbf

854M    pbfs/j3.pbf
1.0G    pbfs/japan-160524.osm.pbf
```

And osmconvert never used more than 1.5 GB during its 5-minute run, which is quite impressive. I really think sizing the input PBFs to match the hardware that curvature is running on is the way to go.

Here is my approach in a broken state https://gist.github.com/Fonsan/4a1d5e7ec2ad57bfed0c72cf51ea93c7

I still think we should extract the coordinate joiner out into a separate step, one that mostly keeps the naive gobble-everything-into-memory approach.

The second step would be joining roads, which would be trivial to write and easy to optimise if needed.
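
Something along these lines, assuming each way already has its coords attached and carries a join key (ref or name); it ignores reversed ways and chain merging and is only meant to show how small the skeleton could be:

```python
from collections import defaultdict

def join_ways(ways):
    """Greedily chain ways that share a join key and meet at an endpoint."""
    by_key = defaultdict(list)
    for way in ways:
        by_key[way['key']].append(way)

    for key, group in by_key.items():
        chains = []
        for way in group:
            for chain in chains:
                if chain[-1]['refs'][-1] == way['refs'][0]:    # chain end meets way start
                    chain.append(way)
                    break
                if way['refs'][-1] == chain[0]['refs'][0]:     # way end meets chain start
                    chain.insert(0, way)
                    break
            else:
                chains.append([way])                            # start a new chain
        for chain in chains:
            yield key, chain
```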

adamfranco commented 8 years ago

Hi Eric, good find on osmconvert; that takes some of the pressure off the absolute performance of curvature.

I wonder if there is any benefit in terms of which comes first, joining or adding coordinates? Joining doesn't necessarily need the coordinates, and the coordinate-loading doesn't need the ways to be joined, so there shouldn't be an algorithmic problem with either order.

The only thing I can think of is that both the way reading and the coordinate reading involve the PBF, whereas the joining doesn't necessarily need to know about the input format.

Fonsan commented 8 years ago

> The only thing I can think of is that both the way reading and the coordinate reading involve the PBF, whereas the joining doesn't necessarily need to know about the input format.

Yep, and the use case is general enough to warrant a separate project, so that others could benefit from that library without being swamped by the full curvature stack.