jbalcar opened 1 year ago
I can confirm that we've reproduced this at Wanderlog with the latest foot profile. I've attached the full logs from our (non-debug) build, and have also verified that the car profile builds fine with planet.pbf.
The command we used for this is just the standard one:
osrm-extract -p osrm-backend/profiles/foot.lua planet.osm.pbf
Our specs are:
Ubuntu 22.04 LTS, 512 GB of memory, 384 GB of swap on an NVMe drive, 48-core AMD EPYC CPU
I'm running another build to see if the same happens on Ubuntu 20.04 LTS and with an Intel CPU, just in case it's somehow platform-dependent, and will update once we have results.
Edit (2023-05-17): I've confirmed that OSRM fails on Ubuntu 20.04 LTS as well, on an Intel-based server, so this appears to be hardware-independent and likely due to some quirk in the latest planet.osm.pbf.
I can confirm the problem.
About 2 months ago, osrm-extract started crashing on planet data.
My hardware is an E5-2680 with 500 GB RAM + 400 GB swap.
osrm-extract v5.27.1 (also tried with an older version).
My current workaround is repacking the planet data with osmosis before running osrm-extract:
time osmosis --rb file="$(echo planet-*.osm.pbf)" --wb file="cleaned-planet.osm.pbf" # planet: 209 minutes
PS. Additionally, I remove a "404 country" from the planet data using osmosis, but it seems that just repacking, without any modifications, sorts out the osrm-extract crash.
Ouch, so it's a data issue? Where is the planet data coming from? Was it a fresh copy or a file which was incrementally updated over time?
It'd be super valuable if we knew the OSM way ID where this is happening, but that'd require a log statement or a run with gdb.
I download planet data weekly via BitTorrent: https://planet.osm.org/pbf/planet-latest.osm.pbf.torrent
This is fresh, full data, a ~68 GB PBF.
After osmosis, it turned into a ~59 GB PBF.
Now I've succeeded with all 3 default profiles: car, foot, bike.
@nilsnolde I've got a server with gdb stopped at the crash, by running:
gdb osrm-extract
catch throw
run -p osrm-backend/profiles/foot.lua /path/to/data/planet.osm.pbf
It'd be super valuable if we knew the OSM way ID where this is happening, but that'd require a log statement or a run with gdb.
I could use some help teasing out the OSM ID: here's the backtrace, with the failed assertion in https://github.com/Project-OSRM/osrm-backend/blob/master/src/extractor/node_based_graph_factory.cpp#L49. What GDB commands can I run to get the OSM ID in this case? The tricky part is that all the IDs/edges accessible in this scope are the post-compression IDs.
(gdb) bt
#0 0x00005555556177f5 in operator() (__closure=__closure@entry=0x7fffffffc550) at /root/osrm-backend/src/extractor/node_based_graph_factory.cpp:49
#1 0x0000555555618625 in osrm::extractor::NodeBasedGraphFactory::BuildCompressedOutputGraph (this=this@entry=0x7fffffffcbd0, edge_list=std::vector of length 2170834838, capacity 2172361565 = {...})
at /root/osrm-backend/src/extractor/node_based_graph_factory.cpp:49
#2 0x0000555555618fb0 in osrm::extractor::NodeBasedGraphFactory::NodeBasedGraphFactory (this=this@entry=0x7fffffffcbd0, scripting_environment=..., turn_restrictions=std::vector of length 0, capacity 0,
maneuver_overrides=std::vector of length 37, capacity 64 = {...}, traffic_signals=..., barriers=..., coordinates=..., osm_node_ids=..., edge_list=std::vector of length 2170834838, capacity 2172361565 = {...}, annotation_data=...)
at /root/osrm-backend/src/extractor/node_based_graph_factory.cpp:29
#3 0x000055555559818a in osrm::extractor::Extractor::run (this=this@entry=0x7fffffffda50, scripting_environment=...) at /root/osrm-backend/src/extractor/extractor.cpp:231
#4 0x000055555558bb62 in osrm::extract (config=...) at /root/osrm-backend/src/osrm/extractor.cpp:15
#5 0x000055555557bd91 in main (argc=4, argv=0x7fffffffe068) at /root/osrm-backend/src/tools/extract.cpp:192
Here are some commands that I've tried:
(gdb) print nbg_edge_id
$17 = 19531
(gdb) print compressed_output_graph.GetEdgeData(nbg_edge_id)
$13 = (osrm::util::NodeBasedEdgeData &) @0x5555655f7644: {weight = {__value = 56}, duration = {__value = 56}, distance = {__value = 7.79573011}, geometry_id = {id = 0, forward = 0}, reversed = true, flags = {forward = 1 '\001',
backward = 0 '\000', is_split = 0 '\000', roundabout = 0 '\000', circular = 0 '\000', startpoint = 1 '\001', restricted = 0 '\000', road_classification = {motorway_class = 0 '\000', link_class = 0 '\000', may_be_ignored = 0 '\000',
road_priority_class = 10 '\n', number_of_lanes = 2 '\002'}, highway_turn_classification = 0 '\000', access_turn_classification = 0 '\000'}, annotation_data = 13531483}
Oh sorry, I must've given the impression that I know the code base :sweat_smile: not yet, unfortunately. However, the constructor seems to accept osm_node_ids, so I'd imagine you should be able to get one out of there. Someone else with a dev setup for OSRM might be able to chime in more, sorry for that.
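A hedged sketch of how that might look from the gdb session above (frame numbers follow the backtrace; the element index is hypothetical, and whether the container can be indexed directly depends on its type and on gdb's pretty-printers):

(gdb) frame 2
(gdb) print osm_node_ids.size()
(gdb) print osm_node_ids[12345]

where 12345 stands in for a node index taken from the failing edge; if the pretty-printer can't index the container, something like call osm_node_ids.at(12345) may work instead, assuming the type provides such an accessor.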
EDIT: I'm mostly interested if it's a data issue and what that is. Would affect other routers too and I'm maintaining one of them.
If this is a data quality issue, one way of making it easier to reproduce is to see if you can trigger it on a smaller section of the planet, recursing until you have a manageable test case.
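For example (illustrative only - the bounding box is arbitrary, and any tool that can clip a PBF will do):

osmosis --rb file="planet.osm.pbf" --bounding-box top=90 left=-180 bottom=0 right=180 --wb file="north-half.osm.pbf"
osrm-extract -p osrm-backend/profiles/foot.lua north-half.osm.pbf

Keep halving whichever piece still reproduces the crash.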
However, both examples are showing an assertion failing when comparing to SPECIAL_EDGEID, so it looks more like a planet-scale overflow problem.
It's suspicious that it's the foot profile - that profile includes the largest number of edges. EdgeID is a uint32_t, so if it turns out we've got > 2^32 edges being generated on the foot profile, then yeah, overflow is quite likely.
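To make that failure mode concrete, a minimal sketch, assuming EdgeID and SPECIAL_EDGEID are defined as in OSRM's typedefs (a uint32_t and its maximum value, respectively); the edge count below is hypothetical:

#include <cstdint>
#include <iostream>
#include <limits>

using EdgeID = std::uint32_t;                 // mirrors OSRM's typedef
constexpr EdgeID SPECIAL_EDGEID = std::numeric_limits<EdgeID>::max();

int main()
{
    // Hypothetical planet-scale edge count, just past the 32-bit limit
    // (2^32 = 4,294,967,296).
    const std::uint64_t true_edge_count = 4'500'000'000ULL;

    // Narrowing to 32 bits silently wraps around...
    const EdgeID wrapped = static_cast<EdgeID>(true_edge_count);
    std::cout << "wrapped id: " << wrapped << '\n'; // 205032704, not 4.5 billion

    // ...and the edge whose true index is 2^32 - 1 becomes indistinguishable
    // from the "invalid" sentinel - exactly the state the failing assertion
    // (edge_id != SPECIAL_EDGEID) is meant to rule out.
    std::cout << std::boolalpha
              << (static_cast<EdgeID>(4294967295ULL) == SPECIAL_EDGEID) << '\n'; // true
}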
An older, possibly related ticket where osrm-contract has a similar issue: https://github.com/Project-OSRM/osrm-backend/issues/6169 - there's probably a whole cluster of overflow problems that have been creeping up given the continuing growth of OSM.
@jbalcar The quick workaround is to trim out bits of the planet you don't need - I know this isn't the best answer, but fixing the core bug might take a while.
Today, the problem still exists, even with the osmosis pre-conversion.
My osmosis solution was totally wrong.
Converting "planet" to "cleaned-planet" with osmosis WITHOUT --bounding-polygon leads to the same osrm-extract segfault on the foot/bike profiles (indeed, the number of edges after osmosis is the same).
When I did specify --bounding-polygon (with what I thought was a full-planet polygon), the OSRM binaries built fine, but the routing was actually broken because osmosis had "cut" away parts of the planet.
I used this polygon file, but it leads to broken routing:
Planet
World
90 180
90 -180
-90 -180
-90 180
END
END
Finally, I'm trying to find some way to reduce the number of edges for planet-scale data.
With today's full-planet OSM data, OSRM definitely crashes on the bike or foot profile.
Please help me find some kind of workaround for this problem.
I used this polygon file, but it leads to broken routing:
That's the whole planet though, isn't it? Doesn't do much filtering :) EDIT: ah sorry, just read that it's broken. Probably a syntax issue? I don't know osmosis well.
There's no workaround other than trimming out the bits of the planet you don't need, as suggested above.
The problem with my workaround has just been solved.
My current solution is (see the command sketch at the end of this comment):
1) use osmosis with a correct planet polygon (see below)
2) remove any countries you don't need (for example, ruzzia, etc.)
3) if you don't have 700+ GB of RAM or super-fast swap (NVMe), limit the threads to 25% of the default (option -t)
Finally, my planet was rebuilt successfully in ~108 hours (car, foot, bike profiles).
Some stats:
RAM: [info] RAM: peak bytes used: 458,707,828,736 (but actually 500 GB + 220 GB of swap were used)
Edges loaded: [info] Loaded edge based graph: 3,336,178,894 edges, 819,849,166 nodes
(Note that 3,336,178,894 is still below the uint32_t limit of 4,294,967,296, which fits the overflow diagnosis above.)
PS. My mistake was the coordinate order in the planet polygon: the .poly format expects lon lat, not lat lon.
The correct file is:
Planet
World
180 90
-180 90
-180 -90
180 -90
END
END
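Putting the steps together, a rough sketch of the invocations (hedged: file names are illustrative, the polygon above is assumed to be saved as planet.poly, the country-removal step is omitted, and -t 12 stands in for 25% of a 48-thread default):

osmosis --rb file="planet-latest.osm.pbf" --bounding-polygon file="planet.poly" --wb file="cleaned-planet.osm.pbf"
osrm-extract -p osrm-backend/profiles/foot.lua cleaned-planet.osm.pbf -t 12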
Hello,
I'm extracting the full planet with the osrm-extract command and the foot profile, but it fails every time with a Segmentation fault (error 4) during "Generating edge-expanded graph representation".
I have a correct planet.osm.pbf file from 2023-05-01, verified with md5sum. The application is built with the commands from the main osrm-backend site, and the version is the latest, 5.27.1.
But when I tried it with Slovakia.osm.pbf (a small country), it ran correctly.
I have a desktop PC with these parameters:
I also set up swap and disk space monitoring alongside the process, but both seem to be large enough.
As a last step, I built the application in debug mode, and here is the end of the output:
Where could the problem be? Thanks a lot.