jbalcar opened 1 year ago
I can confirm that we've reproduced this at Wanderlog with the latest foot profile. I've attached the full logs from our (non-debug) build, and have also verified that the car profile builds fine with planet.pbf.
The command we used for this is just the standard one:
osrm-extract -p osrm-backend/profiles/foot.lua planet.osm.pbf
Our specs are:
Ubuntu 22.04 LTS, 512 GB of memory, 384 GB of swap on an NVMe drive, 48-core AMD EPYC CPU
I'm running another build to see if the same happens on Ubuntu 20.04 LTS and with an Intel CPU, just in case it's somehow platform-dependent, and will update once we have results.
Edit (2023-05-17): I've confirmed that OSRM fails on Ubuntu 20.04 LTS as well, on an Intel-based server, so this appears to be hardware-independent and likely due to some quirk in the latest planet.osm.pbf.
I can confirm the problem.
About 2 months ago, osrm-extract started crashing on planet data.
My hardware is an E5-2680 with 500 GB RAM + 400 GB swap.
osrm-extract v5.27.1 (also tried with an older version).
My current workaround is repacking the planet data with osmosis before running osrm-extract:
time osmosis --rb file="$(echo planet-*.osm.pbf)" --wb file="cleaned-planet.osm.pbf" # planet: 209 minutes
PS. Additionally, I remove a "404 country" from the planet data using osmosis, but it seems that just repacking, without any modifications, sorts out the osrm-extract crash.
Ouch, so it's a data issue? Where is the planet data coming from? Was it a fresh copy or a file which was incrementally updated over time?
It'd be super valuable if we knew the OSM way ID where this is happening, but that'd require a log statement or a run with gdb.
I download planet data weekly via BitTorrent: https://planet.osm.org/pbf/planet-latest.osm.pbf.torrent
This is fresh, full data, a ~68 GB PBF.
After osmosis, it turned into a ~59 GB PBF.
Now I've succeeded with all 3 default profiles: car, foot, bike.
@nilsnolde I've got a server with gdb stopped at the crash, by running:
gdb osrm-extract
catch throw
run -p osrm-backend/profiles/foot.lua /path/to/data/planet.osm.pbf
It'd be super valuable if we knew the OSM way ID where this is happening, but that'd require a log statement or a run with gdb.
I could use some help teasing out the OSM ID: here's the backtrace, with the failed assertion in https://github.com/Project-OSRM/osrm-backend/blob/master/src/extractor/node_based_graph_factory.cpp#L49. What GDB commands can I run to get the OSM ID in this case? The tricky part is that all the IDs/edges accessible in this scope are the post-compression IDs.
(gdb) bt
#0 0x00005555556177f5 in operator() (__closure=__closure@entry=0x7fffffffc550) at /root/osrm-backend/src/extractor/node_based_graph_factory.cpp:49
#1 0x0000555555618625 in osrm::extractor::NodeBasedGraphFactory::BuildCompressedOutputGraph (this=this@entry=0x7fffffffcbd0, edge_list=std::vector of length 2170834838, capacity 2172361565 = {...})
at /root/osrm-backend/src/extractor/node_based_graph_factory.cpp:49
#2 0x0000555555618fb0 in osrm::extractor::NodeBasedGraphFactory::NodeBasedGraphFactory (this=this@entry=0x7fffffffcbd0, scripting_environment=..., turn_restrictions=std::vector of length 0, capacity 0,
maneuver_overrides=std::vector of length 37, capacity 64 = {...}, traffic_signals=..., barriers=..., coordinates=..., osm_node_ids=..., edge_list=std::vector of length 2170834838, capacity 2172361565 = {...}, annotation_data=...)
at /root/osrm-backend/src/extractor/node_based_graph_factory.cpp:29
#3 0x000055555559818a in osrm::extractor::Extractor::run (this=this@entry=0x7fffffffda50, scripting_environment=...) at /root/osrm-backend/src/extractor/extractor.cpp:231
#4 0x000055555558bb62 in osrm::extract (config=...) at /root/osrm-backend/src/osrm/extractor.cpp:15
#5 0x000055555557bd91 in main (argc=4, argv=0x7fffffffe068) at /root/osrm-backend/src/tools/extract.cpp:192
Here are some commands that I've tried:
(gdb) print nbg_edge_id
$17 = 19531
(gdb) print compressed_output_graph.GetEdgeData(nbg_edge_id)
$13 = (osrm::util::NodeBasedEdgeData &) @0x5555655f7644: {weight = {__value = 56}, duration = {__value = 56}, distance = {__value = 7.79573011}, geometry_id = {id = 0, forward = 0}, reversed = true, flags = {forward = 1 '\001',
backward = 0 '\000', is_split = 0 '\000', roundabout = 0 '\000', circular = 0 '\000', startpoint = 1 '\001', restricted = 0 '\000', road_classification = {motorway_class = 0 '\000', link_class = 0 '\000', may_be_ignored = 0 '\000',
road_priority_class = 10 '\n', number_of_lanes = 2 '\002'}, highway_turn_classification = 0 '\000', access_turn_classification = 0 '\000'}, annotation_data = 13531483}
Oh sorry, I must've given the impression that I know the code base :sweat_smile: not yet, unfortunately. However, the constructor seems to accept osm_node_ids, so I'd imagine you should be able to get one out of there. Someone else with a dev setup for OSRM might be able to chime in more, sorry for that.
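A hedged sketch of how that might look from the gdb session above (frame numbers follow the backtrace; the element index is hypothetical, and whether the container can be indexed directly depends on its type and on gdb's pretty-printers):

(gdb) frame 2
(gdb) print osm_node_ids.size()
(gdb) print osm_node_ids[12345]

where 12345 stands in for a node index taken from the failing edge; if the pretty-printer can't index the container, something like call osm_node_ids.at(12345) may work instead, assuming the type provides such an accessor.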
EDIT: I'm mostly interested if it's a data issue and what that is. Would affect other routers too and I'm maintaining one of them.
If this is a data quality issue, one way of making it easier to reproduce is to see if you can trigger it on a smaller section of the planet, recursing until you have a manageable test case.
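For example (illustrative only - the bounding box is arbitrary, and any tool that can clip a PBF will do):

osmosis --rb file="planet.osm.pbf" --bounding-box top=90 left=-180 bottom=0 right=180 --wb file="north-half.osm.pbf"
osrm-extract -p osrm-backend/profiles/foot.lua north-half.osm.pbf

Keep halving whichever piece still reproduces the crash.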
However, both examples are showing an assertion failing when comparing to SPECIAL_EDGEID, so it looks more like a planet-scale overflow problem.
It's suspicious that it's the foot profile - that profile includes the largest number of edges. EdgeID is a uint32_t, so if it turns out we've got > 2^32 edges being generated on the foot profile, then yeah, overflow is quite likely.
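To make that failure mode concrete, a minimal sketch, assuming EdgeID and SPECIAL_EDGEID are defined as in OSRM's typedefs (a uint32_t and its maximum value, respectively); the edge count below is hypothetical:

#include <cstdint>
#include <iostream>
#include <limits>

using EdgeID = std::uint32_t;                 // mirrors OSRM's typedef
constexpr EdgeID SPECIAL_EDGEID = std::numeric_limits<EdgeID>::max();

int main()
{
    // Hypothetical planet-scale edge count, just past the 32-bit limit
    // (2^32 = 4,294,967,296).
    const std::uint64_t true_edge_count = 4'500'000'000ULL;

    // Narrowing to 32 bits silently wraps around...
    const EdgeID wrapped = static_cast<EdgeID>(true_edge_count);
    std::cout << "wrapped id: " << wrapped << '\n'; // 205032704, not 4.5 billion

    // ...and the edge whose true index is 2^32 - 1 becomes indistinguishable
    // from the "invalid" sentinel - exactly the state the failing assertion
    // (edge_id != SPECIAL_EDGEID) is meant to rule out.
    std::cout << std::boolalpha
              << (static_cast<EdgeID>(4294967295ULL) == SPECIAL_EDGEID) << '\n'; // true
}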
An older, possibly related ticket where osrm-contract has a similar issue: https://github.com/Project-OSRM/osrm-backend/issues/6169 - there's probably a whole cluster of overflow problems that have been creeping up given the continuing growth of OSM.
@jbalcar The quick workaround is to trim out bits of the planet you don't need - I know this isn't the best answer, but fixing the core bug might take a while.
Today, the problem still exists, even with the osmosis pre-conversion.
My osmosis solution was totally wrong.
Converting "planet" to "cleaned-planet" with osmosis WITHOUT --bounding-polygon leads to the same osrm-extract segfault on the foot/bike profiles (indeed, the number of edges after osmosis is the same).
When I did specify --bounding-polygon (with what I thought was a full-planet polygon), the OSRM binaries built fine, but the routing was actually broken because osmosis had "cut" away parts of the planet.
I used this polygon file, but it leads to broken routing:
Planet
World
90 180
90 -180
-90 -180
-90 180
END
END
Finally, I'm trying to find some way to reduce the number of edges for planet-scale data.
With today's full-planet OSM data, OSRM definitely crashes on the bike or foot profile.
Please help me find some kind of workaround for this problem.
I used this polygon file, but it leads to broken routing:
That's the whole planet though, isn't it? Doesn't do much filtering :) EDIT: ah sorry, just read that it's broken. Probably a syntax issue? I don't know osmosis well.
There's no workaround other than trimming out the bits of the planet you don't need, as suggested above.
The problem with my workaround has just been solved.
My current solution is (see the command sketch at the end of this comment):
1) use osmosis with a correct planet polygon (see below)
2) remove any countries you don't need (for example, ruzzia, etc.)
3) if you don't have 700+ GB of RAM or super-fast swap (NVMe), limit the threads to 25% of the default (option -t)
Finally, my planet was rebuilt successfully in ~108 hours (car, foot, bike profiles).
Some stats:
RAM: [info] RAM: peak bytes used: 458,707,828,736 (but actually 500 GB + 220 GB of swap were used)
Edges loaded: [info] Loaded edge based graph: 3,336,178,894 edges, 819,849,166 nodes
(Note that 3,336,178,894 is still below the uint32_t limit of 4,294,967,296, which fits the overflow diagnosis above.)
PS. My mistake was the coordinate order in the planet polygon: the .poly format expects lon lat, not lat lon.
The correct file is:
Planet
World
180 90
-180 90
-180 -90
180 -90
END
END
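Putting the steps together, a rough sketch of the invocations (hedged: file names are illustrative, the polygon above is assumed to be saved as planet.poly, the country-removal step is omitted, and -t 12 stands in for 25% of a 48-thread default):

osmosis --rb file="planet-latest.osm.pbf" --bounding-polygon file="planet.poly" --wb file="cleaned-planet.osm.pbf"
osrm-extract -p osrm-backend/profiles/foot.lua cleaned-planet.osm.pbf -t 12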
Hello,
I'm extracting the full planet with the osrm-extract command and the foot profile, but it fails every time with a Segmentation fault (error 4) during "Generating edge-expanded graph representation".
I have a correct planet.osm.pbf file from 2023-05-01, verified with md5sum. The application is built with the commands from the main osrm-backend site, and the version is the latest, 5.27.1.
But when I tried it with Slovakia.osm.pbf (a small country), it ran correctly.
I have a desktop PC with these parameters:
I also set up swap and disk space monitoring alongside the process, but both seem to be large enough.
As a last step, I built the application in debug mode, and here is the end of the output:
Where could the problem be? Thanks a lot.