Project-OSRM / osrm-backend

Open Source Routing Machine - C++ backend
http://map.project-osrm.org
BSD 2-Clause "Simplified" License

Server crashes when routing is attempted using an europe.pbf #593

Closed janboe closed 11 years ago

janboe commented 11 years ago

Hi, I'm currently trying to set up osrm to calculate routes across Europe and I'm running into a major problem. First tests using just germany.pbf (all pbfs from geofabrik) have been very successful, however when using map data for the whole of Europe osrm segfaults. Both a colleague and I have been running into this issue on separate machines.

When starting osrm-routed built in release mode everything seems normal, however when trying to request a route it segfaults:

$ ./osrm-routed

[server] starting up engines, saved at Fri Feb 15 13:06:47 2013
[server] http 1.1 compression handled by zlib version 1.2.7
[info Server/DataStructures/QueryObjectsStorage.cpp:26] loading graph data
[info Server/DataStructures/QueryObjectsStorage.cpp:34] Data checksum is 3221414729
[info Server/DataStructures/QueryObjectsStorage.cpp:40] Loading Timestamp
[info Server/DataStructures/QueryObjectsStorage.cpp:52] Loading auxiliary information
[info Server/DataStructures/QueryObjectsStorage.cpp:62] Loading names index
[info Server/DataStructures/QueryObjectsStorage.cpp:80] All query data structures loaded
[handler] registering plugin hello
[handler] registering plugin locate
[handler] registering plugin nearest
[handler] registering plugin timestamp
[handler] registering plugin viaroute
[server] running and waiting for requests
[info Server/RequestHandler.h:64] ... (... /viaroute?loc=37.388096,-5.98233&loc=70.663438,23.681967
Segmentation fault (core dumped)

When building in debug mode, it starts spamming the console with messages like these:

...
[debug Server/DataStructures/../../DataStructures/StaticGraph.h:108] cannot find second segment of edge (188530391,142855797,38233832)
[debug Server/DataStructures/../../DataStructures/StaticGraph.h:103] cannot find first segment of edge (188530392,142855701,38233831)
[debug Server/DataStructures/../../DataStructures/StaticGraph.h:108] cannot find second segment of edge (188530392,142855701,38233831)
[debug Server/DataStructures/../../DataStructures/StaticGraph.h:103] cannot find first segment of edge (188530392,188530412,163656840)
[debug Server/DataStructures/../../DataStructures/StaticGraph.h:108] cannot find second segment of edge (188530392,188530412,163656840)
[debug Server/DataStructures/../../DataStructures/StaticGraph.h:103] cannot find first segment of edge (188530393,188530389,188530385)
[debug Server/DataStructures/../../DataStructures/StaticGraph.h:108] cannot find second segment of edge (188530393,188530389,188530385)
[debug Server/DataStructures/../../DataStructures/StaticGraph.h:103] cannot find first segment of edge (188530395,188530405,163656839)
[debug Server/DataStructures/../../DataStructures/StaticGraph.h:108] cannot find second segment of edge (188530395,188530405,163656839)
[debug Server/DataStructures/../../DataStructures/StaticGraph.h:103] cannot find first segment of edge (188530396,188530410,163656838)
...

Writing those into a file results in multiple gigabytes of error logs.

The machine is running an up-to-date Ubuntu 12.10

Please see this link for a complete backtrace by gdb: https://gist.github.com/janboe/09eae50d2c641d20944b

Thanks Jan

DennisOSRM commented 11 years ago

Is it a 32 Bit Linux?

janboe commented 11 years ago

No, it's 64bit

DennisOSRM commented 11 years ago

Preprocessing done on the same source code version and machine?

janboe commented 11 years ago

Yes. Also, I just checked: the same thing happens when using the develop branch.

janboe commented 11 years ago

I updated the backtrace with a more complete one if that is any help

DennisOSRM commented 11 years ago

Could you post the contents of your server.ini?

janboe commented 11 years ago

Sure:

$ cat server.ini
Threads = 8
IP = 0.0.0.0
Port = 5000

hsgrData=/run/shm/europe.osrm.hsgr
nodesData=/run/shm/europe.osrm.nodes
edgesData=/run/shm/europe.osrm.edges
ramIndex=/run/shm/europe.osrm.ramIndex
fileIndex=/run/shm/europe.osrm.fileIndex
namesData=/run/shm/europe.osrm.names
timestamp=/run/shm/europe.osrm.timestamp

DennisOSRM commented 11 years ago

Rather strange. It's complaining that the .hsgr file is broken, which should never actually happen. Are you sure that you copied everything correctly?

emiltin commented 11 years ago

try downloading the europe.pbf again. errors in the input data can cause weird bugs.

DennisOSRM commented 11 years ago

It shouldn't break at this stage, but it won't hurt. Especially as there were some problems with the Geofabrik extracts lately: https://twitter.com/geofabrik/status/300527039507730432

janboe commented 11 years ago

I'm pretty sure. This happened with multiple versions of the pbf. Also everything was directly generated in /run/shm (this machine has quite a bit of RAM).

I'll try again tomorrow or on Monday, maybe I'll try using an xml this time.

Thanks to both of you for your help.

janboe commented 11 years ago

I just tried the whole thing again, this time without involving geofabrik but with the latest planet.osm.pbf and the exact same crash happens. I guess I'm doing something wrong since I seem to be the only one with this problem, but I just don't know what.

emiltin commented 11 years ago

maybe running the cucumber tests could turn up something?

janboe commented 11 years ago

Some tests do indeed fail (master, release build). They mostly seem to result from osrm providing different routes than expected.

Here is the complete test/ directory after cucumber has been run: https://docs.google.com/file/d/0B5-JD-Ro7uVGdFlSTWxyME5La2c/edit?usp=sharing And here is just the fail.log: https://gist.github.com/janboe/3e28d518a8f789c6af8c

DennisOSRM commented 11 years ago

What's your platform? X86?

janboe commented 11 years ago

x86_64

emiltin commented 11 years ago

sorry, i forgot to mention that you should exclude tests for stuff that's not implemented yet (the tests marked with @todo). rerunning the tests is faster the second time, since the osm and pbf files are cached. you don't have to upload the entire test folder - it's big :-) just post the command line output of 'cucumber -p verify'.

one thing that puzzles me is that it looks like the test scenario "Bike - Except tag and on no_ restrictions" uses the car profile. maybe that's why it fails. however i doubt this is directly related to your original problem with crashes.

janboe commented 11 years ago

Alright, here we go:

$ cucumber -t ~@todo -p verify 
Using the verify profile...
Ruby version 1.9.3
Using default port 5000
.....................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................

155 scenarios (155 passed)
562 steps (562 passed)
3m52.198s

emiltin commented 11 years ago

all green, so your build works as expected for these small test datasets.

janboe commented 11 years ago

Yes, it also seems to work flawlessly on data the size of Germany or smaller, it only starts crashing when handling Europe or the whole planet. Using only germany.osm.pbf I can get routes without any problems

janboe commented 11 years ago

I'm still tinkering with this but still failing, so I'll just write down everything I'm doing from start to crash; maybe this helps. Also, I think this looks a bit similar to https://github.com/DennisOSRM/Project-OSRM/issues/83

The machine(s): The last few times I tried it were on an Amazon EC2 "High Memory Cluster Eight Extra Large" instance. We get full access to the machine's CPUs here, so --no-march isn't needed (I also tried it with and without that flag, no difference). I have tried it on different instance types before; --no-march is needed there, but it still crashes.

The software: The instances are running an up-to-date Ubuntu 12.10 or 12.04 installation, always 64bit versions. Dependencies are installed according to the wiki. OSRM is then checked out (master or develop, both crash) and built. The behavior in the testsuite is always the same as above.

The input data: The input pbf is loaded either from geofabrik or directly from planet.openstreetmap.org. germany.osm.pbf as well as its subset stuttgart-regbez.osm.pbf work, europe.osm.pbf and planet.osm.pbf don't. The files have been downloaded multiple times, the phenomenon is always the same.

Any ideas how I could narrow this down further?

emiltin commented 11 years ago

maybe:
- try with different queries? different locations, parameters..
- try the /locate and /nearest API
- try with different profiles (car/bicycle)

janboe commented 11 years ago

different queries still cause the same crash, /locate and /nearest work, it only crashes when trying to calculate routes. I'm currently setting up the whole thing using the bike profile. Just copying profiles/bicycle.lua to profile.lua should do the trick, right?

emiltin commented 11 years ago

yes either copy the profile, or specify the path to it on the command line

janboe commented 11 years ago

Ok, so the extraction is done. Does the number of processed nodes and edges look right for a europe.osm.pbf?

$ ./osrm-extract /mnt/europe.osm.pbf
[info Extractor/ScriptingEnvironment.cpp:33] Using script profile.lua
[info extractor.cpp:59] extracting data from input file /mnt/europe.osm.pbf
[STXXL-MSG] STXXL v1.3.1 (release)
[STXXL-MSG] 1 disks are allocated, total space: 200000 MiB
[info Extractor/PBFParser.h:167] Parse Data Thread Finished
[info extractor.cpp:109] parsing finished after 1315.22 seconds
[extractor] Sorting used nodes        ... ok, after 21.1672s
[extractor] Erasing duplicate nodes   ... ok, after 11.1314s
[extractor] Sorting all nodes         ... ok, after 43.7396s
[extractor] Sorting used ways         ... ok, after 1.98333s
[extractor] Sorting restrctns. by from... ok, after 2.00468s
[extractor] Fixing restriction starts ... ok, after 2.53643s
[extractor] Sorting restrctns. by to  ... ok, after 0.0188708s
[extractor] Fixing restriction ends   ... ok, after 0.542831s
[info Extractor/ExtractionContainers.cpp:120] usable restrictions: 92674
[extractor] Confirming/Writing used nodes     ... ok, after 41.0108s
[extractor] setting number of nodes   ... ok
[extractor] Sorting edges by start    ... ok, after 56.0873s
[extractor] Setting start coords      ... ok, after 46.8795s
[extractor] Sorting edges by target   ... ok, after 61.9819s
[extractor] Setting target coords     ... ok, after 199.474s
[extractor] setting number of edges   ... ok
[extractor] writing street name index ... ok, after 0.411548s
[info Extractor/ExtractionContainers.cpp:298] Processed 212105571 nodes and 224487690 edges
[info extractor.cpp:116] finished

Run:
./osrm-prepare /mnt/europe.osrm /mnt/europe.osrm.restrictions

DennisOSRM commented 11 years ago

It is the biking profile, right? The number of nodes appears to be rather high at first sight.

janboe commented 11 years ago

Alright, preprocessing is done:

$ ./osrm-prepare /mnt/europe.osrm /mnt/europe.osrm.restrictions
[info createHierarchy.cpp:78] Using restrictions from file: /mnt/europe.osrm.restrictions
[info createHierarchy.cpp:116] Parsing speedprofile from profile.lua
[info Util/GraphLoader.h:190] Graph loaded ok and has 224248859 edges
[info createHierarchy.cpp:137] 92674 restrictions, 136596 bollard nodes, 189767 traffic lights
[info createHierarchy.cpp:146] Generating edge-expanded graph representation
[info Contractor/EdgeBasedGraphFactory.cpp:171] Identifying small components
. 10% . 20% . 30% . 40% . 50% . 60% . 70% . 80% . 90% . 100%
[info Contractor/EdgeBasedGraphFactory.cpp:225] identified: 753610 many components
. 10% . 20% . 30% . 40% . 50% . 60% . 70% . 80% . 90% . 100%
[info Contractor/EdgeBasedGraphFactory.cpp:328] Node-based graph contains 972439611 edges
[info Contractor/EdgeBasedGraphFactory.cpp:330] Edge-based graph skipped 115464 turns, defined by 92674 restrictions.
[info Contractor/EdgeBasedGraphFactory.cpp:331] Generated 438438585 edge based nodes
[info createHierarchy.cpp:161] writing node map ...
[info createHierarchy.cpp:170] writing info on original edges
[info createHierarchy.cpp:183] building grid ...
[STXXL-MSG] STXXL v1.3.1 (release)
[STXXL-MSG] 1 disks are allocated, total space: 200000 MiB
. 10% . 20% . 30% . 40% . 50% . 60% . 70% . 80% . 90% . 100%
[info DataStructures/NNGrid.h:114] finished sorting after 85.1483s
writing data .... 10% . 20% . 30% . 40% . 50% . 60% . 70% . 80% . 90% . 100%
using hardware base sse computation
[info createHierarchy.cpp:190] CRC32 based checksum is 824147423
[info createHierarchy.cpp:196] initializing contractor
merged 131768 edges out of 1068002052
contractor finished initalization
Contractor is using 32 threads
initializing elimination PQ ...ok
preprocessing .... 10% . 20% . 30% . 40% . 50% . 60% . [flush 299394816 nodes]  70% . 80% . 90% . 100%
[info createHierarchy.cpp:200] Contraction took 7102.52 sec
[info Contractor/Contractor.h:437] Getting edges of minimized graph
. 10% . 20% . 30% . 40% . 50% . 60% . 70% . 80% . 90% . 100%
[info Contractor/Contractor.h:465] Renumbered edges of minimized graph, freeing space
[info Contractor/Contractor.h:468] Loading temporary edges
[info Contractor/Contractor.h:495] Hierarchy has 662169982 edges
[info createHierarchy.cpp:210] Building Node Array
[info createHierarchy.cpp:214] Serializing compacted graph
[info createHierarchy.cpp:266] Expansion  : 268012 nodes/sec and 554001 edges/sec
[info createHierarchy.cpp:267] Contraction: 554001 nodes/sec and 76176.9 edges/sec
[info createHierarchy.cpp:272] finished preprocessing

janboe commented 11 years ago

It doesn't crash using the bicycle profile, however routed seems to be unable to find even short routes (map.project-osrm.org does)

$ ./osrm-routed

[server] starting up engines, saved at Sat Feb 16 17:23:39 2013
[server] http 1.1 compression handled by zlib version 1.2.7
[info Server/DataStructures/QueryObjectsStorage.cpp:26] loading graph data
[info Server/DataStructures/QueryObjectsStorage.cpp:34] Data checksum is 824147423
[info Server/DataStructures/QueryObjectsStorage.cpp:40] Loading Timestamp
[info Server/DataStructures/QueryObjectsStorage.cpp:52] Loading auxiliary information
[info Server/DataStructures/QueryObjectsStorage.cpp:62] Loading names index
[info Server/DataStructures/QueryObjectsStorage.cpp:80] All query data structures loaded
[handler] registering plugin hello
[handler] registering plugin locate
[handler] registering plugin nearest
[handler] registering plugin timestamp
[handler] registering plugin viaroute
[server] running and waiting for requests
[info Server/RequestHandler.h:64] 19-02-2013 12:05:39 127.0.0.1 - Wget/1.13.4 (linux-gnu) /viaroute?loc=37.388096,-5.98233&loc=70.663438,23.681967
[info Server/RequestHandler.h:64] 19-02-2013 12:05:42 127.0.0.1 - Wget/1.13.4 (linux-gnu) /viaroute?loc=37.388096,-5.98233&loc=70.663438,23.681967
[info Server/RequestHandler.h:64] 19-02-2013 12:06:31 127.0.0.1 - Wget/1.13.4 (linux-gnu) /viaroute?loc=37.388096,-5.98233&loc=50.663438,23.681967
[info Server/RequestHandler.h:64] 19-02-2013 12:06:37 127.0.0.1 - Wget/1.13.4 (linux-gnu) /viaroute?loc=36.388096,-5.98233&loc=50.663438,23.681967
...

emiltin commented 11 years ago

try adding the z parameter. see #588

janboe commented 11 years ago

And here we go again:

osrm-routed: Plugins/../Descriptors/../DataStructures/../RoutingAlgorithms/BasicRoutingInterface.h:190: void BasicRoutingInterface<QueryDataT>::UnpackEdge(NodeID, NodeID, std::vector<unsigned int>&) const [with QueryDataT = SearchEngineData<QueryEdge::EdgeData, StaticGraph<QueryEdge::EdgeData> >; NodeID = unsigned int]: Assertion `smallestWeight != 2147483647' failed.

I got a few routes from it, although it seemed to be unable to provide a lot of routes (straight down a road for a few hundred meters) that were available on map.project-osrm.org, responding with "Cannot find route between points", even though the same parameters as on m.p.o were used.

emiltin commented 11 years ago

i'm afraid @DennisOSRM will have to help you on this one. hopefully the various things you tried can help pinpoint the problem.

janboe commented 11 years ago

Thanks for your time!

janboe commented 11 years ago

Quick update: it doesn't make a difference if I use luajit or just plain old lua.

janboe commented 11 years ago

Ok, so I "fixed" it, in a way. OSRM works perfectly when running on a Fedora 18 instance (ami-6145cc08, on a m2.4xlarge). Building it is a bit awkward with fedora only providing /usr/lib64/pkgconfig/lua.pc instead of lua5.1.pc and not providing luabind.pc at all, however after symlinking lua.pc to lua5.1.pc and creating luabind.pc (I copied and edited lua.pc for this) it builds and runs perfectly. osrm-extract segfaults at the very end (after: "Run ./osrm-extract...") but routing and everything else works.

It seems that the problem is in some way related to Ubuntu or the Ubuntu images provided by Amazon. I'll try it on a physical machine running Ubuntu today.

emiltin commented 11 years ago

i wouldn't call osrm-extract segfaulting at the end "working perfectly" :-)

janboe commented 11 years ago

Ah well, as long as it only segfaults after it has done its job it's close enough. :)

DennisOSRM commented 11 years ago

The build on Fedora will become much easier once the feature/cmake branch is merged. The Amazon VMs are a bit fishy. We have had reports that certain opcodes are not supported although the CPU seems to advertise them.

janboe commented 11 years ago

Yes, as far as I can tell this happens on the non-hvm instances, where you have to build with --no-march (that's how I compiled the Fedora build). The hvm/cluster instances give you full access to the underlying CPU: you can build anything with --march=native there, you can use icc with -fast -xHost if you so desire, and so on.

janboe commented 11 years ago

Well, this is interesting: A local machine running the same Ubuntu version, with exactly the same packages installed, is able to run osrm without any problems. So, at the moment, I blame amazon.

janboe commented 11 years ago

So I narrowed this down a bit more: If I run osrm-extract and osrm-prepare on a non-amazon machine and then upload it to one it works. I guess that during extraction or preprocessing something goes wrong on amazon machines.

DennisOSRM commented 11 years ago

If you used the same input-data and still have the files lying around:

Could you run a md5sum on the data files and report which ones are different?
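
For anyone repeating this comparison, here is a minimal Python sketch of the same check. It is not part of OSRM; the function names and the `*.osrm*` file pattern are assumptions chosen for illustration. Each machine computes a table of MD5 digests for its data files, and the two tables are then compared by file name.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks so large files fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_table(directory, pattern="*.osrm*"):
    """Map file name -> MD5 digest for every matching regular file in a directory.

    The pattern is an assumption: it matches names such as europe.osrm
    and europe.osrm.hsgr.
    """
    return {
        path.name: md5_of_file(path)
        for path in sorted(Path(directory).glob(pattern))
        if path.is_file()
    }


def differing_files(table_a, table_b):
    """Return sorted names whose digests differ or that exist on only one side."""
    names = set(table_a) | set(table_b)
    return sorted(name for name in names if table_a.get(name) != table_b.get(name))
```

Since the two sets of files usually live on different machines, one option is to dump each table to a text file, copy it over, and pass both to `differing_files`.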

janboe commented 11 years ago

I transferred the input data to an Amazon instance and prepared it there; the .hsgr file is different, all others are the same. However I had already deleted the .osrm and .osrm.restrictions files from the local machine. If necessary I can generate them again.

DennisOSRM commented 11 years ago

I'd appreciate it if you could check these files, too. Thanks.

janboe commented 11 years ago

Is running osrm-extract sufficient? Or does osrm-prepare change those files in any way? I'm just asking because this local machine isn't exactly the fastest, and -prepare might take a while.

DennisOSRM commented 11 years ago

osrm-extract is enough. -prepare just reads these data files.

janboe commented 11 years ago

Alright, done. The .osrm and the .osrm.restrictions files are the same as on the Amazon instance; the only file that differs is the .osrm.hsgr

keesklopt commented 11 years ago

Hi, janboe

I have been struggling with the same problem (it seems, same exact symptoms) for the last few days now. In my case it seems related to (closed) issue #306.

Only because i was generating another small network while i was generating europe i found that my /tmp directory was full.

@DennisOSRM the contractor just went on and finished without any errors. But apparently with a truncated hierarchy file.

I enlarged /tmp by quite a bit and it worked again, difference :

before: Hierarchy has 342839568 edges
after: Hierarchy has 723926725 edges

I think you might have the same problem since you start out with more nodes and edges than me (different profile ?) yet you have fewer hierarchy edges after contraction.

yours: [info Extractor/ExtractionContainers.cpp:298] Processed 212105571 nodes and 224487690 edges, with 662169982 edges in the CH

mine: [info Extractor/ExtractionContainers.cpp:298] Processed 132190308 nodes and 139082552 edges, with 723926725 edges in the CH
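
The comparison above can be written out as arithmetic. The Python sketch below pulls the two counts from log text and divides hierarchy edges by extracted edges. Treating this ratio as a warning sign is only a heuristic suggested by the numbers in this thread, not a documented OSRM diagnostic, and ratios from different profiles are not directly comparable. The function name and the regular expressions are assumptions based on the log lines quoted here.

```python
import re

# Patterns modelled on the log lines quoted in this thread (an assumption;
# other OSRM versions may word these messages differently).
PROCESSED_RE = re.compile(r"Processed (\d+) nodes and (\d+) edges")
HIERARCHY_RE = re.compile(r"Hierarchy has (\d+) edges")


def hierarchy_ratio(log_text):
    """Return hierarchy edges / extracted edges, or None if a count is missing."""
    processed = PROCESSED_RE.search(log_text)
    hierarchy = HIERARCHY_RE.search(log_text)
    if not processed or not hierarchy:
        return None
    return int(hierarchy.group(1)) / int(processed.group(2))


# Counts taken from the comments in this thread.
janboe = "Processed 212105571 nodes and 224487690 edges\nHierarchy has 662169982 edges"
kees_before = "Processed 132190308 nodes and 139082552 edges\nHierarchy has 342839568 edges"
kees_after = "Processed 132190308 nodes and 139082552 edges\nHierarchy has 723926725 edges"

for label, log in (("janboe", janboe), ("kees before", kees_before), ("kees after", kees_after)):
    print(f"{label}: {hierarchy_ratio(log):.2f} hierarchy edges per extracted edge")
```

With these numbers the run on the enlarged /tmp comes out at roughly 5.2, while the truncated run and janboe's run sit at roughly 2.5 and 2.9.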

Hope this helps, greetings Kees

PS. i already made swap and the space for stxxl huge a while ago, but having those too small usually gave a "killed" message.

DennisOSRM commented 11 years ago

Full /tmp directory might actually be an issue. thanks for reporting that. I will fix it.

janboe commented 11 years ago

Cool, a larger /tmp does indeed fix my issue. On the Amazon instances I used, /tmp was on the root device, which was only 8 GB.
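
Since the contractor finished without reporting an error, a check before preprocessing may save a long run. A minimal Python sketch, assuming the temporary files land in the system temp directory; the 50 GiB threshold is a made-up placeholder, as this thread does not say how much space is actually needed.

```python
import shutil
import tempfile


def free_gib(path):
    """Return the free space, in GiB, on the filesystem that holds `path`."""
    return shutil.disk_usage(path).free / 2**30


def has_enough_space(path, required_gib):
    """True when the filesystem holding `path` has at least `required_gib` GiB free."""
    return free_gib(path) >= required_gib


if __name__ == "__main__":
    # gettempdir() honours TMPDIR and typically falls back to /tmp on Linux.
    # Whether osrm-prepare uses the same location is an assumption.
    tmp_dir = tempfile.gettempdir()
    required = 50  # hypothetical threshold in GiB, not a figure from OSRM
    if has_enough_space(tmp_dir, required):
        print(f"{tmp_dir}: {free_gib(tmp_dir):.1f} GiB free, ok")
    else:
        print(f"{tmp_dir}: only {free_gib(tmp_dir):.1f} GiB free, want {required}")
```

On an instance like the one described, with /tmp on an 8 GB root device, this would flag the problem before the contraction starts.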

Thank you all very much.

keesklopt commented 11 years ago

No problem,

Actually this discussion helped me a lot as well with regards to amazon, which i am just starting to use this week.