PedestrianDynamics / jupedsim

JuPedSim is an open source pedestrian dynamics simulator
http://jupedsim.org

Race condition / spurious core dump in VelocityModel::ComputeNextTimeStep #464

Closed Ozaq closed 2 years ago

Ozaq commented 5 years ago
JuPedSim - JPScore

Current date   : Aug  9 2019 00:46:56
Version        : 0.8.4
Compiler       : g++ (9.1.0)
Commit hash    : v0.8.4-160-g307b7c45-dirty
Commit date    : Fri Aug 9 00:37:34 2019
Branch         : fix-juelich_tests_test_10

Describe the bug: Intermittent core dump when running juelich_tests test_10 in a debug build; much more frequent in a release build (almost 100%).

Backtrace

Core was generated by `/mnt/fast/projects/jpscore/bin/jpscore --inifile=inifiles/ini_seed_7778.0_linke'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x000055f143c3662c in std::_Rb_tree<int, std::pair<int const, std::shared_ptr<SubRoom> >, std::_Select1st<std::pair<int const, std::shared_ptr<SubRoom> > >, std::less<int>, std::allocator<std::pair<int const, std::shared_ptr<SubRoom> > > >::_M_begin (this=0x38)
    at /usr/include/c++/9/bits/stl_tree.h:751
751       (this->_M_impl._M_header._M_parent);
[Current thread is 1 (Thread 0x7f30e18c4700 (LWP 27357))]
(gdb) bt
#0  0x000055f143c3662c in std::_Rb_tree<int, std::pair<int const, std::shared_ptr<SubRoom> >, std::_Select1st<std::pair<int const, std::shared_ptr<SubRoom> > >, std::less<int>, std::allocator<std::pair<int const, std::shared_ptr<SubRoom> > > >::_M_begin (this=0x38)
    at /usr/include/c++/9/bits/stl_tree.h:751
#1  0x000055f143d5a26f in std::_Rb_tree<int, std::pair<int const, std::shared_ptr<SubRoom> >, std::_Select1st<std::pair<int const, std::shared_ptr<SubRoom> > >, std::less<int>, std::allocator<std::pair<int const, std::shared_ptr<SubRoom> > > >::find (this=0x38, __k=@0x7f30e18c3c64: 32560)
    at /usr/include/c++/9/bits/stl_tree.h:2566
#2  0x000055f143d5a042 in std::map<int, std::shared_ptr<SubRoom>, std::less<int>, std::allocator<std::pair<int const, std::shared_ptr<SubRoom> > > >::count (this=0x38, __x=@0x7f30e18c3c64: 32560)
    at /usr/include/c++/9/bits/stl_map.h:1215
#3  0x000055f143d5987c in Room::GetSubRoom (this=0x0, index=32560)
    at /mnt/fast/projects/jpscore/geometry/Room.cpp:132
#4  0x000055f143c08f23 in VelocityModel::_ZN13VelocityModel19ComputeNextTimeStepEddP8Buildingi._omp_fn.0(void) () at /mnt/fast/projects/jpscore/math/VelocityModel.cpp:248
#5  0x00007f30e2642ee6 in ?? () from /usr/lib/x86_64-linux-gnu/libgomp.so.1
#6  0x00007f30e2610182 in start_thread (arg=<optimized out>) at pthread_create.c:486
#7  0x00007f30e21efb1f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
Ozaq commented 5 years ago

I spent some time looking into the core dump. One of the Pedestrian objects in the neighbours vector returned from LCGrid::GetNeighbourhood(...) is broken.

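You can see that the code in frame 3 is operating on a null (0x0) this pointer. The this=0x38 in frames 0 to 2 is consistent with that: it would be the address of the _subRooms member at a small offset inside a Room located at address 0.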

(gdb) f 3
#3  0x000055f143d5987c in Room::GetSubRoom (this=0x0, index=32560)
    at /mnt/fast/projects/jpscore/geometry/Room.cpp:132
132      if(_subRooms.count(index)==0)
(gdb) p this
$1 = (const Room * const) 0x0

This is because in frame 4

/mnt/fast/projects/jpscore/math/VelocityModel.cpp:248
                    SubRoom* sb2=building->GetRoom(ped1->GetRoomID())->GetSubRoom(ped1->GetSubRoomID());

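ped1 is invalid: it reports _roomID = -738083648 and _subRoomID = 32560. GetRoom() evidently finds no room for that ID and returns a null pointer, on which GetSubRoom(32560) is then called.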
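
The chained call on line 248 dereferences whatever GetRoom() returns without checking it. A minimal sketch of a guarded lookup follows, assuming simplified stand-in types (BuildingSketch, RoomSketch, SubRoomSketch and FindSubRoom are hypothetical names, not the jpscore classes). It would only turn the crash into a detectable error; it does not explain where the bad pointer comes from.

```cpp
#include <cassert>
#include <map>
#include <memory>

// Hypothetical stand-ins for SubRoom, Room and Building; only the lookup
// path seen in the backtrace is modelled here.
struct SubRoomSketch {
    int id;
};

struct RoomSketch {
    std::map<int, std::shared_ptr<SubRoomSketch>> subRooms;

    // Returns nullptr for an unknown index instead of assuming it exists.
    SubRoomSketch* GetSubRoom(int index) const {
        auto it = subRooms.find(index);
        return it == subRooms.end() ? nullptr : it->second.get();
    }
};

struct BuildingSketch {
    std::map<int, std::shared_ptr<RoomSketch>> rooms;

    // Returns nullptr for an unknown room ID.
    RoomSketch* GetRoom(int index) const {
        auto it = rooms.find(index);
        return it == rooms.end() ? nullptr : it->second.get();
    }
};

// Resolves (roomID, subRoomID) step by step, so an unknown room ID yields
// nullptr rather than a member call on a null Room pointer.
SubRoomSketch* FindSubRoom(const BuildingSketch& building, int roomID, int subRoomID) {
    RoomSketch* room = building.GetRoom(roomID);
    if (room == nullptr) {
        return nullptr;
    }
    return room->GetSubRoom(subRoomID);
}
```

With this shape, the garbage IDs read from the broken object would produce a null result that the caller can log or assert on, rather than a segfault inside std::map.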

(gdb) p ped1
$4 = (Pedestrian *) 0x7f30d40008d0
(gdb) p *ped1
$2 = {_vptr.Pedestrian = 0x200000306070102, _id = 0, _exitIndex = 0, _group = 0, 
  _desiredFinalDestination = 0, _height = 1.09070583655615e-311, _age = 2.7813847636974126e-309, 
  _premovement = 2.7813847637012071e-309, _riskTolerance = 2.5394974196240072e-321, 
  _gender = <error reading variable: Cannot create a lazy string with address 0x0, and a non-zero length.>, _mass = 4.6686596202906463e-310, _tau = 4.6686596285364043e-310, _T = 0, _deltaT = 0, 
  _ellipse = {_vel = {_x = 6.9094006546270402e-310, _y = 0}, _center = {_x = 0, _y = 0}, 
    _cosPhi = 0, _sinPhi = 0, _Xp = 0, _Amin = 0, _Av = 0, _Bmin = 0, _Bmax = 0, _vel0 = 0, 
    _do_stretch = false}, _V0 = {_x = 0, _y = 0}, _swayFreqA = 0, _swayFreqB = 0, _swayAmpA = 0, 
  _swayAmpB = 0, _V0UpStairs = 0, _V0DownStairs = 0, _EscalatorUpStairs = 6.909394026066053e-310, 
  _EscalatorDownStairs = 6.9093940238534294e-310, _V0IdleEscalatorUpStairs = 0, 
  _V0IdleEscalatorDownStairs = 0, _roomCaption = "", _roomID = -738083648, _subRoomID = 32560, 
  _subRoomUID = 0, _oldRoomID = 0, _oldSubRoomID = -738097072, _lastE0 = {_x = 0, 
    _y = 6.9093940280549636e-310}, _navLine = 0x7f30d40106d0, 
  _mentalMap = std::map with 0 elements, 
  _destHistory = std::vector of length -20540, capacity -34961923003140 = {-738111696, 32560, 
    -738195248, 32560, -951335008, 1071667057, -211739298, 1069689780, 116755709, 1073042417, 
    -1873335650, -1077876397, -2035872347, 1071830865, -296199030, 1069812776, 1015382044, 
    1072078908, 315321640, -1079132372, 1589197356, 1072182481, 1197220445, 1061368108, 
    -1987053528, 1072321235, 24799102, -1076174217, -960339058, 1072517116, -2127333405, 
    1068631508, 1614014102, 1073048652, -1008595101, 1067856868, -270358072, 1073047216, 
    -1513738360, 1068909767, 452125205, 1071661146, -1995068147, 1069321896, -1876693440, 
    1071993107, -676381656, 1070204279, -457592502, 1073050105, 755111029, 1067340388, 0, 0, 0, 0, 
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 805, 0...}, 
  _trip = std::vector of length 0, capacity 0, _lastPosition = {_x = 0, _y = 0}, 
  _lastCellPosition = 0, _knownDoors = std::map with 0 elements, _distToBlockade = 0, 
  _reroutingThreshold = 0, _timeBeforeRerouting = 4.0562789523566341e-321, 
  _timeInJam = 6.9093940233403916e-310, _patienceTime = 6.9093940232708272e-310, 
  _recordingTime = 0.62839401450811716, 
  _lastPositions = std::queue wrapping: std::deque with 580877179009208861 elements = {
    <error reading variable>

Looking at the neighbors retrieved from LCGrid:

(gdb) p neighbours 
$3 = std::vector of length 16, capacity 16 = {0x55f1446111a0, 0x7f30d40008d0, 0x55f1446153a0, 
  0x55f144614b60, 0x55f1446132a0, 0x55f144610750, 0x55f14460ee90, 0x55f14460e650, 0x55f14460de10, 
  0x55f144608e60, 0x55f144605db0, 0x55f144603d00, 0x55f144602c80, 0x55f1445fed10, 0x55f1445f8570, 
  0x55f144618520}

You can see that the 2nd element has the address of the faulty object [0x7f30d40008d0]. However, when looking at the _allPedestians field in Building, this address does not appear at all!

(gdb) p building->_allPedestians
$8 = std::vector of length 28, capacity 64 = {0x55f1445f8570, 0x55f1445fed10, 0x55f144601370, 
  0x55f144602c80, 0x55f144603d00, 0x55f144605570, 0x55f144605db0, 0x55f1446065f0, 0x55f1446086c0, 
  0x55f144608e60, 0x55f144609600, 0x55f14460a540, 0x55f14460ace0, 0x55f14460c550, 0x55f14460d5d0, 
  0x55f14460de10, 0x55f14460e650, 0x55f14460ee90, 0x55f144610750, 0x55f1446111a0, 0x55f1446119e0, 
  0x55f1446132a0, 0x55f144614b60, 0x55f1446153a0, 0x55f144615be0, 0x55f144617ce0, 0x55f144618520, 
  0x55f144618d60}

Looking into LCGrid::_localPedsCopy there is also no 0x7f30d40008d0 to be seen...

(gdb) p *building->_linkedCellGrid._localPedsCopy@51
$46 = {0x55f1445f8570, 0x55f1445fed10, 0x55f144601370, 0x0, 0x0, 0x55f144602c80, 0x0, 
  0x55f144603d00, 0x0, 0x0, 0x55f144605570, 0x55f144605db0, 0x55f1446065f0, 0x0, 0x0, 0x0, 
  0x55f1446086c0, 0x55f144608e60, 0x55f144609600, 0x0, 0x55f14460a540, 0x55f14460ace0, 0x0, 0x0, 
  0x55f14460c550, 0x0, 0x55f14460d5d0, 0x55f14460de10, 0x55f14460e650, 0x55f14460ee90, 0x0, 0x0, 
  0x55f144610750, 0x55f1446111a0, 0x55f1446119e0, 0x0, 0x0, 0x55f1446132a0, 0x0, 0x0, 
  0x55f144614b60, 0x55f1446153a0, 0x55f144615be0, 0x0, 0x0, 0x0, 0x55f144617ce0, 0x55f144618520, 
  0x55f144618d60, 0x0, 0x0}
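
The manual comparison done in gdb above could also be automated as a debug-time check. The following is a minimal sketch under assumptions: PedestrianSketch and FindStrayPointers are hypothetical names, and the check only compares addresses, so it never dereferences a possibly invalid pointer.

```cpp
#include <cassert>
#include <unordered_set>
#include <vector>

// Hypothetical stand-in for the Pedestrian class.
struct PedestrianSketch {
    int id;
};

// Returns every pointer in `neighbours` that does not appear in `owned`.
// Only addresses are compared; no pointer is dereferenced.
std::vector<const PedestrianSketch*> FindStrayPointers(
    const std::vector<PedestrianSketch*>& neighbours,
    const std::vector<PedestrianSketch*>& owned)
{
    const std::unordered_set<const PedestrianSketch*> known(owned.begin(), owned.end());
    std::vector<const PedestrianSketch*> strays;
    for (const PedestrianSketch* ped : neighbours) {
        if (known.count(ped) == 0) {
            strays.push_back(ped);
        }
    }
    return strays;
}
```

Calling such a helper right after the neighbourhood query, with the building's pedestrian vector as `owned`, would flag an address like 0x7f30d40008d0 before it is used.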

So what modified the 2nd entry in the local neighbours variable from VelocityModel.cpp:190?

chraibi commented 5 years ago

@schroedtert didn't you run into neighbourhood issues at some point? I thought this might be related.

Sounds like a deleted pedestrian.

schroedtert commented 5 years ago

Yes I did. If I remember correctly, it was in a larger simulation, which broke at some point with a segfault during some neighbourhood calls, but I can't remember what we tried in order to solve it. @Ozaq Did you use multiple threads? I guess there is something wrong with Simulation::UpdateRoutesAndLocations(): here some peds get deleted if they do not have a valid route. With multiple threads, they may not be removed correctly from every place that still refers to them. And for some reason some pedestrians are deleted in the VelocityModel too; tbh I don't know why... At some point we need to have a closer look at Simulation.cpp to find such pitfalls.
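
One common way to avoid this kind of pitfall is to defer deletion: the update phase only records which agents should go, and a single-threaded step afterwards removes them from every container at once. The sketch below illustrates that idea under assumptions; SimulationSketch, PedestrianSketch, CollectInvalid and Remove are hypothetical names and this is not how Simulation.cpp is written.

```cpp
#include <algorithm>
#include <cassert>
#include <memory>
#include <unordered_set>
#include <vector>

// Hypothetical stand-in for the Pedestrian class.
struct PedestrianSketch {
    int id;
    bool hasValidRoute;
};

struct SimulationSketch {
    std::vector<std::unique_ptr<PedestrianSketch>> pedestrians; // sole owner
    std::vector<PedestrianSketch*> gridCopy; // non-owning, like a neighbour grid

    // Phase 1: read-only pass that records IDs and deletes nothing.
    std::vector<int> CollectInvalid() const {
        std::vector<int> ids;
        for (const auto& ped : pedestrians) {
            if (!ped->hasValidRoute) {
                ids.push_back(ped->id);
            }
        }
        return ids;
    }

    // Phase 2: single-threaded removal from all containers in one place.
    void Remove(const std::vector<int>& ids) {
        const std::unordered_set<int> doomed(ids.begin(), ids.end());
        // Drop the non-owning references first, while the objects still exist.
        gridCopy.erase(
            std::remove_if(gridCopy.begin(), gridCopy.end(),
                           [&](const PedestrianSketch* p) { return doomed.count(p->id) != 0; }),
            gridCopy.end());
        // Then release the owning pointers, which destroys the objects.
        pedestrians.erase(
            std::remove_if(pedestrians.begin(), pedestrians.end(),
                           [&](const std::unique_ptr<PedestrianSketch>& p) {
                               return doomed.count(p->id) != 0;
                           }),
            pedestrians.end());
    }
};
```

The design choice is that no thread ever frees an agent while another thread may still hold a pointer to it from a neighbourhood query; all bookkeeping happens between time steps.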

chraibi commented 5 years ago

What happens if we don't delete pedestrians anymore?

schroedtert commented 5 years ago

I thought the deletion was done to remove stuck agents, e.g. when they end up in obstacles or go through walls?

chraibi commented 5 years ago

That's right. I was just thinking of deactivating all the deletions so that we can pinpoint the bug.

Ozaq commented 5 years ago

@schroedtert Yes, I am running with USE_OPENMP=ON and there are 4 threads active (but I only added the stack trace of the thread that ran into the segfault).

@chraibi I also suspect Ped deletion to be the reason, but I could not find any obvious issue. Also, the broken ped pointer points to an "interesting" location: just deleting the Ped would leave the address intact, although the data behind it could be overwritten. The address 0x7f30d40008d0 is not close to any of the other threads' stacks and does not make immediate sense if interpreted as ints / chars. However, it is rather close (~900 bytes) to the backing memory of the pedestrian vector, indicating that this pointer points to something created close to that backing memory. It looks almost like a stray write, i.e. a miscalculated offset out of bounds of some other heap-allocated array. I did not have the patience to dig any further :/

Ozaq commented 2 years ago

Parallelization with OpenMP has been removed, and pedestrian removal has been reworked as well. I have not observed this crash since then.