Closed: Ozaq closed this issue 2 years ago
I spent some time looking into the core dump. One of the Pedestrian
objects in neighbours returned from LCGrid::GetNeighbourhood(...)
is broken.
Frame 3 shows the code operating on a null (0x0) this pointer:
(gdb) f 3
#3 0x000055f143d5987c in Room::GetSubRoom (this=0x0, index=32560)
at /mnt/fast/projects/jpscore/geometry/Room.cpp:132
132 if(_subRooms.count(index)==0)
(gdb) p this
$1 = (const Room * const) 0x0
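This is because in frame 4, at
/mnt/fast/projects/jpscore/math/VelocityModel.cpp:248
SubRoom* sb2=building->GetRoom(ped1->GetRoomID())->GetSubRoom(ped1->GetSubRoomID());
ped1 is invalid: it reports _roomID = -738083648, for which GetRoom(...) evidently returned a null Room (this=0x0 in frame 3), and _subRoomID = 32560, which shows up as index in GetSubRoom.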
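The chained call above dereferences whatever GetRoom(...) returns without checking it. As a minimal sketch of a checked lookup (not the jpscore code: Building, Room, and SubRoom here are simplified, hypothetical stand-ins, and FindSubRoom is an invented helper name), the chain could stop at the first failed lookup instead of crashing inside GetSubRoom:

```cpp
#include <cassert>
#include <map>

// Hypothetical, simplified stand-ins for the jpscore classes.
// Only the lookup logic is of interest here.
struct SubRoom {
    int id;
};

struct Room {
    std::map<int, SubRoom> subRooms;

    // Returns nullptr for an unknown index instead of assuming it exists.
    SubRoom* GetSubRoom(int index) {
        auto it = subRooms.find(index);
        return it == subRooms.end() ? nullptr : &it->second;
    }
};

struct Building {
    std::map<int, Room> rooms;

    // Returns nullptr for an unknown room id.
    Room* GetRoom(int roomID) {
        auto it = rooms.find(roomID);
        return it == rooms.end() ? nullptr : &it->second;
    }
};

// Checked variant of building->GetRoom(...)->GetSubRoom(...):
// bails out as soon as one lookup fails.
SubRoom* FindSubRoom(Building& building, int roomID, int subRoomID) {
    Room* room = building.GetRoom(roomID);
    if (room == nullptr) {
        return nullptr;  // e.g. a garbage room id read from a corrupted Pedestrian
    }
    return room->GetSubRoom(subRoomID);
}
```

Such a check would only turn the segfault into a detectable error at the call site; it would not explain why ped1 is corrupted in the first place.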
(gdb) p ped1
$4 = (Pedestrian *) 0x7f30d40008d0
(gdb) p *ped1
$2 = {_vptr.Pedestrian = 0x200000306070102, _id = 0, _exitIndex = 0, _group = 0,
_desiredFinalDestination = 0, _height = 1.09070583655615e-311, _age = 2.7813847636974126e-309,
_premovement = 2.7813847637012071e-309, _riskTolerance = 2.5394974196240072e-321,
_gender = <error reading variable: Cannot create a lazy string with address 0x0, and a non-zero length.>, _mass = 4.6686596202906463e-310, _tau = 4.6686596285364043e-310, _T = 0, _deltaT = 0,
_ellipse = {_vel = {_x = 6.9094006546270402e-310, _y = 0}, _center = {_x = 0, _y = 0},
_cosPhi = 0, _sinPhi = 0, _Xp = 0, _Amin = 0, _Av = 0, _Bmin = 0, _Bmax = 0, _vel0 = 0,
_do_stretch = false}, _V0 = {_x = 0, _y = 0}, _swayFreqA = 0, _swayFreqB = 0, _swayAmpA = 0,
_swayAmpB = 0, _V0UpStairs = 0, _V0DownStairs = 0, _EscalatorUpStairs = 6.909394026066053e-310,
_EscalatorDownStairs = 6.9093940238534294e-310, _V0IdleEscalatorUpStairs = 0,
_V0IdleEscalatorDownStairs = 0, _roomCaption = "", _roomID = -738083648, _subRoomID = 32560,
_subRoomUID = 0, _oldRoomID = 0, _oldSubRoomID = -738097072, _lastE0 = {_x = 0,
_y = 6.9093940280549636e-310}, _navLine = 0x7f30d40106d0,
_mentalMap = std::map with 0 elements,
_destHistory = std::vector of length -20540, capacity -34961923003140 = {-738111696, 32560,
-738195248, 32560, -951335008, 1071667057, -211739298, 1069689780, 116755709, 1073042417,
-1873335650, -1077876397, -2035872347, 1071830865, -296199030, 1069812776, 1015382044,
1072078908, 315321640, -1079132372, 1589197356, 1072182481, 1197220445, 1061368108,
-1987053528, 1072321235, 24799102, -1076174217, -960339058, 1072517116, -2127333405,
1068631508, 1614014102, 1073048652, -1008595101, 1067856868, -270358072, 1073047216,
-1513738360, 1068909767, 452125205, 1071661146, -1995068147, 1069321896, -1876693440,
1071993107, -676381656, 1070204279, -457592502, 1073050105, 755111029, 1067340388, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 805, 0...},
_trip = std::vector of length 0, capacity 0, _lastPosition = {_x = 0, _y = 0},
_lastCellPosition = 0, _knownDoors = std::map with 0 elements, _distToBlockade = 0,
_reroutingThreshold = 0, _timeBeforeRerouting = 4.0562789523566341e-321,
_timeInJam = 6.9093940233403916e-310, _patienceTime = 6.9093940232708272e-310,
_recordingTime = 0.62839401450811716,
_lastPositions = std::queue wrapping: std::deque with 580877179009208861 elements = {
<error reading variable>
Looking at the neighbors retrieved from LCGrid:
(gdb) p neighbours
$3 = std::vector of length 16, capacity 16 = {0x55f1446111a0, 0x7f30d40008d0, 0x55f1446153a0,
0x55f144614b60, 0x55f1446132a0, 0x55f144610750, 0x55f14460ee90, 0x55f14460e650, 0x55f14460de10,
0x55f144608e60, 0x55f144605db0, 0x55f144603d00, 0x55f144602c80, 0x55f1445fed10, 0x55f1445f8570,
0x55f144618520}
The 2nd element holds the address of the faulty object [0x7f30d40008d0].
However, when looking at the _allPedestrians field in Building, this address does not appear at all!
(gdb) p building->_allPedestians
$8 = std::vector of length 28, capacity 64 = {0x55f1445f8570, 0x55f1445fed10, 0x55f144601370,
0x55f144602c80, 0x55f144603d00, 0x55f144605570, 0x55f144605db0, 0x55f1446065f0, 0x55f1446086c0,
0x55f144608e60, 0x55f144609600, 0x55f14460a540, 0x55f14460ace0, 0x55f14460c550, 0x55f14460d5d0,
0x55f14460de10, 0x55f14460e650, 0x55f14460ee90, 0x55f144610750, 0x55f1446111a0, 0x55f1446119e0,
0x55f1446132a0, 0x55f144614b60, 0x55f1446153a0, 0x55f144615be0, 0x55f144617ce0, 0x55f144618520,
0x55f144618d60}
Looking into LCGrid::_localPedsCopy, there is also no 0x7f30d40008d0 to be seen...
(gdb) p *building->_linkedCellGrid._localPedsCopy@51
$46 = {0x55f1445f8570, 0x55f1445fed10, 0x55f144601370, 0x0, 0x0, 0x55f144602c80, 0x0,
0x55f144603d00, 0x0, 0x0, 0x55f144605570, 0x55f144605db0, 0x55f1446065f0, 0x0, 0x0, 0x0,
0x55f1446086c0, 0x55f144608e60, 0x55f144609600, 0x0, 0x55f14460a540, 0x55f14460ace0, 0x0, 0x0,
0x55f14460c550, 0x0, 0x55f14460d5d0, 0x55f14460de10, 0x55f14460e650, 0x55f14460ee90, 0x0, 0x0,
0x55f144610750, 0x55f1446111a0, 0x55f1446119e0, 0x0, 0x0, 0x55f1446132a0, 0x0, 0x0,
0x55f144614b60, 0x55f1446153a0, 0x55f144615be0, 0x0, 0x0, 0x0, 0x55f144617ce0, 0x55f144618520,
0x55f144618d60, 0x0, 0x0}
So what modified the 2nd entry in the local neighbours variable from VelocityModel.cpp:190?
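The manual comparison done in gdb above (each neighbour address against the building's list of pedestrians) could also be run as a debug check right after the neighbourhood query. The following is only a sketch under assumptions: Pedestrian is a hypothetical stand-in type and FindUnknownNeighbours is an invented helper, not an existing jpscore function.

```cpp
#include <cassert>
#include <unordered_set>
#include <vector>

// Hypothetical stand-in; the pointers are only compared, never dereferenced.
struct Pedestrian {
    int id;
};

// Returns every pointer in `neighbours` that does not appear in `allPeds`.
// A non-empty result means the neighbour list contains a pointer that the
// owning container does not know about (stale or corrupted).
std::vector<const Pedestrian*> FindUnknownNeighbours(
    const std::vector<const Pedestrian*>& neighbours,
    const std::vector<const Pedestrian*>& allPeds)
{
    const std::unordered_set<const Pedestrian*> known(allPeds.begin(), allPeds.end());
    std::vector<const Pedestrian*> unknown;
    for (const Pedestrian* p : neighbours) {
        if (known.count(p) == 0) {
            unknown.push_back(p);
        }
    }
    return unknown;
}
```

Calling this in a debug build and asserting that the result is empty would presumably flag the stray pointer at the point where it is first returned, rather than later when it is dereferenced.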
@schroedtert didn't you run into neighbourhood issues at some point? I thought this might be related.
Sounds like a deleted pedestrian.
Yes, I did. If I remember correctly, it was in a larger simulation that broke at some point with a segfault during some neighbourhood calls. But I can’t remember what we tried in order to solve it.
@Ozaq Did you use multiple threads? I guess there is something wrong with Simulation::UpdateRoutesAndLocations(): there, some peds get deleted if they do not have a valid route. With multiple threads, they may not be removed correctly from all the places where they have to be deleted. And for some reason some pedestrians are deleted in the VelocityModel too; tbh I don’t know why... At some point we need to take a closer look at Simulation.cpp to find such pitfalls.
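One common way to avoid this class of problem is to do all removals in a single, serial step after the parallel work of a time step has finished, so that no other thread or cached neighbour list can still hold a pointer to a deleted pedestrian. The sketch below illustrates that idea only; it is not how jpscore implements removal, and the Pedestrian struct, its hasValidRoute flag, and RemoveInvalidPedestrians are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical stand-in with just enough state for the example.
struct Pedestrian {
    int id;
    bool hasValidRoute;
};

// Removes every pedestrian without a valid route and returns how many were
// removed. Intended to be called from one thread only, after all parallel
// updates are done and before any neighbour structure is rebuilt from `peds`.
std::size_t RemoveInvalidPedestrians(std::vector<std::unique_ptr<Pedestrian>>& peds) {
    const std::size_t oldSize = peds.size();
    peds.erase(std::remove_if(peds.begin(), peds.end(),
                              [](const std::unique_ptr<Pedestrian>& p) {
                                  return !p->hasValidRoute;
                              }),
               peds.end());
    return oldSize - peds.size();
}
```

With ownership in one container and removal in one place, any structure that stores raw Pedestrian pointers would have to be rebuilt after this call, which makes the set of places where a pedestrian "has to be deleted" explicit.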
What happens if we don't delete pedestrians anymore?
I thought the deletion is done to remove stuck agents, e.g. when they end up in obstacles or go through walls?
That's right. I was just thinking of deactivating the deletions entirely so that we can pinpoint the bug.
@schroedtert Yeah, I am running with USE_OPENMP=ON and there are 4 threads active (but I only added the stack traces of the thread running into the segfault).
@chraibi I also suspect Ped deletion to be the reason, but I could not find any obvious issue. Also, the broken ped pointer points to an "interesting" location: just deleting the Ped would leave the address intact, and only the data behind it could be overwritten. The address 0x7f30d40008d0 is not close to any of the other threads' stacks and does not make immediate sense if interpreted as ints / chars. However, it is rather close (~900 bytes) to the backing memory of the pedestrian vector, indicating that this pointer points to something created close to the backing memory. Almost like a stray write, i.e. a miscalculated offset out of bounds of some other heap-allocated array. I did not have the patience to dig any further :/
Parallelization with OpenMP has been removed, and pedestrian removal has been reworked as well. I have not observed this crash since then.
Describe the bug
Intermittent core dump when running juelich_tests test_10 in debug; much more frequent when running in release (almost 100%).
Backtrace