@deanlee want to try this one? it's either stuck in the vision ipc connect or the model init
Checking.
Happened again with the logging, and it appears to be stuck in the DM model init
batman:debug$ ./filter_log_message.py "00605c23b4af39f5|2023-07-14--21-16-49" 0 | grep selfdrive.modeld | grep -v manager.py | grep -v "\/modeld.cc"
[60.259997] selfdrive/modeld/runners/snpemodel.cc:36 - SNPEModel: loaded model with size: 3630942
[60.264494] selfdrive/modeld/runners/snpemodel.cc:36 - SNPEModel: loaded model with size: 4443461
[60.607753] selfdrive/modeld/runners/snpemodel.cc:89 - addInput: adding index 0: input_img
[60.608591] selfdrive/modeld/navmodeld.cc:47 - main: models loaded, navmodeld starting
[60.609128] selfdrive/modeld/navmodeld.cc:56 - main: connected with buffer size: 98304
I think it's either #28598 or some kind of race condition now that we're running the navmodel on the DSP as well.
https://github.com/commaai/openpilot/pull/28598 looks fine so far. it's more likely some kind of race condition.
Looks like it's a race condition or something else related to running the navmodel.
navmodel enabled (#28448) on 6/16, first occurrence is 6/24. modeld refactor merged on 7/6.
The problem I've found so far is that mapsd exits without releasing MapRenderer, which can lead to deadlocks or undefined behavior in other related modules that use vipc/msgq. But if we specify a parent for MapRenderer, or create it on the stack, MapRenderer hangs on destruction and vipc/msgq still doesn't get released.
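For context, the lifetime options boil down to something like this. Rough sketch only; the actual mapsd main() may differ, and the parent variant is hypothetical:

// Rough sketch of the lifetime options discussed above (not the actual mapsd code).
#include <QApplication>

int main(int argc, char *argv[]) {
  QApplication app(argc, argv);

  // Current pattern: heap-allocated and never deleted, so ~MapRenderer never runs
  // and the vipc/msgq resources it owns aren't released when mapsd exits.
  MapRenderer *m = new MapRenderer(get_mapbox_settings());

  // Alternatives described above, both of which hang on destruction:
  //   MapRenderer m(get_mapbox_settings());  // stack: destructor runs at exit, but hangs
  //   m->setParent(&app);                    // hypothetical parent: QApplication deletes it, same hang

  return app.exec();
}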
~I've been testing this for a couple of hours and it looks like the QMapboxGL thread is deadlocking on exit. The temporary solution is to explicitly delete all resources except the map in MapRenderer::~MapRenderer, then call m_map->destroyRenderer, which will exit mapsd with a segfault. Although this is terrible, it might be worth a try before we find a final solution.~
Update: QOpenGLFunctions is released twice. This causes mapsd to segfault on exit.
You can see this bug by creating MapRenderer on the stack:
MapRenderer m(get_mapbox_settings());
return app.exec();
corrupted size vs. prev_size in fastbins
Aborted (core dumped)
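In other words, it's a plain double delete. Illustrative sketch of the general pattern only; the real MapRenderer members and ownership are different:

// Illustrative only (not the actual MapRenderer members): an object owned by a
// smart pointer and then freed again by hand gets deleted twice, which shows up
// as a segfault or glibc errors like "corrupted size vs. prev_size".
#include <memory>

struct GLFunctions {};  // stand-in for QOpenGLFunctions

struct Renderer {
  std::unique_ptr<GLFunctions> gl = std::make_unique<GLFunctions>();
  ~Renderer() {
    delete gl.get();  // BUG: freed here...
  }                   // ...and again when the unique_ptr destructs
};

int main() {
  Renderer r;  // on the stack, ~Renderer runs at scope exit -> double free
  return 0;
}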
Repro below; you also need to make boardd listen to FakeIgnition (a rough sketch of that change follows the script):
import time

from cereal.messaging import SubMaster
from common.realtime import DT_MDL
from common.params import Params


def all_dead(managerState):
  # return if all processes that should be dead are dead
  return all([not p.running or p.shouldBeRunning for p in managerState.processes])


if __name__ == "__main__":
  params = Params()
  params.put_bool("FakeIgnition", False)

  sm = SubMaster(["driverStateV2", "managerState", "deviceState"])

  occurrences = 0
  loops = 0
  while 1:
    params.put_bool("FakeIgnition", True)
    sm.update(0)

    dmon_frame = None
    # print('Waiting for driverStateV2')
    st = time.monotonic()
    timeout = 15  # s

    # successful if we get 100 messages from dmonitoringmodeld (2s)
    while time.monotonic() - st < timeout:
      sm.update(0)
      time.sleep(DT_MDL)
      if sm.updated["driverStateV2"]:
        if dmon_frame is None:
          dmon_frame = sm.rcv_frame['driverStateV2']
        if (sm.rcv_frame['driverStateV2'] - dmon_frame) > (2 / DT_MDL):
          # print('Got driverStateV2! Exiting', sm.rcv_frame['driverStateV2'], dmon_frame)
          time.sleep(1)
          break
    else:
      occurrences += 1
      print('WARNING: timed out in 15s waiting for 40 messages from dmonitoringmodeld, occurrences:', occurrences, sm.rcv_frame['driverStateV2'], dmon_frame)
      print('CurrentRoute:', params.get('CurrentRoute'))

    # TODO: is there a better way? we can't check managerState immediately since it takes a while to get the ignition
    # wait for thermald to pick up ignition, then an update from managerState, and THEN it should be safe to check procs
    params.put_bool("FakeIgnition", False)
    while sm['deviceState'].started:
      sm.update(0)
      time.sleep(0.05)
    while not sm.updated['managerState']:
      sm.update(0)
      time.sleep(0.05)

    st = time.monotonic()
    while time.monotonic() - st < timeout:
      sm.update(0)
      time.sleep(0.1)
      if all_dead(sm['managerState']):
        # print('all dead')
        break
    else:
      print('WARNING: timed out waiting for processes to die!', time.monotonic() - st)

    time.sleep(5)
    loops += 1
    if loops % 120 == 0:
      print('Tries so far:', loops, 'occurrences:', occurrences)
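The boardd change isn't included above; something roughly like this is enough. Hypothetical sketch, not the actual patch:

// Hypothetical sketch, not the actual boardd change: OR the panda ignition
// signal with the "FakeIgnition" param that the script above toggles, somewhere
// in boardd's panda state loop, so ignition can be faked without a car.
#include "common/params.h"

static bool ignition_with_fake(bool ignition_from_panda) {
  // "FakeIgnition" is the key the script writes with put_bool()
  return ignition_from_panda || Params().getBool("FakeIgnition");
}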
Patch: https://github.com/commaai/openpilot/commit/9f02f004530c88c54e6b63e7eee78fee1671a006.patch
Added some logging in snpemodel.cc; here's what we got (note that there should be two of every log, since navmodeld and dmonitoringmodeld both go through this path).
It looks like the DM model loaded first, then got stuck somewhere around here? It never reaches the snpemodel.cc:56 "build" print that the nav model does, and I'm also not sure why some prints are missing.
[55769.278756] selfdrive/modeld/runners/snpemodel.cc:28 - SNPEModel: got runtime, models/navmodel_q.dlc
[55769.317072] selfdrive/modeld/runners/snpemodel.cc:37 - SNPEModel: loaded model with size: 4443461, models/dmonitoring_model_q.dlc
[55769.317815] selfdrive/modeld/runners/snpemodel.cc:41 - SNPEModel: snpe_builder, models/dmonitoring_model_q.dlc
[55769.318091] selfdrive/modeld/runners/snpemodel.cc:43 - SNPEModel: !snpe, models/dmonitoring_model_q.dlc
[55769.413642] selfdrive/modeld/runners/snpemodel.cc:37 - SNPEModel: loaded model with size: 3630942, models/navmodel_q.dlc
[55769.414311] selfdrive/modeld/runners/snpemodel.cc:41 - SNPEModel: snpe_builder, models/navmodel_q.dlc
[55769.414547] selfdrive/modeld/runners/snpemodel.cc:43 - SNPEModel: !snpe, models/navmodel_q.dlc
[55769.759642] selfdrive/modeld/modeld.cc:209 - main: models loaded, modeld starting
[55769.834323] selfdrive/modeld/runners/snpemodel.cc:56 - SNPEModel: build, models/navmodel_q.dlc
[55769.834793] selfdrive/modeld/runners/snpemodel.cc:61 - SNPEModel: got snpe, models/navmodel_q.dlc
[55769.835208] selfdrive/modeld/runners/snpemodel.cc:65 - SNPEModel: created output buffer, models/navmodel_q.dlc
[55769.835591] selfdrive/modeld/runners/snpemodel.cc:68 - SNPEModel: debug print1, models/navmodel_q.dlc
[55769.835965] selfdrive/modeld/runners/snpemodel.cc:73 - SNPEModel: debug print2, models/navmodel_q.dlc
[55769.836337] selfdrive/modeld/runners/snpemodel.cc:75 - SNPEModel: debug print3, models/navmodel_q.dlc
[55769.836704] selfdrive/modeld/runners/snpemodel.cc:82 - SNPEModel: debug print4, models/navmodel_q.dlc
[55769.837124] selfdrive/modeld/runners/snpemodel.cc:84 - SNPEModel: debug print5, models/navmodel_q.dlc
[55769.837508] selfdrive/modeld/runners/snpemodel.cc:86 - SNPEModel: debug print6, models/navmodel_q.dlc
[55769.837878] selfdrive/modeld/runners/snpemodel.cc:98 - addInput: got input tensor names, map
[55769.838253] selfdrive/modeld/runners/snpemodel.cc:102 - addInput: got single tensor name, map
[55769.838623] selfdrive/modeld/runners/snpemodel.cc:104 - addInput: adding index 0: input_img
[55769.838992] selfdrive/modeld/runners/snpemodel.cc:107 - addInput: new debug print1, map
[55769.839364] selfdrive/modeld/runners/snpemodel.cc:112 - addInput: new debug print2, map
[55769.839742] selfdrive/modeld/runners/snpemodel.cc:120 - addInput: new debug print3, map
[55769.840396] selfdrive/modeld/runners/snpemodel.cc:127 - addInput: new debug print4, map
[55769.840799] selfdrive/modeld/runners/snpemodel.cc:130 - addInput: new debug print5, map
[55769.841187] selfdrive/modeld/runners/snpemodel.cc:133 - addInput: new debug print6, map
[55769.841564] selfdrive/modeld/navmodeld.cc:47 - main: models loaded, navmodeld starting
[55769.841991] selfdrive/modeld/navmodeld.cc:56 - main: connected with buffer size: 98304
[55770.661376] selfdrive/modeld/modeld.cc:226 - main: vision stream set up, main_wide_camera: 0, use_extra_client: 1
[55770.706299] selfdrive/modeld/modeld.cc:240 - main: connected main cam with buffer size: 4804608 (1928 x 1208)
[55770.706977] selfdrive/modeld/modeld.cc:244 - main: connected extra cam with buffer size: 4804608 (1928 x 1208)
[55770.799899] selfdrive/modeld/modeld.cc:173 - run_model: skipping model eval. Dropped -1 frames
Merged #29080, but I'd still like to fix this properly soon. Moving to the 0.9.5 milestone.
Closing since it doesn't seem worth spending more time on looking into this SNPE issue - the real fix is moving to tinygrad!
Likely stuck in the VisionIPC connect loop. Actually, it could also be stuck in the model init, which was refactored recently; added some logging around that in #28920.
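For reference, the connect loop in question looks roughly like this in the modeld-style startup path. Paraphrased sketch, not an exact copy:

// Paraphrased sketch of the startup path (not an exact copy of modeld/dmonitoringmodeld).
#include "cereal/visionipc/visionipc_client.h"
#include "common/util.h"

static ExitHandler do_exit;  // openpilot's SIGINT/SIGTERM flag

int main() {
  // The process spins here until camerad's stream is available, so a hang at
  // startup is either this loop never connecting or the model init that runs
  // before it never finishing.
  VisionIpcClient vipc_client("camerad", VISION_STREAM_DRIVER, true);
  while (!do_exit && !vipc_client.connect(false)) {
    util::sleep_for(100);
  }
  return 0;
}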