@deanlee want to try this one? it's either stuck in the vision ipc connect or the model init
Checking.
Happened again with the logging, and it appears to be stuck in the DM model init
batman:debug$ ./filter_log_message.py "00605c23b4af39f5|2023-07-14--21-16-49" 0 | grep selfdrive.modeld | grep -v manager.py | grep -v "\/modeld.cc"
[60.259997] selfdrive/modeld/runners/snpemodel.cc:36 - SNPEModel: loaded model with size: 3630942
[60.264494] selfdrive/modeld/runners/snpemodel.cc:36 - SNPEModel: loaded model with size: 4443461
[60.607753] selfdrive/modeld/runners/snpemodel.cc:89 - addInput: adding index 0: input_img
[60.608591] selfdrive/modeld/navmodeld.cc:47 - main: models loaded, navmodeld starting
[60.609128] selfdrive/modeld/navmodeld.cc:56 - main: connected with buffer size: 98304
I think it's either #28598 or some kind of race condition now that we're running the navmodel on the DSP as well.
https://github.com/commaai/openpilot/pull/28598 looks fine so far. it's more likely some kind of race condition.
Looks like it's a race condition or something else related to running the navmodel.
navmodel enabled (#28448) on 6/16, first occurrence is 6/24. modeld refactor merged on 7/6.
The problem I've found so far is that mapsd exits without releasing MapRenderer, which can lead to deadlocks or undefined behavior in other related modules that use vipc/msgq. But if we specify a parent for MapRenderer, or create it on the stack, MapRenderer hangs on destruction and vipc/msgq still doesn't get released.
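For context, the lifetime options boil down to something like this. Rough sketch only; the actual mapsd main() may differ, and the parent variant is hypothetical:

// Rough sketch of the lifetime options discussed above (not the actual mapsd code).
#include <QApplication>

int main(int argc, char *argv[]) {
  QApplication app(argc, argv);

  // Current pattern: heap-allocated and never deleted, so ~MapRenderer never runs
  // and the vipc/msgq resources it owns aren't released when mapsd exits.
  MapRenderer *m = new MapRenderer(get_mapbox_settings());

  // Alternatives described above, both of which hang on destruction:
  //   MapRenderer m(get_mapbox_settings());  // stack: destructor runs at exit, but hangs
  //   m->setParent(&app);                    // hypothetical parent: QApplication deletes it, same hang

  return app.exec();
}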
~I've been testing this for a couple of hours and it looks like the QMapboxGL thread is deadlocking on exit. The temporary solution is to explicitly delete all resources except the map in MapRenderer::~MapRenderer, then call m_map->destroyRenderer, which will exit mapsd with a segfault. Although this is terrible, it might be worth a try before we find a final solution.~
Update: QOpenGLFunctions is released twice. This causes mapsd to segfault on exit.
You can see this bug by creating MapRenderer on the stack:
MapRenderer m(get_mapbox_settings());
return app.exec();
corrupted size vs. prev_size in fastbins
Aborted (core dumped)
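In other words, it's a plain double delete. Illustrative sketch of the general pattern only; the real MapRenderer members and ownership are different:

// Illustrative only (not the actual MapRenderer members): an object owned by a
// smart pointer and then freed again by hand gets deleted twice, which shows up
// as a segfault or glibc errors like "corrupted size vs. prev_size".
#include <memory>

struct GLFunctions {};  // stand-in for QOpenGLFunctions

struct Renderer {
  std::unique_ptr<GLFunctions> gl = std::make_unique<GLFunctions>();
  ~Renderer() {
    delete gl.get();  // BUG: freed here...
  }                   // ...and again when the unique_ptr destructs
};

int main() {
  Renderer r;  // on the stack, ~Renderer runs at scope exit -> double free
  return 0;
}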
Repro below; you also need to make boardd listen to FakeIgnition (a rough sketch of that change follows the script):
import time

from cereal.messaging import SubMaster
from common.realtime import DT_MDL
from common.params import Params


def all_dead(managerState):
  # return if all processes that should be dead are dead
  return all([not p.running or p.shouldBeRunning for p in managerState.processes])


if __name__ == "__main__":
  params = Params()
  params.put_bool("FakeIgnition", False)

  sm = SubMaster(["driverStateV2", "managerState", "deviceState"])

  occurrences = 0
  loops = 0
  while 1:
    params.put_bool("FakeIgnition", True)
    sm.update(0)

    dmon_frame = None
    # print('Waiting for driverStateV2')
    st = time.monotonic()
    timeout = 15  # s

    # successful if we get 100 messages from dmonitoringmodeld (2s)
    while time.monotonic() - st < timeout:
      sm.update(0)
      time.sleep(DT_MDL)
      if sm.updated["driverStateV2"]:
        if dmon_frame is None:
          dmon_frame = sm.rcv_frame['driverStateV2']
        if (sm.rcv_frame['driverStateV2'] - dmon_frame) > (2 / DT_MDL):
          # print('Got driverStateV2! Exiting', sm.rcv_frame['driverStateV2'], dmon_frame)
          time.sleep(1)
          break
    else:
      occurrences += 1
      print('WARNING: timed out in 15s waiting for 40 messages from dmonitoringmodeld, occurrences:', occurrences, sm.rcv_frame['driverStateV2'], dmon_frame)
      print('CurrentRoute:', params.get('CurrentRoute'))

    # TODO: is there a better way? we can't check managerState immediately since it takes a while to get the ignition
    # wait for thermald to pick up ignition, then an update from managerState, and THEN it should be safe to check procs
    params.put_bool("FakeIgnition", False)
    while sm['deviceState'].started:
      sm.update(0)
      time.sleep(0.05)
    while not sm.updated['managerState']:
      sm.update(0)
      time.sleep(0.05)

    st = time.monotonic()
    while time.monotonic() - st < timeout:
      sm.update(0)
      time.sleep(0.1)
      if all_dead(sm['managerState']):
        # print('all dead')
        break
    else:
      print('WARNING: timed out waiting for processes to die!', time.monotonic() - st)

    time.sleep(5)
    loops += 1
    if loops % 120 == 0:
      print('Tries so far:', loops, 'occurrences:', occurrences)
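The boardd change isn't included above; something roughly like this is enough. Hypothetical sketch, not the actual patch:

// Hypothetical sketch, not the actual boardd change: OR the panda ignition
// signal with the "FakeIgnition" param that the script above toggles, somewhere
// in boardd's panda state loop, so ignition can be faked without a car.
#include "common/params.h"

static bool ignition_with_fake(bool ignition_from_panda) {
  // "FakeIgnition" is the key the script writes with put_bool()
  return ignition_from_panda || Params().getBool("FakeIgnition");
}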
Patch: https://github.com/commaai/openpilot/commit/9f02f004530c88c54e6b63e7eee78fee1671a006.patch
Added some logging in snpemodel.cc; here's what we got (note that there should be two of every log, since navmodeld and dmonitoringmodeld both go through this path).
It looks like the DM model loaded first, then got stuck somewhere around here? It never reaches the snpemodel.cc:56 "build" print that the nav model does, and I'm also not sure why some prints are missing.
[55769.278756] selfdrive/modeld/runners/snpemodel.cc:28 - SNPEModel: got runtime, models/navmodel_q.dlc
[55769.317072] selfdrive/modeld/runners/snpemodel.cc:37 - SNPEModel: loaded model with size: 4443461, models/dmonitoring_model_q.dlc
[55769.317815] selfdrive/modeld/runners/snpemodel.cc:41 - SNPEModel: snpe_builder, models/dmonitoring_model_q.dlc
[55769.318091] selfdrive/modeld/runners/snpemodel.cc:43 - SNPEModel: !snpe, models/dmonitoring_model_q.dlc
[55769.413642] selfdrive/modeld/runners/snpemodel.cc:37 - SNPEModel: loaded model with size: 3630942, models/navmodel_q.dlc
[55769.414311] selfdrive/modeld/runners/snpemodel.cc:41 - SNPEModel: snpe_builder, models/navmodel_q.dlc
[55769.414547] selfdrive/modeld/runners/snpemodel.cc:43 - SNPEModel: !snpe, models/navmodel_q.dlc
[55769.759642] selfdrive/modeld/modeld.cc:209 - main: models loaded, modeld starting
[55769.834323] selfdrive/modeld/runners/snpemodel.cc:56 - SNPEModel: build, models/navmodel_q.dlc
[55769.834793] selfdrive/modeld/runners/snpemodel.cc:61 - SNPEModel: got snpe, models/navmodel_q.dlc
[55769.835208] selfdrive/modeld/runners/snpemodel.cc:65 - SNPEModel: created output buffer, models/navmodel_q.dlc
[55769.835591] selfdrive/modeld/runners/snpemodel.cc:68 - SNPEModel: debug print1, models/navmodel_q.dlc
[55769.835965] selfdrive/modeld/runners/snpemodel.cc:73 - SNPEModel: debug print2, models/navmodel_q.dlc
[55769.836337] selfdrive/modeld/runners/snpemodel.cc:75 - SNPEModel: debug print3, models/navmodel_q.dlc
[55769.836704] selfdrive/modeld/runners/snpemodel.cc:82 - SNPEModel: debug print4, models/navmodel_q.dlc
[55769.837124] selfdrive/modeld/runners/snpemodel.cc:84 - SNPEModel: debug print5, models/navmodel_q.dlc
[55769.837508] selfdrive/modeld/runners/snpemodel.cc:86 - SNPEModel: debug print6, models/navmodel_q.dlc
[55769.837878] selfdrive/modeld/runners/snpemodel.cc:98 - addInput: got input tensor names, map
[55769.838253] selfdrive/modeld/runners/snpemodel.cc:102 - addInput: got single tensor name, map
[55769.838623] selfdrive/modeld/runners/snpemodel.cc:104 - addInput: adding index 0: input_img
[55769.838992] selfdrive/modeld/runners/snpemodel.cc:107 - addInput: new debug print1, map
[55769.839364] selfdrive/modeld/runners/snpemodel.cc:112 - addInput: new debug print2, map
[55769.839742] selfdrive/modeld/runners/snpemodel.cc:120 - addInput: new debug print3, map
[55769.840396] selfdrive/modeld/runners/snpemodel.cc:127 - addInput: new debug print4, map
[55769.840799] selfdrive/modeld/runners/snpemodel.cc:130 - addInput: new debug print5, map
[55769.841187] selfdrive/modeld/runners/snpemodel.cc:133 - addInput: new debug print6, map
[55769.841564] selfdrive/modeld/navmodeld.cc:47 - main: models loaded, navmodeld starting
[55769.841991] selfdrive/modeld/navmodeld.cc:56 - main: connected with buffer size: 98304
[55770.661376] selfdrive/modeld/modeld.cc:226 - main: vision stream set up, main_wide_camera: 0, use_extra_client: 1
[55770.706299] selfdrive/modeld/modeld.cc:240 - main: connected main cam with buffer size: 4804608 (1928 x 1208)
[55770.706977] selfdrive/modeld/modeld.cc:244 - main: connected extra cam with buffer size: 4804608 (1928 x 1208)
[55770.799899] selfdrive/modeld/modeld.cc:173 - run_model: skipping model eval. Dropped -1 frames
Merged #29080, but I'd still like to fix this properly soon. Moving to the 0.9.5 milestone.
Closing since it doesn't seem worth spending more time on looking into this SNPE issue - the real fix is moving to tinygrad!
Likely stuck in the VisionIPC connect loop. Actually, it could also be stuck in the model init, which was refactored recently; added some logging around that in #28920.
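For reference, the connect loop in question looks roughly like this in the modeld-style startup path. Paraphrased sketch, not an exact copy:

// Paraphrased sketch of the startup path (not an exact copy of modeld/dmonitoringmodeld).
#include "cereal/visionipc/visionipc_client.h"
#include "common/util.h"

static ExitHandler do_exit;  // openpilot's SIGINT/SIGTERM flag

int main() {
  // The process spins here until camerad's stream is available, so a hang at
  // startup is either this loop never connecting or the model init that runs
  // before it never finishing.
  VisionIpcClient vipc_client("camerad", VISION_STREAM_DRIVER, true);
  while (!do_exit && !vipc_client.connect(false)) {
    util::sleep_for(100);
  }
  return 0;
}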