onnxmodel pwrite broken pipe on CUDAExecutionProvider

commaai / openpilot

openpilot is an operating system for robotics. Currently, it upgrades the driver assistance system on 275+ supported cars.

https://comma.ai/openpilot

MIT License


onnxmodel pwrite broken pipe on CUDAExecutionProvider #19608

Closed jackz314 closed 2 years ago

jackz314 commented 3 years ago

Describe the bug

When starting OP on Ubuntu for simulation, the following error sometimes occurs and modeld does not work. This is likely caused by a race condition somewhere, since it works fine at other times.

modeld: selfdrive/modeld/runners/onnxmodel.cc:63: void ONNXModel::pwrite(float *, int): Assertion `err >= 0' failed.

According to the errno, the specific reason for this error is a write to a broken pipe, but I don't know exactly why that happens.

How to reproduce or log data

Run OP on Ubuntu with CUDAExecutionProvider

Expected behavior

modeld works properly

Additional context

I tested for this bug multiple times. From what I've seen so far, it rarely happens with ONNX's normal CPUExecutionProvider, but once I switch to the faster CUDAExecutionProvider, it happens more frequently. This could be the race condition itself, or maybe it's just my testing environment.

Operating system: Ubuntu 20.10

pd0wm commented 3 years ago

When I looked into this, the Python ONNX runner script had crashed due to some missing CUDA shared libraries. I haven't had time to fix it yet.
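
A runner crash like this would be consistent with the broken pipe error. Assuming modeld sends its input to the Python runner through a pipe (an assumption based on the pwrite assertion and the onnx_runner.py script, not confirmed in this thread), a write to a pipe whose reader has already exited fails with EPIPE. Below is a minimal sketch of that failure mode on a POSIX system; the child command and the `write_to_dead_child` helper are hypothetical stand-ins, not openpilot code:

```python
import subprocess
import sys


def write_to_dead_child() -> str:
    """Write to the stdin pipe of a child process that has already exited."""
    # Hypothetical stand-in for a runner script that dies on a failed import.
    child = subprocess.Popen(
        [sys.executable, "-c", "raise SystemExit(1)"],
        stdin=subprocess.PIPE,
        bufsize=0,  # unbuffered, so the write goes straight to the pipe
    )
    child.wait()  # the child is gone, so the read end of the pipe is closed
    try:
        child.stdin.write(b"\x00" * 1024)
    except BrokenPipeError:
        # errno EPIPE, which Python surfaces as BrokenPipeError
        return "broken pipe"
    finally:
        child.stdin.close()
    return "write succeeded"
```

Calling `write_to_dead_child()` on Linux returns "broken pipe", because the only reader of the pipe has exited before the write. If the runner crashes only some of the time, or the writer sometimes gets there first, the failure would look intermittent.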

jackz314 commented 3 years ago

I don't think it's a missing library problem, because it works fine sometimes. I think it's more likely a race condition.

I also tested OP with CPU only, and it appears the CPU provider might have the same problem.

jackz314 commented 3 years ago

If #20227 is only a temporary fix that only uses CPU, I don't think you should close this as it doesn't fix the problem.

adeebshihadeh commented 3 years ago

We haven't seen this after the fixes in that PR. @iejMac will be spending some more time this week on making the sim stuff high quality, so if this is still an issue, we'll see it and make sure it's fixed.

psarka commented 3 years ago

I still get this error when running on the fresh openpilot-sim container (sha256:1a618f25cd) on CPU. My steps to repro are:

docker create --net=host \
  --name openpilot_client \
  --rm \
  -it \
  --device=/dev/dri \
  -v /tmp/.X11-unix:/tmp/.X11-unix \
  --shm-size 1G \
  -e DISPLAY=$DISPLAY \
  -e QT_X11_NO_MITSHM=1 \
  ghcr.io/commaai/openpilot-sim \
  /bin/bash -c "cd /openpilot/tools/sim && ./tmux_script.sh $*"

(Notes: a. fixed the ghcr.io Docker repo; b. no --gpus all flag; c. create instead of run, because my CARLA server runs on a server and I need to modify the IP address in bridge.py.)

followed by:

docker cp bridge.py openpilot_client:/openpilot/tools/sim/bridge.py
docker start openpilot_client -i

The model works approximately one time out of five. I see the UI and can control the car with WASD, but OP commands do not work: pressing 1 gives the "open pilot unavailable communication issue between processes" message, which I have failed to resolve by commenting out lines in controlsd.py as indicated in the wiki.

Edit: OP commands do work now. I managed to fix controlsd.py; it was line 214 that needed commenting out. 👍

psarka commented 3 years ago

Sorry for the noise, but I just wanted to let you know that after a few successful runs the model stopped crashing, and I can't replicate the issue anymore. It works every time :(

psarka commented 3 years ago

OK, I got it to crash a bit more reliably by not giving the display to the image, i.e. running it like this:

docker create --net=host --name openpilot_client --rm -it --shm-size 1G -e QT_X11_NO_MITSHM=1 ghcr.io/commaai/openpilot-sim /bin/bash -c "cd /openpilot/tools/sim && ./tmux_script.sh $*"

ghost commented 3 years ago

This is still giving me the same issue sometimes on the latest master.

AIasd commented 2 years ago

I encountered the same error:

modeld: selfdrive/modeld/runners/onnxmodel.cc:65: void ONNXModel::pwrite(float *, int): Assertion 'err >= 0' failed.

In particular, I also saw the missing shared library error mentioned by @pd0wm:

radarState: Reader was evicted, reconnecting
Traceback (most recent call last):
  File "/home/zhongzzy9/openpilot/selfdrive/modeld/runners/onnx_runner.py", line 9, in <module>
    import onnxruntime as ort
  File "/home/zhongzzy9/.pyenv/versions/3.8.5/lib/python3.8/site-packages/onnxruntime/__init__.py", line 34, in <module>
    raise import_capi_exception
  File "/home/zhongzzy9/.pyenv/versions/3.8.5/lib/python3.8/site-packages/onnxruntime/__init__.py", line 23, in <module>
    from onnxruntime.capi._pybind_state import get_all_providers, get_available_providers, get_device, set_seed, \
  File "/home/zhongzzy9/.pyenv/versions/3.8.5/lib/python3.8/site-packages/onnxruntime/capi/_pybind_state.py", line 11, in <module>
    from . import _ld_preload  # noqa: F401
  File "/home/zhongzzy9/.pyenv/versions/3.8.5/lib/python3.8/site-packages/onnxruntime/capi/_ld_preload.py", line 12, in <module>
    _libcudart = CDLL("libcudart.so.11.0", mode=RTLD_GLOBAL)
  File "/home/zhongzzy9/.pyenv/versions/3.8.5/lib/python3.8/ctypes/__init__.py", line 373, in __init__
    self._handle = _dlopen(self._name, mode)
OSError: libcudart.so.11.0: cannot open shared object file: No such file or directory
Starting listener for: camerad

What did people do to address this?

AIasd commented 2 years ago

I figured it out. In my case, the cause was that onnxruntime-gpu was installed but CUDA was not.
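
One way to check for this condition before launching modeld is to try loading the CUDA runtime the same way the traceback above shows onnxruntime doing it. This is only a sketch: the `cuda_runtime_loadable` helper is hypothetical, and the default library name is copied from that traceback, so other onnxruntime-gpu builds may expect a different version.

```python
import ctypes


def cuda_runtime_loadable(lib_name: str = "libcudart.so.11.0") -> bool:
    """Return True if the dynamic loader can find and open lib_name."""
    try:
        # Same kind of call that fails in onnxruntime/capi/_ld_preload.py above.
        ctypes.CDLL(lib_name, mode=ctypes.RTLD_GLOBAL)
    except OSError:
        # Raised when the shared object cannot be found or opened.
        return False
    return True


if cuda_runtime_loadable():
    print("CUDA runtime found")
else:
    print("CUDA runtime missing: install CUDA or use the CPU-only onnxruntime package")
```

A passing check only shows that the CUDA runtime itself can be loaded; other CUDA libraries that onnxruntime-gpu depends on could still be missing.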