jackz314 closed this issue 2 years ago
When I looked into this, the Python ONNX runner script crashed due to some missing CUDA shared libraries. Haven't had the time to fix it yet.
I don't think it's a missing library problem, because it works fine sometimes; it's more likely a race condition issue.
I also tested OP with CPU only, and it appears the CPU provider might have the same problem.
If #20227 is only a temporary fix that forces CPU, I don't think you should close this, since it doesn't fix the underlying problem.
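For anyone trying to tell the two hypotheses apart, a quick check along these lines can help (just a sketch; the exact CUDA runtime library name is an assumption and may differ per install). It shows whether the CUDA runtime and the CUDA provider are even loadable on the machine, independent of any race:

```python
# Quick diagnostic (sketch): is the failure a missing CUDA runtime, or
# something intermittent like a race? The library name below is an
# assumption; adjust it to the CUDA version your onnxruntime build expects.
import ctypes

try:
    ctypes.CDLL("libcudart.so.11.0", mode=ctypes.RTLD_GLOBAL)
    print("CUDA runtime loads fine")
except OSError as e:
    print("CUDA runtime not loadable:", e)

try:
    import onnxruntime as ort
    print("available providers:", ort.get_available_providers())
except (ImportError, OSError) as e:
    # onnxruntime-gpu can fail at import time when CUDA libs are missing
    print("onnxruntime failed to import:", e)
```

If this fails every time, it's a setup problem; if it always succeeds but modeld still crashes intermittently, that points back at a race.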
We haven't seen this after the fixes in that PR. @iejMac will be spending some more time this week on making the sim stuff high quality, so if this is still an issue, we'll see it and make sure it's fixed.
I still get this error when running on the fresh openpilot-sim container (sha256:1a618f25cd) on CPU. My steps to repro are:
docker create --net=host \
--name openpilot_client \
--rm \
-it \
--device=/dev/dri \
-v /tmp/.X11-unix:/tmp/.X11-unix \
--shm-size 1G \
-e DISPLAY=$DISPLAY \
-e QT_X11_NO_MITSHM=1 \
ghcr.io/commaai/openpilot-sim \
/bin/bash -c "cd /openpilot/tools/sim && ./tmux_script.sh $*"
(Notes: a. fixed the ghcr.io docker repo; b. no --gpus all flag; c. create instead of run, as my CARLA server is on a separate machine and I need to modify the IP address in bridge.py)
followed by:
docker cp bridge.py openpilot_client:/openpilot/tools/sim/bridge.py
docker start openpilot_client -i
The model works approximately one time out of five. I see the UI and can control the car with WASD, but OP commands do not work: pressing 1 gives the "openpilot unavailable, communication issue between processes" message, which I'm failing to resolve by commenting out lines in controlsd.py as indicated in the wiki.
Edit: OP commands do work; I managed to fix controlsd.py, it was line 214 that needed commenting out. :+1:
Sorry for the noise, but I just wanted to let you know that after a few successful runs the model stopped crashing, and I can't replicate the issue anymore. It works every time :(
OK, I got it to crash a bit more reliably by not giving the display to the container, i.e. running like this:
docker create --net=host --name openpilot_client --rm -it --shm-size 1G -e QT_X11_NO_MITSHM=1 ghcr.io/commaai/openpilot-sim /bin/bash -c "cd /openpilot/tools/sim && ./tmux_script.sh $*"
Still giving me the same issue sometimes on the latest master
I encountered the same error:
modeld: selfdrive/modeld/runners/onnxmodel.cc:65: void ONNXModel::pwrite(float *, int): Assertion 'err >= 0' failed.
In particular, I also saw the missing shared library mentioned by @pd0wm:
radarState: Reader was evicted, reconnecting
Traceback (most recent call last):
File "/home/zhongzzy9/openpilot/selfdrive/modeld/runners/onnx_runner.py", line 9, in <module>
import onnxruntime as ort
File "/home/zhongzzy9/.pyenv/versions/3.8.5/lib/python3.8/site-packages/onnxruntime/__init__.py", line 34, in <module>
raise import_capi_exception
File "/home/zhongzzy9/.pyenv/versions/3.8.5/lib/python3.8/site-packages/onnxruntime/__init__.py", line 23, in <module>
from onnxruntime.capi._pybind_state import get_all_providers, get_available_providers, get_device, set_seed, \
File "/home/zhongzzy9/.pyenv/versions/3.8.5/lib/python3.8/site-packages/onnxruntime/capi/_pybind_state.py", line 11, in <module>
from . import _ld_preload # noqa: F401
File "/home/zhongzzy9/.pyenv/versions/3.8.5/lib/python3.8/site-packages/onnxruntime/capi/_ld_preload.py", line 12, in <module>
_libcudart = CDLL("libcudart.so.11.0", mode=RTLD_GLOBAL)
File "/home/zhongzzy9/.pyenv/versions/3.8.5/lib/python3.8/ctypes/__init__.py", line 373, in __init__
self._handle = _dlopen(self._name, mode)
OSError: libcudart.so.11.0: cannot open shared object file: No such file or directory
Starting listener for: camerad
What did people do to address this?
I figured it out. In my case, the cause was that onnxruntime-gpu was installed but CUDA was not.
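For reference, a minimal sketch of a provider fallback (assuming a recent onnxruntime where InferenceSession takes a providers list; this is not the actual onnx_runner.py code): prefer CUDA when onnxruntime reports it as available, and fall back to the CPU provider otherwise instead of crashing.

```python
# Sketch of a provider fallback, assuming onnxruntime is importable.
import onnxruntime as ort

def make_session(model_path: str) -> ort.InferenceSession:
    available = ort.get_available_providers()
    if "CUDAExecutionProvider" in available:
        chosen = ["CUDAExecutionProvider", "CPUExecutionProvider"]
    else:
        chosen = ["CPUExecutionProvider"]
    return ort.InferenceSession(model_path, providers=chosen)

# model_path is hypothetical here; in the sim it would be the supercombo model.
# sess = make_session("supercombo.onnx")
```

Note this only helps once onnxruntime itself imports; with an onnxruntime-gpu build and no CUDA runtime installed, the import itself fails as in the traceback above, so the real fix is installing CUDA or switching to the CPU-only onnxruntime package.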
Describe the bug
When starting OP on Ubuntu for simulation, the following error sometimes occurs and modeld won't work. This is likely caused by a race condition somewhere, since it works fine other times.
The specific reason for this error is writing to a broken pipe according to the errno, but I don't know why exactly this happens.
How to reproduce or log data
Run OP on Ubuntu with CUDAExecutionProvider
Expected behavior
modeld works properly
Additional context
I tested for this bug multiple times. From what I'm seeing so far, it rarely happens with ONNX's normal CPUExecutionProvider, but once I switch to the faster CUDAExecutionProvider, the problem begins to happen more frequently. This could be the race condition itself, or maybe it's just my testing environment. (A sketch of the broken-pipe failure mode follows below.)
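For context on the errno mentioned in the description, here is a minimal Python sketch (not the actual onnxmodel.cc code, which is C++) of the failure mode the 'err >= 0' assertion trips on: writing to a pipe whose reader has already gone away fails with EPIPE.

```python
# Sketch: writing to a pipe whose reader has exited (e.g. the ONNX runner
# process died, or a startup race means it was never up) raises EPIPE
# instead of succeeding.
import errno
import os

def pipe_write(fd: int, data: bytes) -> bool:
    try:
        os.write(fd, data)
        return True
    except OSError as e:
        if e.errno == errno.EPIPE:
            # Reader is gone: the runner process crashed or was not up yet.
            return False
        raise
```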
Operating system: Ubuntu 20.10