ADS-Testing / SAMOTA

Replication of "(ICSE 22) Efficient Online Testing for DNN-based Systems using Surrogate-Assisted and Many-Objective Optimization"
MIT License

No results after a single scenario simulation #6

Open donghwan-shin opened 8 months ago

donghwan-shin commented 8 months ago

Description

run_RS.py does not produce any meaningful simulation results. It turns out that the pylot docker container does not generate any simulation results internally.

One can replicate this using the scripts in the debug branch.

Possible solutions

donghwan-shin commented 8 months ago

Just to report: a suspicious warning after running run_RS.py:

ALSA lib confmisc.c:767:(parse_card) cannot find card '0'
donghwan-shin commented 8 months ago

Another suspicious warning:

2023-11-02 15:20:16.630130: W tensorflow/core/framework/cpu_allocator_impl.cc:81] Allocation of 18874368 exceeds 10% of system memory.
donghwan-shin commented 8 months ago

Another one:

Adding overall evaluation operator...

WARNING: Version mismatch detected: You are trying to connect to a simulator that might be incompatible with this API

WARNING: Client API version     = 784d9b9f

WARNING: Simulator API version  = 0.9.10.1 

..I1102 15:20:13.263662 140518537275136 __init__.py:91] backend TkAgg version unknown
donghwan-shin commented 8 months ago

I managed to get the simulation result for a given scenario vector using the following commands, but only by connecting directly to the docker container via two ssh terminals.

Terminal 1

Input:

cd workspace/pylot/scripts/
./run_simulator.sh

Output:

4.24.3-0+++UE4+Release-4.24 518 0
Disabling core dumps.
sh: 1: xdg-user-dir: not found
ALSA lib confmisc.c:767:(parse_card) cannot find card '0'
ALSA lib conf.c:4528:(_snd_config_evaluate) function snd_func_card_driver returned error: No such file or directory
ALSA lib confmisc.c:392:(snd_func_concat) error evaluating strings
ALSA lib conf.c:4528:(_snd_config_evaluate) function snd_func_concat returned error: No such file or directory
ALSA lib confmisc.c:1246:(snd_func_refer) error evaluating name
ALSA lib conf.c:4528:(_snd_config_evaluate) function snd_func_refer returned error: No such file or directory
ALSA lib conf.c:5007:(snd_config_expand) Evaluate error: No such file or directory
ALSA lib pcm.c:2495:(snd_pcm_open_noupdate) Unknown PCM default
ALSA lib confmisc.c:767:(parse_card) cannot find card '0'
ALSA lib conf.c:4528:(_snd_config_evaluate) function snd_func_card_driver returned error: No such file or directory
ALSA lib confmisc.c:392:(snd_func_concat) error evaluating strings
ALSA lib conf.c:4528:(_snd_config_evaluate) function snd_func_concat returned error: No such file or directory
ALSA lib confmisc.c:1246:(snd_func_refer) error evaluating name
ALSA lib conf.c:4528:(_snd_config_evaluate) function snd_func_refer returned error: No such file or directory
ALSA lib conf.c:5007:(snd_config_expand) Evaluate error: No such file or directory
ALSA lib pcm.c:2495:(snd_pcm_open_noupdate) Unknown PCM default

Terminal 2

Input (assuming e2e.conf is already copied to workspace/pylot/configs/):

cd workspace/pylot/
python3 pylot.py --flagfile=configs/e2e.conf

NOTE: e2e.conf represents the scenario vector we want to simulate. It is automatically generated by update_file_contents() in implementation/runner/runner.py.

Output:

I1102 17:04:28.021970 140297964369728 __init__.py:409] $HOME=/home/erdos
I1102 17:04:28.022339 140297964369728 __init__.py:409] matplotlib data path /home/erdos/.local/lib/python3.6/site-packages/matplotlib/mpl-data
I1102 17:04:28.027994 140297964369728 __init__.py:1156] loaded rc file /home/erdos/.local/lib/python3.6/site-packages/matplotlib/mpl-data/matplotlibrc
I1102 17:04:28.029642 140297964369728 __init__.py:1879] matplotlib version 2.2.4
I1102 17:04:28.029757 140297964369728 __init__.py:1880] interactive is False
I1102 17:04:28.030380 140297964369728 __init__.py:1881] platform is linux
I1102 17:04:28.030596 140297964369728 __init__.py:1882] loaded modules: [... LOTS OF MODULES]
...
I1102 17:04:28.047724 140297964369728 __init__.py:409] CACHEDIR=/home/erdos/.cache/matplotlib
I1102 17:04:28.049496 140297964369728 font_manager.py:1468] Using fontManager instance from /home/erdos/.cache/matplotlib/fontList.json
I1102 17:04:28.515479 140297964369728 component_creator.py:78] Using obstacle detector...
I1102 17:04:28.516728 140297964369728 component_creator.py:84] Adding obstacle location finder...
I1102 17:04:28.571189 140297964369728 component_creator.py:162] Adding traffic light camera...
I1102 17:04:28.571847 140297964369728 component_creator.py:175] Using traffic light detection...
I1102 17:04:28.575519 140297964369728 component_creator.py:355] Using obstacle tracker...
I1102 17:04:28.577414 140297964369728 component_creator.py:361] Adding operator to compute obstacle location history...
I1102 17:04:28.579459 140297964369728 component_creator.py:428] Using perfect semantic segmentation...
I1102 17:04:28.579588 140297964369728 component_creator.py:254] Using perfect depth estimation...
I1102 17:04:28.579725 140297964369728 component_creator.py:472] Using R2P2 prediction...
I1102 17:04:28.935176 140297964369728 component_creator.py:529] Using behavior planning...
I1102 17:04:28.939563 140297964369728 component_creator.py:533] Using planning...
I1102 17:04:28.941814 140297964369728 component_creator.py:564] Using MPC controller...
I1102 17:04:29.141222 140297964369728 component_creator.py:595] Adding collision logging sensor...
I1102 17:04:29.142883 140297964369728 component_creator.py:599] Adding lane invasion sensor...
I1102 17:04:29.143914 140297964369728 component_creator.py:603] Adding traffic light invasion sensor...
I1102 17:04:29.157925 140297964369728 geos.py:67] Found GEOS DLL: <CDLL '/home/erdos/.local/lib/python3.6/site-packages/shapely/.libs/libgeos_c-bd8d3f16.so.1.10.2', handle 7d29350 at 0x7f9876cfdcc0>, using it.
I1102 17:04:29.168450 140297964369728 geos.py:32] Trying `CDLL(libc.so.6)`
I1102 17:04:29.168728 140297964369728 geos.py:49] Library path: 'libc.so.6'
I1102 17:04:29.168776 140297964369728 geos.py:50] DLL: <CDLL 'libc.so.6', handle 7f99aa548000 at 0x7f9876cfd7f0>
I1102 17:04:29.190011 140297964369728 component_creator.py:608] Adding overall evaluation operator...
WARNING: R2P2 predicts only vehicle trajectories
I1102 17:04:31.698450 140295333193472 __init__.py:91] backend TkAgg version unknown
ERROR: CARLA version 784d9b9f is not supported; assuming this is version 0.9.10
ERROR: CARLA version 784d9b9f is not supported; assuming this is version 0.9.10
moving pedestrians to correct location ## THIS WAS VERY TIME-CONSUMING!
Deep sort model loaded
2023-11-02 17:14:50.826042: W tensorflow/stream_executor/cuda/redzone_allocator.cc:312] Internal: ptxas exited with non-zero error code 65280, output: ptxas fatal   : Value 'sm_86' is not defined for option 'gpu-name'

Relying on driver to perform ptx compilation. This message will be only logged once.
2023-11-02 17:14:52.986476: W tensorflow/core/common_runtime/bfc_allocator.cc:305] Garbage collection: deallocate free memory regions (i.e., allocations) so that we can re-allocate a larger region to avoid OOM due to memory fragmentation. If you see this message frequently, you are running near the threshold of the available device memory and re-allocation may incur great performance overhead. You may try smaller batch sizes to observe the performance impact. Set TF_ENABLE_GPU_GARBAGE_COLLECTION=false if you'd like to disable this feature.
2023-11-02 17:14:55.697979: W tensorflow/stream_executor/cuda/redzone_allocator.cc:312] Internal: ptxas exited with non-zero error code 65280, output: ptxas fatal   : Value 'sm_86' is not defined for option 'gpu-name'

Relying on driver to perform ptx compilation. This message will be only logged once.
2023-11-02 17:14:58.182948: W tensorflow/core/common_runtime/bfc_allocator.cc:305] Garbage collection: deallocate free memory regions (i.e., allocations) so that we can re-allocate a larger region to avoid OOM due to memory fragmentation. If you see this message frequently, you are running near the threshold of the available device memory and re-allocation may incur great performance overhead. You may try smaller batch sizes to observe the performance impact. Set TF_ENABLE_GPU_GARBAGE_COLLECTION=false if you'd like to disable this feature.
2023-11-02 17:15:19.164775: W tensorflow/core/common_runtime/bfc_allocator.cc:239] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.15GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2023-11-02 17:15:19.166157: W tensorflow/core/common_runtime/bfc_allocator.cc:239] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.15GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2023-11-02 17:15:28.974251: W tensorflow/core/common_runtime/bfc_allocator.cc:239] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.15GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2023-11-02 17:15:28.974313: W tensorflow/core/common_runtime/bfc_allocator.cc:239] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.15GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2023-11-02 17:15:29.572823: W tensorflow/core/common_runtime/bfc_allocator.cc:239] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.38GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2023-11-02 17:15:29.572879: W tensorflow/core/common_runtime/bfc_allocator.cc:239] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.38GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2023-11-02 17:15:30.932750: W tensorflow/core/common_runtime/bfc_allocator.cc:239] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.71GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2023-11-02 17:15:30.932794: W tensorflow/core/common_runtime/bfc_allocator.cc:239] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.71GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2023-11-02 17:15:39.989431: W tensorflow/core/common_runtime/bfc_allocator.cc:239] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.38GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2023-11-02 17:15:39.989487: W tensorflow/core/common_runtime/bfc_allocator.cc:239] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.38GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2023-11-02 17:15:41.445624: W tensorflow/core/common_runtime/bfc_allocator.cc:239] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.71GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2023-11-02 17:15:41.445679: W tensorflow/core/common_runtime/bfc_allocator.cc:239] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.71GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
ERROR: CARLA version 784d9b9f is not supported; assuming this is version 0.9.10
ERROR: CARLA version 784d9b9f is not supported; assuming this is version 0.9.10
I1102 17:15:52.702555 140295921145600 fitness_value_extractor.py:131] Location(x=366.535767, y=2.010409, z=0.242859)>DfC:0.0000,DfV:1000.00,DfP:8.02,DfM:6.35,DT:20.72
Traceback (most recent call last):
  File "/home/erdos/.local/lib/python3.6/site-packages/erdos-0.3.1-py3.6-linux-x86_64.egg/erdos/streams.py", line 114, in internal_callback
    callback(msg, *write_streams)
  File "/home/erdos/.local/lib/python3.6/site-packages/erdos-0.3.1-py3.6-linux-x86_64.egg/erdos/__init__.py", line 294, in wrapper
    return func(*args, **kwargs)
  File "/home/erdos/workspace/pylot/pylot/perception/detection/detection_operator.py", line 142, in on_msg_camera_stream
    msg.frame.camera_setup.height)),
  File "/home/erdos/workspace/pylot/pylot/perception/detection/utils.py", line 61, in __init__
    assert x_min < x_max and y_min < y_max
AssertionError

Although this ends with an assertion error, the following file is generated: /home/erdos/workspace/results/[2, 0, 0, 0, 0, 0, 0, 1, 0, 0, 4, 1, 2, 1, 0, 0], and its content is as follows:

11-02-2023 17:15:52 | Location(x=366.535767, y=2.010409, z=0.242859)>DfC:0.0000,DfV:1000.00,DfP:8.02,DfM:6.35,DT:20.72
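For reference, a line in this format can be split into named values with a few lines of Python. This is only a minimal sketch based on the single line shown above; parse_fitness_line is a hypothetical helper name and not something from the SAMOTA or Pylot code, and it assumes every metric appears as name:value after the '>' character.

```python
import re

# Matches "name:value" pairs such as "DfC:0.0000" or "DT:20.72".
_PAIR = re.compile(r"([A-Za-z]+):(-?\d+(?:\.\d+)?)")


def parse_fitness_line(line):
    """Return {metric name: float} for the part of the line after '>'.

    Hypothetical helper: assumes the layout seen in the result file above.
    """
    # Everything before the first '>' is the timestamp and Location(...) prefix.
    _, _, metrics = line.partition(">")
    return {name: float(value) for name, value in _PAIR.findall(metrics)}


line = ("11-02-2023 17:15:52 | Location(x=366.535767, y=2.010409, z=0.242859)"
        ">DfC:0.0000,DfV:1000.00,DfP:8.02,DfM:6.35,DT:20.72")
parsed = parse_fitness_line(line)
print(parsed)
```

A line without a '>' yields an empty dict, so malformed lines can be skipped by the caller.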
donghwan-shin commented 8 months ago

Problem solved!?

It turns out that a single simulation usually does not "finish" by itself, so we need to "stop" it at a certain point. I don't understand exactly why, but even when there is no pylot:/home/erdos/workspace/results/finish.txt, correct log files are generated under pylot:/home/erdos/workspace/results/, and they can be copied to the local machine properly after forcefully "stopping" the simulation.

So, I updated run_single_simulation() as follows. https://github.com/ADS-Testing/SAMOTA/blob/4bace75ac2a86ea4d3c3633e71eee25b24315863/implementation/runner/runner.py#L221-L245
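The general idea (wait up to a fixed time, then stop the process and collect whatever was written) can be sketched as below. This is not the code behind the link above; run_with_timeout is a hypothetical helper, and the sleeping child process only stands in for a simulation that never finishes on its own.

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout_s):
    """Run cmd; if it is still running after timeout_s seconds, stop it.

    Returns True if the process finished by itself, False if it was stopped.
    """
    proc = subprocess.Popen(cmd)
    try:
        proc.wait(timeout=timeout_s)
        return True
    except subprocess.TimeoutExpired:
        proc.terminate()  # ask the process to exit first
        try:
            proc.wait(timeout=5)
        except subprocess.TimeoutExpired:
            proc.kill()  # force it if terminate is ignored
            proc.wait()
        return False


# Stand-in for a simulation that does not finish by itself.
never_ends = [sys.executable, "-c", "import time; time.sleep(60)"]
finished = run_with_timeout(never_ends, timeout_s=1)
print("finished by itself:", finished)
```

In the real setup, the result files would be copied out of the container after the function returns, whether or not the run finished by itself.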

I successfully ran run_RS.py on my RONIN machine with the following results:

ubuntu@ip-172-19-197-53:~/SAMOTA/implementation/runner/Results$ ls -l
total 140
-rw-rw-r-- 1 ubuntu ubuntu 137463 Nov  2 22:16 '[2, 0, 0, 0, 0, 0, 0, 1, 0, 0, 4, 1, 2, 1, 0, 0]'
-rw-rw-r-- 1 ubuntu ubuntu     34 Nov  2 22:14 '[2, 0, 0, 0, 0, 0, 0, 1, 0, 0, 4, 1, 2, 1, 0, 0]_ex.log'

And these are the fitness scores: DfC_min: 0, DfV_min: 1, DfP_min: 1, DfM_min: 1, DT_max: 0.4382239382239382, traffic_lights_max: 1

If you want to check the configuration of my RONIN machine, please refer to this:

[image: RONIN machine configuration]
ubuntu@ip-172-19-197-53:~/SAMOTA/implementation/runner/Results$ pip freeze
appdirs==1.4.3
attrs==19.3.0
Automat==0.8.0
blinker==1.4
certifi==2019.11.28
chardet==3.0.4
Click==7.0
cloud-init==23.3.1
colorama==0.4.3
command-not-found==0.3
configobj==5.0.6
constantly==15.1.0
cryptography==2.8
cupshelpers==1.0
dbus-python==1.2.16
defer==1.0.6
distlib==0.3.0
distro==1.4.0
distro-info==0.23+ubuntu1.1
ec2-hibinit-agent==1.0.0
entrypoints==0.3
filelock==3.0.12
hibagent==1.0.1
httplib2==0.14.0
hyperlink==19.0.0
idna==2.8
importlib-metadata==1.5.0
incremental==16.10.1
Jinja2==2.10.1
jsonpatch==1.22
jsonpointer==2.0
jsonschema==3.2.0
keyring==18.0.1
language-selector==0.1
launchpadlib==1.10.13
lazr.restfulclient==0.14.2
lazr.uri==1.0.3
MarkupSafe==1.1.0
more-itertools==4.2.0
netifaces==0.10.4
oauthlib==3.1.0
pexpect==4.6.0
pipenv==11.9.0
pyasn1==0.4.2
pyasn1-modules==0.2.1
pycairo==1.16.2
pycups==1.9.73
PyGObject==3.36.0
PyHamcrest==1.9.0
PyJWT==1.7.1
pymacaroons==0.13.0
PyNaCl==1.3.0
pyOpenSSL==19.0.0
pyrsistent==0.15.5
pyserial==3.4
python-apt==2.0.1+ubuntu0.20.4.1
python-dateutil==2.7.3
python-debian==0.1.36+ubuntu1.1
PyYAML==5.3.1
requests==2.22.0
requests-unixsocket==0.2.0
screen-resolution-extra==0.0.0
SecretStorage==2.3.1
service-identity==18.1.0
simplejson==3.16.0
six==1.14.0
sos==4.5.6
ssh-import-id==5.10
systemd-python==234
Twisted==18.9.0
ubuntu-advantage-tools==8001
ubuntu-drivers-common==0.0.0
ufw==0.36
unattended-upgrades==0.1
urllib3==1.25.8
virtualenv==20.0.17
virtualenv-clone==0.3.0
wadllib==1.3.3
xkit==0.0.0
zipp==1.0.0
zope.interface==4.7.1
ubuntu@ip-172-19-197-53:~/SAMOTA/implementation/runner/Results$ docker images
REPOSITORY           TAG       IMAGE ID       CREATED       SIZE
erdosproject/pylot   v0.3.2    e7ad9adf33b1   2 years ago   30.7GB

Remaining issue (edit: 5 Nov 2023)

Still, it's worth checking why we need to "stop" the simulation instead of waiting for its "natural" finish. I guess this is more of a Pylot issue than a CARLA one.

Answer from the developer (former PhD student):

This value was meant to stop the scenario after 10 minutes, so that the ego vehicle does not stay stuck forever through no fault of its own (e.g., when the vif is stuck due to a collision with a traffic light).

rsomers1998 commented 8 months ago

Now that we have a scenario running, a new error has appeared when running SAMOTA through: python run_SAMOTA.py

Traceback (most recent call last):
  File "run_SAMOTA.py", line 6, in <module>
    from samota import *
  File "/home/ubuntu/carla/SAMOTA/implementation/runner/lib/samota.py", line 6, in <module>
    from RBF import Model as RBF_Model
  File "/home/ubuntu/carla/SAMOTA/implementation/runner/lib/RBF.py", line 7, in <module>
    from keras import backend as K
  File "/home/ubuntu/miniconda3/envs/carla/lib/python3.8/site-packages/keras/__init__.py", line 1, in <module>
    from . import utils
  File "/home/ubuntu/miniconda3/envs/carla/lib/python3.8/site-packages/keras/utils/__init__.py", line 1, in <module>
    from tensorflow.keras.utils import *
  File "/home/ubuntu/miniconda3/envs/carla/lib/python3.8/site-packages/tensorflow/__init__.py", line 41, in <module>
    from tensorflow.python.tools import module_util as _module_util
  File "/home/ubuntu/miniconda3/envs/carla/lib/python3.8/site-packages/tensorflow/python/__init__.py", line 53, in <module>
    from tensorflow.core.framework.graph_pb2 import *
  File "/home/ubuntu/miniconda3/envs/carla/lib/python3.8/site-packages/tensorflow/core/framework/graph_pb2.py", line 16, in <module>
    from tensorflow.core.framework import function_pb2 as tensorflow_dot_core_dot_framework_dot_function__pb2
  File "/home/ubuntu/miniconda3/envs/carla/lib/python3.8/site-packages/tensorflow/core/framework/function_pb2.py", line 16, in <module>
    from tensorflow.core.framework import attr_value_pb2 as tensorflow_dot_core_dot_framework_dot_attr__value__pb2
  File "/home/ubuntu/miniconda3/envs/carla/lib/python3.8/site-packages/tensorflow/core/framework/attr_value_pb2.py", line 16, in <module>
    from tensorflow.core.framework import tensor_pb2 as tensorflow_dot_core_dot_framework_dot_tensor__pb2
  File "/home/ubuntu/miniconda3/envs/carla/lib/python3.8/site-packages/tensorflow/core/framework/tensor_pb2.py", line 16, in <module>
    from tensorflow.core.framework import resource_handle_pb2 as tensorflow_dot_core_dot_framework_dot_resource__handle__pb2
  File "/home/ubuntu/miniconda3/envs/carla/lib/python3.8/site-packages/tensorflow/core/framework/resource_handle_pb2.py", line 16, in <module>
    from tensorflow.core.framework import tensor_shape_pb2 as tensorflow_dot_core_dot_framework_dot_tensor__shape__pb2
  File "/home/ubuntu/miniconda3/envs/carla/lib/python3.8/site-packages/tensorflow/core/framework/tensor_shape_pb2.py", line 36, in <module>
    _descriptor.FieldDescriptor(
  File "/home/ubuntu/miniconda3/envs/carla/lib/python3.8/site-packages/google/protobuf/descriptor.py", line 544, in __new__
    _message.Message._CheckCalledFromGeneratedFile()
TypeError: Descriptors cannot be created directly.
If this call came from a _pb2.py file, your generated code is out of date and must be regenerated with protoc >= 3.19.0.
If you cannot immediately regenerate your protos, some other possible workarounds are:
 1. Downgrade the protobuf package to 3.20.x or lower.
 2. Set PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python (but this will use pure-Python parsing and will be much slower).

A quick fix is as follows, but I'm unsure whether it's possible to fix this in a better way without fighting with package versions again:

export PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python
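The same workaround can also be applied from inside Python, which avoids having to remember the export in every shell. This is a sketch, not something taken from the repository: it assumes the lines are placed at the very top of the entry script (for example run_SAMOTA.py), before anything imports tensorflow or keras, because the variable has to be set before protobuf is first imported.

```python
import os

# Must run before tensorflow/keras (and therefore protobuf) is first imported.
# setdefault keeps any value that was already exported in the shell.
os.environ.setdefault("PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION", "python")

print(os.environ["PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION"])
```

As the error message itself notes, this switches to pure-Python parsing and will be slower; the other workaround it lists is to downgrade the protobuf package to 3.20.x or lower.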

rsomers1998 commented 8 months ago

4304713d0f3c3e337322019e2e80a3a238ad641b: Updated multiple interfaces of pymoo throughout the code to conform with the version in the requirements.txt