"After careful debugging, I believe I've pinpointed where the issue lies, but I'm uncertain about how to specifically resolve it. Firstly, my operating environment is as follows:
examples/basic_agent.py
provided by the macad-gym project works without issues.examples/step_1_xxx.py
results in console output similar to what was mentioned earlier. Initially, I assumed the log output indicated the program was running normally. However, only after debugging did I discover that it wasn't running and was stuck in an infinite loop at a specific part of the code.examples/step_1_xxx.py
, I copied the example examples/basic_agent.py
from the macad-gym project to the MAD-ARL project's examples
folder. This was done to simplify the setup, suspecting a Carla-related issue. To avoid other influencing factors, I used the simpler basic_agent.py. Additionally, to better identify the problem, I modified the environment configuration in MAD-ARL's stop_sign_3c_town03.py
, setting 'render' to True.basic_agent.py
, after the Carla server window opens, the program gets stuck in an infinite loop during the client connection phase:
# basic_agent.py: env.reset() -> src/macad-gym/carla/multi_env.py: reset() -> _init_server()
# Start client
self._client = None
while self._client is None:
    try:
        self._client = carla.Client("localhost", self._server_port)
        self._client.set_timeout(2.0)
        self._client.get_server_version()  # error
    except RuntimeError as re:
        if "timeout" not in str(re) and "time-out" not in str(re):
            print("Could not connect to Carla server because:", re)
        self._client = None
self._client.set_timeout(60.0)
# 're' value: RuntimeError('time-out of 2000ms while waiting for the simulator, make sure the simulator is ready and connected to localhost:33809',)
The value of `re` suggests that the Carla server might not have started or that the connection port is wrong. However, the Carla server window has already opened, and I've confirmed the port. Hence, the probable cause is likely related to the startup parameters.
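As a side check (a minimal sketch, independent of macad-gym; 33809 is just the port taken from the timeout message above), one can probe whether anything is listening on the RPC port at all:

```python
import socket

def rpc_port_open(host="localhost", port=33809, timeout=2.0):
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

print("CARLA RPC port reachable:", rpc_port_open())
```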
I also looked at `src/macad-gym` in MAD-ARL and found differences from upstream. It seems you are not using the latest version of macad-gym, or perhaps you've made modifications. Copying `step_1_xxx.py` and `macad_agent` to the macad-gym project's examples folder still doesn't run. The output remains the same as shown:
== Status ==
Memory usage on this node: 22.2/62.5 GiB
Using FIFO scheduling algorithm.
Resources requested: 17/20 CPUs, 0/1 GPUs, 0.0/28.42 GiB heap, 0.0/6.4 GiB objects
Result logdir: /home/dell/ray_results/MA-Inde-PPO-SSUI3CCARLA
Number of trials: 1 (1 RUNNING)
+--------------------------------------------+----------+-------+
| Trial name | status | loc |
|--------------------------------------------+----------+-------|
| PPO_HomoNcomIndePOIntrxMASS3CTWN3-v0_00000 | RUNNING | |
+--------------------------------------------+----------+-------+
Moreover, this time the program is not stuck at the client connection. I'd like to ask: in which Python file is the aforementioned output produced?"
Hi @Kinvy66,
Sorry for the delayed response. You are right about the GPU issue, as I also did some debugging on my end today.
1. It comes down to compatible versions of every tool used. On my old system I had a different (and older) CUDA version, but now I have updated it, and therefore MAD-ARL with GPU was also not working for me. I realized that the current MAD-ARL project is using Python 3.6.13. After I changed the Python version to 3.8.0 and TensorFlow to 2.10.0, I was again able to use the GPU (an RTX 3060 in my case, with CUDA 11.8 and NVIDIA driver 520.56).
I cross-checked with the following commands:
import tensorflow as tf
print("Num GPUs Available: ", len(tf.config.list_physical_devices('GPU')))
Therefore, I would first suggest checking whether you need a higher version of Python and TensorFlow for your system requirements (a fuller cross-check sketch follows after these points).
2. I assume you are also unable to run CARLA. I am unable to run `examples/basic_agent.py` since I have significantly modified the library versions and scripts to make it run. I would suggest:
   a. Make sure you follow the CARLA installation steps and other libraries in the same sequence mentioned in the MAD-ARL README.
   b. Make sure you have `$CARLA_SERVER` in `~/.bashrc`.
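As mentioned in point 1, a quick cross-check of the whole stack could look like this (a sketch; `tf.sysconfig.get_build_info()` may be missing on very old TF 2.x builds):

```python
import sys
import tensorflow as tf

print("Python:", sys.version.split()[0])
print("TensorFlow:", tf.__version__)
print("Built with CUDA:", tf.test.is_built_with_cuda())
print("Num GPUs Available:", len(tf.config.list_physical_devices("GPU")))
# On GPU builds of TF 2.x this reports the CUDA version the wheel was built against
print("CUDA in build:", tf.sysconfig.get_build_info().get("cuda_version"))
```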
A suggestion: in case you create another folder/repository of MAD-ARL or macad-gym, please run the following three commands in sequence:
pip install -e .
pip install --upgrade pip
pip install -e .
This will make sure that the macad_gym/carla Python files are synced to your new folder and not to some other folder you have been using before on your system.
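To confirm the editable install really points at the folder you expect, a small check like this (not part of MAD-ARL itself) helps:

```python
import macad_gym

# Should print a path inside the MAD-ARL / macad-gym checkout you are currently
# working on, not a stale copy elsewhere on the system.
print(macad_gym.__file__)
```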
Hi @AizazSharif
Thanks to the solution you provided, I reinstalled the environment:
pip install -e .
pip install --upgrade pip
pip install -e .
I also ran `conda install python=3.8` to upgrade the Python version, but some packages depend on specific Python versions, such as numpy and mkl_random, and the versions specified in `conda_env.yml` are not compatible with Python versions higher than 3.6.7. Can you post the `conda_env.yml` for the virtual environment (Python 3.8) that you can now run properly?
Hi @Kinvy66,
pip install -e .
pip install --upgrade pip
pip install -e .
numpy == 1.19.5 OR 1.23.4
gym == 0.12.1 OR 0.17.0 OR 0.15.3
tensorflow == 2.10.0
pip install tensorflow-gpu==2.1.0
ray == 0.8.4
ray[rllib] == 0.8.4
ray[tune] == 0.8.4
tf-slim == 1.1.0
mkl-random == 1.0.1
Pillow == 8.4.0 OR 10.0.0
I was able to run the code using the library versions listed above. I have listed multiple versions of numpy, gym, etc. to help avoid random bugs.
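If it helps, here is a small sketch (using Python 3.8's standard `importlib.metadata`) to compare the installed versions against the pins above:

```python
from importlib.metadata import version, PackageNotFoundError

pins = ["numpy", "gym", "tensorflow", "tensorflow-gpu", "ray", "tf-slim", "mkl-random", "Pillow"]
for name in pins:
    try:
        print(f"{name}: {version(name)}")
    except PackageNotFoundError:
        print(f"{name}: not installed")
```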
Hi @AizazSharif
Thank you for your patience, and Happy New Year 2024. I installed the package versions you gave me and still have problems. To first rule out problems caused by the environment versions, could you follow the steps below, running each command and sending me the output?
1. Activate the virtual environment you are using: `conda activate MAD-ARL`
2. Export the packages installed with pip: `pip list`. This command lists all the packages installed with pip and their versions.
3. Export all the packages installed with conda: `conda list`. Please also send the output you get when you run this command.
Also, what version of CARLA are you using?
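For reference, both the installed CARLA Python client version and the running simulator version can be printed like this (a sketch; it assumes a simulator is listening on port 2000, adjust as needed):

```python
import carla

client = carla.Client("localhost", 2000)
client.set_timeout(5.0)
print("client version:", client.get_client_version())  # version of the installed carla package
print("server version:", client.get_server_version())  # version of the running simulator
```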
Hi @Kinvy66,
Happy new year.
I am familiar with the commands, but it will not be very useful to share the pip list and conda list output, since I have modified my environments for ongoing Ph.D. projects.
Could you share the error you are facing?
In my previous reply, I mentioned that copying `examples/basic_agent.py` from macad-gym into the MAD-ARL project results in an infinite loop in the following code:
# Start client
self._client = None
while self._client is None:
    try:
        self._client = carla.Client("localhost", self._server_port)
        self._client.set_timeout(2.0)
        self._client.get_server_version()  # error
    except RuntimeError as re:
        if "timeout" not in str(re) and "time-out" not in str(re):
            print("Could not connect to Carla server because:", re)
        self._client = None
self._client.set_timeout(60.0)
I've solved this problem. It was caused by the parameters of the command used to start Carla; I made the following changes:
self._server_process = subprocess.Popen(
[
SERVER_BINARY,
"-windowed",
"-ResX=",
str(self._env_config["render_x_res"]),
"-ResY=",
str(self._env_config["render_y_res"]),
"-benchmark",
"-fps=20",
"-carla-server",
"-carla-rpc-port={}".format(self._server_port),
"-carla-streaming-port=0",
],
preexec_fn=os.setsid,
stdout=open(log_file, "w"),
)
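After changing the launch parameters, a standalone check like the following is useful to confirm the server is actually reachable before going through `env.reset()` (a sketch; replace the port with the value of `self._server_port`):

```python
import carla

port = 2000  # replace with the RPC port chosen in multi_env.py (self._server_port)
client = carla.Client("localhost", port)
client.set_timeout(10.0)  # more forgiving than the 2 s used inside the reset loop
print("server version:", client.get_server_version())
print("current map:", client.get_world().get_map().name)
```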
I'm still having problems running `step1_xxx.py` with the package versions you provided in https://github.com/T3AS/MAD-ARL/issues/3#issuecomment-1873025635. Here's the terminal output when running it; it then gets stuck and won't continue to run.
/home/kinvy/miniconda3/envs/macad/bin/python /home/kinvy/repo/MAD-ARL/examples/step_1_training_victims.py
2024-01-03 20:48:13.875623: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX_VNNI FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-01-03 20:48:13.952682: I tensorflow/core/util/util.cc:169] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2024-01-03 20:48:13.974485: E tensorflow/stream_executor/cuda/cuda_blas.cc:2981] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-01-03 20:48:14.293994: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /home/kinvy/miniconda3/envs/macad/lib/python3.8/site-packages/cv2/../../lib64:
2024-01-03 20:48:14.294030: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /home/kinvy/miniconda3/envs/macad/lib/python3.8/site-packages/cv2/../../lib64:
2024-01-03 20:48:14.294033: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
/home/kinvy/carla_out
--------------------------------------------
/home/kinvy/miniconda3/envs/macad/lib/python3.8/site-packages/gym/logger.py:30: UserWarning: WARN: Box bound precision lowered by casting to float32
warnings.warn(colorize('%s: %s'%('WARN', msg % args), 'yellow'))
2024-01-03 20:48:14,873 INFO resource_spec.py:204 -- Starting Ray with 29.69 GiB memory available for workers and up to 9.31 GiB for objects. You can adjust these settings with ray.init(memory=
Below are the package versions I installed. Some of the versions you provided could not be installed (pip reports they cannot be found) or conflict with other packages, so I made some adjustments:
```bash
Package Version Editable project location
---------------------------- ------------ ----------------------------
absl-py 2.0.0
aiohttp 3.9.1
aiosignal 1.3.1
astunparse 1.6.3
async-timeout 4.0.3
atari-py 0.2.9
attrs 23.2.0
beautifulsoup4 4.12.2
cachetools 4.2.4
carla 0.9.13
certifi 2023.11.17
charset-normalizer 3.3.2
click 8.1.7
cloudpickle 1.3.0
colorama 0.4.6
dm-tree 0.1.8
dpcpp-cpp-rt 2024.0.2
filelock 3.13.1
flatbuffers 23.5.26
frozenlist 1.4.1
future 0.18.3
gast 0.3.3
google 3.0.0
google-auth 1.35.0
google-auth-oauthlib 0.4.6
google-pasta 0.2.0
GPUtil 1.4.0
grpcio 1.32.0
gym 0.17.0
h5py 2.10.0
idna 3.6
importlib-metadata 7.0.1
importlib-resources 6.1.1
intel-cmplr-lib-rt 2024.0.2
intel-cmplr-lic-rt 2024.0.2
intel-opencl-rt 2024.0.2
intel-openmp 2024.0.2
jsonschema 4.20.0
jsonschema-specifications 2023.12.1
keras 2.10.0
Keras-Preprocessing 1.1.2
libclang 16.0.6
lz4 4.3.3
macad-gym 0.1.3 /home/kinvy/repo/MAD-ARL/src
Markdown 3.5.1
MarkupSafe 2.1.3
mkl 2024.0.0
mkl-random 1.2.2
multidict 6.0.4
networkx 3.1
numpy 1.23.4
oauthlib 3.2.2
opencv-contrib-python 4.9.0.80
opencv-python 4.5.5.64
opencv-python-headless 4.9.0.80
opt-einsum 3.3.0
packaging 23.2
pandas 2.0.3
Pillow 8.4.0
pip 23.3.1
pkgutil_resolve_name 1.3.10
protobuf 3.19.5
py-spy 0.3.14
pyasn1 0.5.1
pyasn1-modules 0.3.0
pygame 2.5.2
pyglet 1.5.0
python-dateutil 2.8.2
pytz 2023.3.post1
PyYAML 6.0.1
ray 0.8.4
redis 5.0.1
referencing 0.32.0
requests 2.31.0
requests-oauthlib 1.3.1
rpds-py 0.16.2
rsa 4.9
scipy 1.4.1
setuptools 68.2.2
six 1.15.0
soupsieve 2.5
tabulate 0.9.0
tbb 2021.11.0
tensorboard 2.10.1
tensorboard-data-server 0.6.1
tensorboard-plugin-wit 1.8.1
tensorboardX 2.6.2.2
tensorflow 2.10.0
tensorflow-estimator 2.10.0
tensorflow-gpu 2.2.0
tensorflow-io-gcs-filesystem 0.34.0
termcolor 1.1.0
tf-slim 1.1.0
typing-extensions 3.7.4.3
tzdata 2023.4
urllib3 2.1.0
Werkzeug 3.0.1
wheel 0.41.2
wrapt 1.12.1
yarl 1.9.4
zipp                         3.17.0
```
I did try replacing some of the package versions, but I get an error: NotImplementedError: Cannot convert a symbolic Tensor (car1/cond_1/strided_slice:0) to a numpy array. This error may indicate that you're trying to pass a Tensor to a NumPy call, which is not supported.
So MAD-ARL seems to depend on specific package versions to run. If you can run it in your current environment, I sincerely hope you can tell me the package versions you are using (the pip list and conda list output). Or, if you have changed your code, could you please push the latest version? If you can't run the program properly even now, then I'll try again myself with different package versions and modify your code. Thanks!
if args.redis_address is not None:
    # existing cluster
    ray.init(redis_address=args.redis_address, object_store_memory=10**10, log_to_driver=True)
else:
    ray.init(num_gpus=args.num_gpus, object_store_memory=10**10, log_to_driver=True)
This will show you the rest of the output. I had turned it off in order to only see the output of iterations/episodic training, but turning it on helps a lot in pinpointing errors.
I have verified this by running https://github.com/Kinvy66/MAD-ARL-v0.1.5/example locally and getting the same output as yours.
Since your code is probably running fine with this output, I suggest running step1_xxx.py again. In case you face "NotImplementedError: Cannot convert a symbolic Tensor (car1/cond_1/strided_slice:0) to a numpy array" again, downgrade numpy to 1.19.5 or similar and check.
In the same logs, I see a Ray Tune warning. Kindly install ray[tune]==0.8.4 to avoid further errors, since it is required to save results for TensorBoard and future analysis.
Hi @AizazSharif
Thank you for your help. `step1_xxx.py` in MAD-ARL can now run normally on my machine. But I have a small question: how can I modify this program to use the GPU?
1. TensorFlow can detect the GPU in my environment:
(macad) ➜ ~ python
Python 3.8.18 (default, Sep 11 2023, 13:40:15)
[GCC 11.2.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow as tf
2024-01-05 15:26:42.759099: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX_VNNI FMA
# .....
>>> print("Num GPUs Available: ", len(tf.config.list_physical_devices('GPU')))
Num GPUs Available: 1
>>>
My GPU information
(macad) ➜ ~ nvidia-smi
Fri Jan 5 15:35:07 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.146.02 Driver Version: 535.146.02 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA GeForce RTX 4060 Ti On | 00000000:01:00.0 On | N/A |
| 0% 42C P8 12W / 165W | 6667MiB / 16380MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
(macad) ➜ ~ nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Sep_21_10:33:58_PDT_2022
Cuda compilation tools, release 11.8, V11.8.89
Build cuda_11.8.r11.8/compiler.31833905_0
2. I tried to modify the code:
```python
# step_1_xxx.py
experiment_spec = tune.run_experiments({
"MA-Inde-PPO-SSUI3CCARLA": {
"run": "PPO",
"env": env_name,
"stop": {
"training_iteration": args.num_iters,
"timesteps_total": args.num_steps,
"episodes_total": 1024,
},
"config": {
# ....
},
# add config
"resources_per_trial": {
"cpu": 1,
"gpu": 1,
},
"checkpoint_freq": 5,
"checkpoint_at_end": True,
}
})
```
But when running, the following error is reported:
File "/home/kinvy/repo/MAD-ARL-clone/examples/step_1_training_victims.py", line 413, in <module>
experiment_spec = tune.run_experiments(experiments=experiment_spec)
File "/home/kinvy/miniconda3/envs/macad-clone/lib/python3.8/site-packages/ray/tune/tune.py", line 393, in run_experiments
return run(
File "/home/kinvy/miniconda3/envs/macad-clone/lib/python3.8/site-packages/ray/tune/tune.py", line 321, in run
runner.step()
File "/home/kinvy/miniconda3/envs/macad-clone/lib/python3.8/site-packages/ray/tune/trial_runner.py", line 334, in step
next_trial = self._get_next_trial() # blocking
File "/home/kinvy/miniconda3/envs/macad-clone/lib/python3.8/site-packages/ray/tune/trial_runner.py", line 417, in _get_next_trial
self._update_trial_queue(blocking=wait_for_trial)
File "/home/kinvy/miniconda3/envs/macad-clone/lib/python3.8/site-packages/ray/tune/trial_runner.py", line 683, in _update_trial_queue
trials = self._search_alg.next_trials()
File "/home/kinvy/miniconda3/envs/macad-clone/lib/python3.8/site-packages/ray/tune/suggest/basic_variant.py", line 73, in next_trials
trials = list(self._trial_generator)
File "/home/kinvy/miniconda3/envs/macad-clone/lib/python3.8/site-packages/ray/tune/suggest/basic_variant.py", line 100, in _generate_trials
yield create_trial_from_spec(
File "/home/kinvy/miniconda3/envs/macad-clone/lib/python3.8/site-packages/ray/tune/config_parser.py", line 173, in create_trial_from_spec
return Trial(
File "/home/kinvy/miniconda3/envs/macad-clone/lib/python3.8/site-packages/ray/tune/trial.py", line 200, in __init__
raise ValueError(
ValueError: Resources for <class 'ray.rllib.agents.trainer_template.PPO'> have been automatically set to Resources(cpu=1, gpu=0, memory=0, object_store_memory=0, extra_cpu=2, extra_gpu=0, extra_memory=0, extra_object_store_memory=0, custom_resources={}, extra_custom_resources={}) by its `default_resource_request()` method. Please clear the `resources_per_trial` option.
Killing live carla processes set()
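Reading the error message, Ray seems to derive the trial resources from the trainer `config` via `default_resource_request()`, so perhaps the GPU has to be requested there rather than through `resources_per_trial`. An untested sketch of that change (assuming Ray 0.8.4 behaves this way; `env_name` and `args` come from the script):

```python
from ray import tune

# env_name, args, and the remaining config come from step_1_training_victims.py.
experiment_spec = tune.run_experiments({
    "MA-Inde-PPO-SSUI3CCARLA": {
        "run": "PPO",
        "env": env_name,
        "stop": {
            "training_iteration": args.num_iters,
            "timesteps_total": args.num_steps,
            "episodes_total": 1024,
        },
        "config": {
            # ... existing config ...
            "num_gpus": 1,            # GPU for the trainer (driver) process
            "num_gpus_per_worker": 0,
        },
        # no "resources_per_trial": RLlib fills it in via default_resource_request()
        "checkpoint_freq": 5,
        "checkpoint_at_end": True,
    }
})
```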
hi @Kinvy66,
For the GPU issue, I would suggest:
i) Uninstall tensorflow and reinstall tensorflow-gpu only.
ii) Upgrade tensorflow/tensorflow-gpu to 2.10.0 or higher; same for the Ray version, to 0.9.0.
iii) If both tensorflow and tensorflow-gpu are installed, they should have the same version in the pip list.
Kindly share the link to the code and highlighted areas where the modification is done. Right now, it's unclear what you're trying to modify.
> hi @Kinvy66,
>
> - For the GPU issue, I would suggest to: i) Uninstall tensorflow and reinstall TensorFlow-GPU only. ii) upgrade tensorflow/tensorflow-gpu to 2.10.0 or higher. Same for ray versions to 0.9.0. iii) If both tensorflow and tensorflow-gpu are installed, they should have the same version in the pip list.
> - Kindly share the link to the code and highlighted areas where the modification is done. Right now, it's unclear what you're trying to modify.
1. I still haven't found a modification to `step1_xx.py` that makes the program use the GPU. I modified the following places respectively:
# 1
parser.add_argument(
    "--num-gpus", default=1, type=int, help="Number of gpus to use. Default=2")

# 2
experiment_spec = tune.Experiment(
    "multi-carla/" + args.model_arch,
    "PPO",
    stop={"timesteps_since_restore": args.num_steps},
    config=config,
    resources_per_trial={
        "cpu": 1,
        "gpu": 1  # modify gpu number
    })

# 3
experiment_spec = tune.run_experiments({
    "MA-Inde-PPO-SSUI3CCARLA": {
        "run": "PPO",
        "env": env_name,
        "stop": {
            "training_iteration": args.num_iters,
            "timesteps_total": args.num_steps,
            "episodes_total": 1024,
        },
        "config": {
            # ....
        },
        # add config
        "resources_per_trial": {
            "cpu": 1,
            "gpu": 1  # modify gpu number
        },
        "checkpoint_freq": 5,
        "checkpoint_at_end": True,
    }
})
None of the three modifications above worked; the GPU still cannot be used.
2. In addition, you mentioned installing only TensorFlow-GPU. The official TF documentation states that the separate tensorflow-gpu package has been deprecated; the latest TF2 automatically detects the GPU (if the CUDA driver is installed correctly). In my environment, CUDA and TF both work without any problems (see the verification sketch after this list).
3. The Ray version used by MAD-ARL is 0.8.4, which is very old. The GPU may not be used because of an incorrect Ray configuration. I tried upgrading to Ray 0.9.0, but there is no such release.
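As a quick sanity check that TF not only detects the GPU but actually places ops on it (a minimal sketch, independent of MAD-ARL):

```python
import tensorflow as tf

tf.debugging.set_log_device_placement(True)  # log which device each op runs on
print("GPUs:", tf.config.list_physical_devices("GPU"))

a = tf.random.normal((1024, 1024))
b = tf.random.normal((1024, 1024))
c = tf.matmul(a, b)
print(c.device)  # expected to end with 'GPU:0' when the GPU is actually used
```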
---
In summary, I will not delve further into MAD-ARL. I will use the latest `ray` and `macad-gym` in my research work. Of course, I will still refer to your work, and thank you for your help. If there is nothing to add, I will close this issue later.
My computer has a GPU (NVIDIA RTX A4000); the following is the `nvidia-smi` command output:
Fri Dec 22 10:57:27 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.147.05    Driver Version: 525.147.05    CUDA Version: 12.0   |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA RTX A4000    Off  | 00000000:65:00.0  On |                  Off |
| 43%   54C    P8    21W / 140W |  4379MiB / 16376MiB  |      1%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
When I run `step_1_xxxx.py`, the following log is output in the console. No matter how I adjust the number of GPUs in the program (around line 46, `--num-gpus`), the log always shows 0 GPUs, while the number of CPUs can be adjusted. Additionally, I have tried different versions of CARLA, 0.9.4 and 0.9.14 respectively. So what changes do I need to make to run with a GPU?
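For reference, here is the minimal check I would use (a sketch, assuming the Ray 0.8.x API) to see whether Ray itself registered the GPU after `ray.init`:

```python
import ray

ray.init(num_gpus=1)            # or omit num_gpus and let Ray auto-detect
print(ray.cluster_resources())  # should contain 'GPU': 1.0 if the GPU was registered
ray.shutdown()
```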