T3AS / MAD-ARL

Python project for the paper "Adversarial Deep Reinforcement Learning for Improving the Robustness of Multi-agent Autonomous Driving Policies".
https://aizazsharif.github.io/MAD-ARL/
GNU General Public License v3.0

Run code issue, GPU NEVER USED! #3

Closed Kinvy66 closed 9 months ago

Kinvy66 commented 9 months ago

My computer has a GPU (Nvidia A4000). Here is the nvidia-smi command output:

Fri Dec 22 10:57:27 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.147.05   Driver Version: 525.147.05   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA RTX A4000    Off  | 00000000:65:00.0  On |                  Off |
| 43%   54C    P8    21W / 140W |   4379MiB / 16376MiB |      1%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

When I run step_1_xxxx.py, the following log is printed to the console:

/home/dell/miniconda3/envs/MAD-ARL/bin/python /home/dell/qqw/repo/MAD-ARL/examples/step_1_training_victims.py 
2023-12-22 10:41:52.803744: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer.so.6'; dlerror: libnvinfer.so.6: cannot open shared object file: No such file or directory
2023-12-22 10:41:52.803833: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer_plugin.so.6'; dlerror: libnvinfer_plugin.so.6: cannot open shared object file: No such file or directory
2023-12-22 10:41:52.803843: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:30] Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
/home/dell/carla_out --------------------------------------------
2023-12-22 10:41:54,905 INFO resource_spec.py:212 -- Starting Ray with 28.42 GiB memory available for workers and up to 9.31 GiB for objects. You can adjust these settings with ray.init(memory=<bytes>, object_store_memory=<bytes>).
2023-12-22 10:41:55,380 INFO services.py:1148 -- View the Ray dashboard at localhost:8265
2023-12-22 10:41:55,579 WARNING sample.py:27 -- DeprecationWarning: wrapping <function <lambda> at 0x7f348cb20c80> with tune.function() is no longer needed
== Status ==
Memory usage on this node: 22.2/62.5 GiB
Using FIFO scheduling algorithm.
Resources requested: 17/20 CPUs, 0/1 GPUs, 0.0/28.42 GiB heap, 0.0/6.4 GiB objects
Result logdir: /home/dell/ray_results/MA-Inde-PPO-SSUI3CCARLA
Number of trials: 1 (1 RUNNING)
+--------------------------------------------+----------+-------+
| Trial name                                 | status   | loc   |
|--------------------------------------------+----------+-------|
| PPO_HomoNcomIndePOIntrxMASS3CTWN3-v0_00000 | RUNNING  |       |
+--------------------------------------------+----------+-------+

No matter how I adjust the number of GPUs in the program (around line 46, --num-gpus), the log always shows 0, although the number of CPUs can be adjusted. Additionally, I have tried different versions of CARLA, 0.9.4 and 0.9.14 respectively.

So what changes do I need to make to run with a GPU?

Kinvy66 commented 9 months ago

"After careful debugging, I believe I've pinpointed where the issue lies, but I'm uncertain about how to specifically resolve it. Firstly, my operating environment is as follows:

  1. I separately downloaded the code for MAD-ARL and macad-gym.
  2. Running the examples/basic_agent.py provided by the macad-gym project works without issues.
  3. Running MAD-ARL project's examples/step_1_xxx.py results in console output similar to what was mentioned earlier. Initially, I assumed the log output indicated the program was running normally. However, only after debugging did I discover that it wasn't running and was stuck in an infinite loop at a specific part of the code.
  4. To further investigate, I conducted step-by-step debugging. Instead of using MAD-ARL's examples/step_1_xxx.py, I copied the example examples/basic_agent.py from the macad-gym project to the MAD-ARL project's examples folder. This was done to simplify the setup, suspecting a Carla-related issue. To avoid other influencing factors, I used the simpler basic_agent.py. Additionally, to better identify the problem, I modified the environment configuration in MAD-ARL's stop_sign_3c_town03.py, setting 'render' to True.
  5. With the setup in point 4, when running basic_agent.py, after the Carla server window opens, the program gets stuck in an infinite loop during the client connection phase:
    #  basic_agent.py:env.reset()->src/macad-gym/carla/multi_env.py:reset()->_init_server();
    # Start client
        self._client = None
        while self._client is None:
            try:
                self._client = carla.Client("localhost", self._server_port)
                self._client.set_timeout(2.0)
                self._client.get_server_version()     # error
            except RuntimeError as re:
                if "timeout" not in str(re) and "time-out" not in str(re):
                    print("Could not connect to Carla server because:", re)
                self._client = None
        self._client.set_timeout(60.0)
    # 're' value: RuntimeError('time-out of 2000ms while waiting for the simulator, make sure the simulator is ready and connected to localhost:33809',)

    The 're' value suggests the possible error that the Carla server might not have started or there's a connection port error. However, the Carla server interface has already opened, and I've confirmed the port. Hence, the probable cause might be related to startup parameters.

  6. I compared the original code of macad-gym with src/macad-gym in MAD-ARL and found differences. It seems you are not using the latest version of macad-gym, or perhaps you've made modifications.
  7. Copying step_1_xxx.py and macad_agent to the macad-gym project's example folder still doesn't run. The output remains the same as shown:
    == Status ==
    Memory usage on this node: 22.2/62.5 GiB
    Using FIFO scheduling algorithm.
    Resources requested: 17/20 CPUs, 0/1 GPUs, 0.0/28.42 GiB heap, 0.0/6.4 GiB objects
    Result logdir: /home/dell/ray_results/MA-Inde-PPO-SSUI3CCARLA
    Number of trials: 1 (1 RUNNING)
    +--------------------------------------------+----------+-------+
    | Trial name                                 | status   | loc   |
    |--------------------------------------------+----------+-------|
    | PPO_HomoNcomIndePOIntrxMASS3CTWN3-v0_00000 | RUNNING  |       |
    +--------------------------------------------+----------+-------+

    Moreover, the issue is not stuck at the client connection. I'd like to inquire, in which Python file does the aforementioned output occur?"
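The client connection loop in point 5 retries forever, which is why a server that never answers looks like a silent hang. Below is a minimal sketch of a bounded version, assuming the connection attempt is wrapped in a zero-argument callable. The names `connect_with_deadline`, `connect`, `deadline_s` and `retry_delay_s` are hypothetical and not part of macad-gym or CARLA.

```python
import time


def connect_with_deadline(connect, deadline_s=60.0, retry_delay_s=1.0):
    """Call `connect()` until it succeeds or `deadline_s` seconds elapse.

    `connect` is a hypothetical zero-argument callable standing in for the
    carla.Client(...) / get_server_version() pair in the loop above.
    """
    start = time.monotonic()
    attempts = 0
    while True:
        attempts += 1
        try:
            return connect()
        except RuntimeError as err:
            elapsed = time.monotonic() - start
            if elapsed >= deadline_s:
                # Surface the last error instead of looping silently.
                raise TimeoutError(
                    "gave up after {} attempts in {:.1f}s; last error: {}".format(
                        attempts, elapsed, err)) from err
            time.sleep(retry_delay_s)
```

With a deadline in place, a wrong port or wrong server start-up arguments would show up as a `TimeoutError` carrying the last `RuntimeError` message, rather than as a process that appears to be running.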

AizazSharif commented 9 months ago

Hi @Kinvy66,

Sorry for the delayed response. You are right about the GPU issue; I also did some debugging on my end today.

1. It comes down to compatible versions of every tool used. On my old system I had a different (and older) CUDA version; after I updated it, MAD-ARL with GPU stopped working for me as well. I realized that the current MAD-ARL project uses Python 3.6.13. Once I changed the Python version to 3.8.0 and TensorFlow to 2.10.0, the GPU was picked up again (an RTX 3060 in my case, with CUDA 11.8 and NVIDIA driver 520.56).

I cross-checked with the following commands:

    import tensorflow as tf
    print("Num GPUs Available: ", len(tf.config.list_physical_devices('GPU')))

Therefore, I would first suggest checking whether your system requires a higher version of Python and TensorFlow.

2. I assume you are also unable to run CARLA. I am unable to run examples/basic_agent.py since I have significantly modified the library versions and scripts to make it run. I would suggest that you:
  a. Make sure you follow the installation steps for CARLA and the other libraries in the same sequence as in the MAD-ARL Readme.
  b. Make sure you have $CARLA_SERVER in ~/.bashrc

A suggestion: in case you create another folder/repository of MAD-ARL or macad-gym, please run the following three commands in sequence.

    pip install -e .
    pip install --upgrade pip
    pip install -e .

This will make sure that the macad_gym/carla Python files are synced to your new folder and not to some other folder you have used before on your system.
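One way to confirm which folder an installed package actually resolves to is to ask Python where it would import the module from. This is a small sketch using only the standard library; the helper name `module_origin` is hypothetical, and `macad_gym` is assumed to be the import name of the package discussed above.

```python
import importlib.util


def module_origin(name):
    """Return the file a top-level module would be imported from, or None."""
    spec = importlib.util.find_spec(name)
    return None if spec is None else spec.origin


if __name__ == "__main__":
    # Prints None if macad_gym is not installed in the active environment.
    print("macad_gym resolves to:", module_origin("macad_gym"))
```

If the printed path points into an older checkout, the editable install is still linked to that folder and the three commands above need to be rerun from the new one.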

Kinvy66 commented 9 months ago

Hi @AizazSharif

Following the solution you provided, I reinstalled the environment:

  1. installed the virtual environment according to the MAD-ARL Readme and set the carla environment variable $CARLA_SERVER properly
  2. installed macad-gym in the MAD-ARL working directory with the commands
    pip install --e .
    pip install --upgrade pip
    pip install --upgrade pip
  3. Used conda install python=3.8 to upgrade the Python version, but some packages, such as numpy and mkl_random, depend on specific Python versions, and the versions specified in conda_env.yml are not compatible with Python versions higher than 3.6.7.

Can you post the conda_env.yml for the virtual environment (python 3.8) that you can run properly now?

AizazSharif commented 9 months ago

Hi @Kinvy66,

  1. A correction for running the commands in sequence

    pip install -e .
    pip install --upgrade pip
    pip install -e .

  2. I do not have a specific conda_env.yml yet for the temporary python==3.8 environment. I can instead share specific versions that you might need.

    numpy == 1.19.5 OR 1.23.4
    gym == 0.12.1 OR 0.17.0 OR 0.15.3
    tensorflow == 2.10.0
    pip install tensorflow-gpu==2.1.0
    ray == 0.8.4
    ray[rllib] == 0.8.4
    ray[tune] == 0.8.4
    tf-slim == 1.1.0
    mkl-random == 1.0.1
    Pillow == 8.4.0 OR 10.0.0

I was able to run the code by trying libraries with versions mentioned above. I have mentioned more versions of numpy, gym etc. to avoid random bugs.
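Comparing an environment against the versions listed above can be scripted. The following is a rough sketch, assuming Python 3.8 or newer (for `importlib.metadata`); the names `PINS`, `installed_version` and `check_pins` are hypothetical, and the pins are simply copied from the list above with all alternatives kept.

```python
from importlib import metadata  # standard library since Python 3.8

# Pins copied from the list above; alternatives are kept as a set.
PINS = {
    "numpy": {"1.19.5", "1.23.4"},
    "gym": {"0.12.1", "0.17.0", "0.15.3"},
    "tensorflow": {"2.10.0"},
    "tensorflow-gpu": {"2.1.0"},
    "ray": {"0.8.4"},
    "tf-slim": {"1.1.0"},
    "mkl-random": {"1.0.1"},
    "Pillow": {"8.4.0", "10.0.0"},
}


def installed_version(name):
    """Return the installed version of a distribution, or None if absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def check_pins(pins, lookup=installed_version):
    """Return {package: (found, allowed)} for every package that is off-pin."""
    mismatches = {}
    for name, allowed in pins.items():
        found = lookup(name)
        if found not in allowed:
            mismatches[name] = (found, sorted(allowed))
    return mismatches


if __name__ == "__main__":
    for name, (found, allowed) in check_pins(PINS).items():
        print("{}: found {}, expected one of {}".format(name, found, allowed))
```

A package reported as `found None` is simply not installed in the active environment.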

Kinvy66 commented 9 months ago

Hi @AizazSharif

Thank you for your patience, and Happy New Year 2024. I installed the package versions you gave me and still have problems. To first rule out problems caused by the environment versions, could you run the commands below step by step and send the output?

  1. Activate the virtual environment you are using

    conda activate MAD-ARL
  2. Export the packages installed with pip. This command lists all the packages installed with pip and their versions.

    pip list
  3. Export all packages installed using conda, also send the output you get when you run this command.

    conda list
  4. what version of carla are you using?

AizazSharif commented 9 months ago

Hi @Kinvy66,

Happy new year.

I am familiar with the commands. It will not be very useful to share the pip list and conda list output since I have modified my environments with ongoing Ph.D. projects.

Could you share the error you are facing?

Kinvy66 commented 9 months ago
  1. In my previous reply, I mentioned that copying examples/basic_agent.py from macad-gym into the MAD-ARL project results in an infinite loop in the following code.

    # Start client
        self._client = None
        while self._client is None:
            try:
                self._client = carla.Client("localhost", self._server_port)
                self._client.set_timeout(2.0)
                self._client.get_server_version()     # error
            except RuntimeError as re:
                if "timeout" not in str(re) and "time-out" not in str(re):
                    print("Could not connect to Carla server because:", re)
                self._client = None
        self._client.set_timeout(60.0)

    I've solved this problem. It was caused by the parameters of the command that starts CARLA; I made the following changes:

    self._server_process = subprocess.Popen(
                    [
                        SERVER_BINARY,
                        "-windowed",
                        "-ResX=",
                        str(self._env_config["render_x_res"]),
                        "-ResY=",
                        str(self._env_config["render_y_res"]),
                        "-benchmark",
                        "-fps=20",
                        "-carla-server",
                        "-carla-rpc-port={}".format(self._server_port),
                        "-carla-streaming-port=0",
                    ],
                    preexec_fn=os.setsid,
                    stdout=open(log_file, "w"),
                )
  2. I'm still having problems running step1_xxx.py with the versions you provided (https://github.com/T3AS/MAD-ARL/issues/3#issuecomment-1873025635). Here is the terminal output when running it; at some point it gets stuck and does not continue.

    
    /home/kinvy/miniconda3/envs/macad/bin/python /home/kinvy/repo/MAD-ARL/examples/step_1_training_victims.py 
    2024-01-03 20:48:13.875623: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX_VNNI FMA
    To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
    2024-01-03 20:48:13.952682: I tensorflow/core/util/util.cc:169] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
    2024-01-03 20:48:13.974485: E tensorflow/stream_executor/cuda/cuda_blas.cc:2981] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
    2024-01-03 20:48:14.293994: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /home/kinvy/miniconda3/envs/macad/lib/python3.8/site-packages/cv2/../../lib64:
    2024-01-03 20:48:14.294030: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /home/kinvy/miniconda3/envs/macad/lib/python3.8/site-packages/cv2/../../lib64:
    2024-01-03 20:48:14.294033: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
    /home/kinvy/carla_out 
    --------------------------------------------

    /home/kinvy/miniconda3/envs/macad/lib/python3.8/site-packages/gym/logger.py:30: UserWarning: WARN: Box bound precision lowered by casting to float32
      warnings.warn(colorize('%s: %s'%('WARN', msg % args), 'yellow'))
    2024-01-03 20:48:14,873 INFO resource_spec.py:204 -- Starting Ray with 29.69 GiB memory available for workers and up to 9.31 GiB for objects. You can adjust these settings with ray.init(memory=<bytes>, object_store_memory=<bytes>).
    2024-01-03 20:48:15,199 INFO services.py:1146 -- View the Ray dashboard at localhost:8265
    2024-01-03 20:48:15,757 WARNING sample.py:25 -- DeprecationWarning: wrapping <function <lambda> at 0x7f306e75d040> with tune.function() is no longer needed
    2024-01-03 20:48:15,781 ERROR logger.py:193 -- pip install 'ray[tune]' to see TensorBoard files.
    2024-01-03 20:48:15,781 WARNING logger.py:307 -- Could not instantiate TBXLogger: cannot import name 'builder' from 'google.protobuf.internal' (/home/kinvy/miniconda3/envs/macad/lib/python3.8/site-packages/google/protobuf/internal/__init__.py).
    == Status ==
    Memory usage on this node: 21.5/62.6 GiB
    Using FIFO scheduling algorithm.
    Resources requested: 2/20 CPUs, 0/1 GPUs, 0.0/29.69 GiB heap, 0.0/6.4 GiB objects
    Result logdir: /home/kinvy/ray_results/MA-Inde-PPO-SSUI3CCARLA
    Number of trials: 1 (1 RUNNING)
    +--------------------------------------------+----------+-------+
    | Trial name                                 | status   | loc   |
    |--------------------------------------------+----------+-------|
    | PPO_HomoNcomIndePOIntrxMASS3CTWN3-v0_00000 | RUNNING  |       |
    +--------------------------------------------+----------+-------+


Below are the versions of the packages I installed. Some packages could not be installed at the versions you provided (not found) or had conflicts, so I adjusted some of them.
```bash
Package                      Version      Editable project location
---------------------------- ------------ ----------------------------
absl-py                      2.0.0
aiohttp                      3.9.1
aiosignal                    1.3.1
astunparse                   1.6.3
async-timeout                4.0.3
atari-py                     0.2.9
attrs                        23.2.0
beautifulsoup4               4.12.2
cachetools                   4.2.4
carla                        0.9.13
certifi                      2023.11.17
charset-normalizer           3.3.2
click                        8.1.7
cloudpickle                  1.3.0
colorama                     0.4.6
dm-tree                      0.1.8
dpcpp-cpp-rt                 2024.0.2
filelock                     3.13.1
flatbuffers                  23.5.26
frozenlist                   1.4.1
future                       0.18.3
gast                         0.3.3
google                       3.0.0
google-auth                  1.35.0
google-auth-oauthlib         0.4.6
google-pasta                 0.2.0
GPUtil                       1.4.0
grpcio                       1.32.0
gym                          0.17.0
h5py                         2.10.0
idna                         3.6
importlib-metadata           7.0.1
importlib-resources          6.1.1
intel-cmplr-lib-rt           2024.0.2
intel-cmplr-lic-rt           2024.0.2
intel-opencl-rt              2024.0.2
intel-openmp                 2024.0.2
jsonschema                   4.20.0
jsonschema-specifications    2023.12.1
keras                        2.10.0
Keras-Preprocessing          1.1.2
libclang                     16.0.6
lz4                          4.3.3
macad-gym                    0.1.3        /home/kinvy/repo/MAD-ARL/src
Markdown                     3.5.1
MarkupSafe                   2.1.3
mkl                          2024.0.0
mkl-random                   1.2.2
multidict                    6.0.4
networkx                     3.1
numpy                        1.23.4
oauthlib                     3.2.2
opencv-contrib-python        4.9.0.80
opencv-python                4.5.5.64
opencv-python-headless       4.9.0.80
opt-einsum                   3.3.0
packaging                    23.2
pandas                       2.0.3
Pillow                       8.4.0
pip                          23.3.1
pkgutil_resolve_name         1.3.10
protobuf                     3.19.5
py-spy                       0.3.14
pyasn1                       0.5.1
pyasn1-modules               0.3.0
pygame                       2.5.2
pyglet                       1.5.0
python-dateutil              2.8.2
pytz                         2023.3.post1
PyYAML                       6.0.1
ray                          0.8.4
redis                        5.0.1
referencing                  0.32.0
requests                     2.31.0
requests-oauthlib            1.3.1
rpds-py                      0.16.2
rsa                          4.9
scipy                        1.4.1
setuptools                   68.2.2
six                          1.15.0
soupsieve                    2.5
tabulate                     0.9.0
tbb                          2021.11.0
tensorboard                  2.10.1
tensorboard-data-server      0.6.1
tensorboard-plugin-wit       1.8.1
tensorboardX                 2.6.2.2
tensorflow                   2.10.0
tensorflow-estimator         2.10.0
tensorflow-gpu               2.2.0
tensorflow-io-gcs-filesystem 0.34.0
termcolor                    1.1.0
tf-slim                      1.1.0
typing-extensions            3.7.4.3
tzdata                       2023.4
urllib3                      2.1.0
Werkzeug                     3.0.1
wheel                        0.41.2
wrapt                        1.12.1
yarl                         1.9.4
zipp                         3.17.0
```

I did try replacing some of the package versions, but I get an error: NotImplementedError: Cannot convert a symbolic Tensor (car1/cond_1/strided_slice:0) to a numpy array. This error may indicate that you're trying to pass a Tensor to a NumPy call, which is not supported


So MAD-ARL depends on specific package versions to run. If you can run it in your current environment, I sincerely hope you can tell me the package versions you are using (pip list and conda list output). Or, if you have changed your code, could you please push the latest code? If you cannot run the program properly even now, I'll try again myself with different package versions and change your code. Thanks!

AizazSharif commented 9 months ago
  1. Based on the output you have posted above, it seems like your code is running fine. To see the logs of cars loading and training, set log_to_driver=True in the lines below.

    if args.redis_address is not None:
        # num_gpus (& num_cpus) must not be provided when connecting to an
        # existing cluster
        ray.init(redis_address=args.redis_address, object_store_memory=10**10, log_to_driver=True)
    else:
        ray.init(num_gpus=args.num_gpus, object_store_memory=10**10, log_to_driver=True)

This will show you the rest of the output. I turned it off in order to see only the output of iterations/episodic training, but turning it on helps a lot in pinpointing errors.

I have verified this by running https://github.com/Kinvy66/MAD-ARL-v0.1.5/example locally and getting the same output as yours.

  2. Since your code is probably running fine given this output, I suggest running step1_xxx.py again. In case you face "NotImplementedError: Cannot convert a symbolic Tensor (car1/cond_1/strided_slice:0) to a numpy array" again, downgrade numpy to 1.19.5 or similar and check.

  3. In the same logs, I see a Ray Tune warning. Kindly install ray[tune]==0.8.4 to avoid further errors, since it is required to save results for TensorBoard and future analysis.

Kinvy66 commented 9 months ago

Hi @AizazSharif

Thank you for your help. Step1_xxx.py in MAD-ARL now runs normally on my machine. But I have one small question: how do I modify this program to use the GPU?

  1. I have installed the GPU driver and CUDA, and they work normally:
    (macad) ➜  ~ python
    Python 3.8.18 (default, Sep 11 2023, 13:40:15) 
    [GCC 11.2.0] :: Anaconda, Inc. on linux
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import tensorflow as tf
    2024-01-05 15:26:42.759099: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX_VNNI FMA
    # .....
    >>> print("Num GPUs Available: ", len(tf.config.list_physical_devices('GPU')))
    Num GPUs Available:  1
    >>> 

    My GPU information

    
    (macad) ➜  ~ nvidia-smi 
    Fri Jan  5 15:35:07 2024       
    +---------------------------------------------------------------------------------------+
    | NVIDIA-SMI 535.146.02             Driver Version: 535.146.02   CUDA Version: 12.2     |
    |-----------------------------------------+----------------------+----------------------+
    | GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
    |                                         |                      |               MIG M. |
    |=========================================+======================+======================|
    |   0  NVIDIA GeForce RTX 4060 Ti     On  | 00000000:01:00.0  On |                  N/A |
    |  0%   42C    P8              12W / 165W |   6667MiB / 16380MiB |      0%      Default |
    |                                         |                      |                  N/A |
    +-----------------------------------------+----------------------+----------------------+
    (macad) ➜  ~ nvcc -V
    nvcc: NVIDIA (R) Cuda compiler driver
    Copyright (c) 2005-2022 NVIDIA Corporation
    Built on Wed_Sep_21_10:33:58_PDT_2022
    Cuda compilation tools, release 11.8, V11.8.89
    Build cuda_11.8.r11.8/compiler.31833905_0

2. I tried to modify the code:
```python
# step_1_xxx.py
experiment_spec = tune.run_experiments({
            "MA-Inde-PPO-SSUI3CCARLA": {
                "run": "PPO",
                "env": env_name,
                "stop": {
                    "training_iteration": args.num_iters,
                    "timesteps_total": args.num_steps,
                    "episodes_total": 1024,
                },

                "config": {
                    # ....
                },

                 # add config
                "resources_per_trial": {
                    "cpu": 1,
                    "gpu": 1,
                },
                "checkpoint_freq": 5,
                "checkpoint_at_end": True,
            }
        })
```

But when I run it, the following error is reported:


  File "/home/kinvy/repo/MAD-ARL-clone/examples/step_1_training_victims.py", line 413, in <module>
    experiment_spec = tune.run_experiments(experiments=experiment_spec)
  File "/home/kinvy/miniconda3/envs/macad-clone/lib/python3.8/site-packages/ray/tune/tune.py", line 393, in run_experiments
    return run(
  File "/home/kinvy/miniconda3/envs/macad-clone/lib/python3.8/site-packages/ray/tune/tune.py", line 321, in run
    runner.step()
  File "/home/kinvy/miniconda3/envs/macad-clone/lib/python3.8/site-packages/ray/tune/trial_runner.py", line 334, in step
    next_trial = self._get_next_trial()  # blocking
  File "/home/kinvy/miniconda3/envs/macad-clone/lib/python3.8/site-packages/ray/tune/trial_runner.py", line 417, in _get_next_trial
    self._update_trial_queue(blocking=wait_for_trial)
  File "/home/kinvy/miniconda3/envs/macad-clone/lib/python3.8/site-packages/ray/tune/trial_runner.py", line 683, in _update_trial_queue
    trials = self._search_alg.next_trials()
  File "/home/kinvy/miniconda3/envs/macad-clone/lib/python3.8/site-packages/ray/tune/suggest/basic_variant.py", line 73, in next_trials
    trials = list(self._trial_generator)
  File "/home/kinvy/miniconda3/envs/macad-clone/lib/python3.8/site-packages/ray/tune/suggest/basic_variant.py", line 100, in _generate_trials
    yield create_trial_from_spec(
  File "/home/kinvy/miniconda3/envs/macad-clone/lib/python3.8/site-packages/ray/tune/config_parser.py", line 173, in create_trial_from_spec
    return Trial(
  File "/home/kinvy/miniconda3/envs/macad-clone/lib/python3.8/site-packages/ray/tune/trial.py", line 200, in __init__
    raise ValueError(
ValueError: Resources for <class 'ray.rllib.agents.trainer_template.PPO'> have been automatically set to Resources(cpu=1, gpu=0, memory=0, object_store_memory=0, extra_cpu=2, extra_gpu=0, extra_memory=0, extra_object_store_memory=0, custom_resources={}, extra_custom_resources={}) by its `default_resource_request()` method. Please clear the `resources_per_trial` option.
Killing live carla processes set()

AizazSharif commented 9 months ago

hi @Kinvy66,

  1. For the GPU issue, I would suggest to:
    i) Uninstall tensorflow and reinstall TensorFlow-GPU only.
    ii) Upgrade tensorflow/tensorflow-gpu to 2.10.0 or higher, and likewise upgrade ray to 0.9.0.
    iii) If both tensorflow and tensorflow-gpu are installed, they should have the same version in the pip list.

  2. Kindly share the link to the code and highlighted areas where the modification is done. Right now, it's unclear what you're trying to modify.
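Point iii) can be checked mechanically from `pip list` output. Here is a small sketch, assuming the plain-text column format shown earlier in this thread; the helper names `parse_pip_list` and `tensorflow_versions_agree` are hypothetical, and the two sample rows are taken from the package list posted above.

```python
def parse_pip_list(text):
    """Parse `pip list` text into {lowercased package name: version}."""
    versions = {}
    for line in text.splitlines():
        parts = line.split()
        # Skip blank lines, the "Package Version" header and the dashed rule.
        if len(parts) < 2 or parts[0] == "Package" or set(parts[0]) == {"-"}:
            continue
        versions[parts[0].lower()] = parts[1]
    return versions


def tensorflow_versions_agree(versions):
    """True unless both packages are installed with different versions."""
    tf = versions.get("tensorflow")
    tf_gpu = versions.get("tensorflow-gpu")
    return tf is None or tf_gpu is None or tf == tf_gpu


SAMPLE = """\
Package                      Version      Editable project location
---------------------------- ------------ ----------------------------
tensorflow                   2.10.0
tensorflow-gpu               2.2.0
"""

print(tensorflow_versions_agree(parse_pip_list(SAMPLE)))  # prints False
```

For the posted environment the check fails, since tensorflow is at 2.10.0 while tensorflow-gpu is at 2.2.0.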

Kinvy66 commented 9 months ago

  1. I want to ask how to modify the code in Step1_xx.py so that the program uses the GPU. I modified the following three places:

    # 1
    parser.add_argument(
        "--num-gpus", default=1, type=int, help="Number of gpus to use. Default=2")

    # 2
    # This section seems to be dead code and has no practical effect
    experiment_spec = tune.Experiment(
        "multi-carla/" + args.model_arch,
        "PPO",
        stop={"timesteps_since_restore": args.num_steps},
        config=config,
        resources_per_trial={
            "cpu": 1,
            "gpu": 1  # modify gpu number
        })

    # 3
    experiment_spec = tune.run_experiments({
        "MA-Inde-PPO-SSUI3CCARLA": {
            "run": "PPO",
            "env": env_name,
            "stop": {
                "training_iteration": args.num_iters,
                "timesteps_total": args.num_steps,
                "episodes_total": 1024,
            },
            "config": {
                # ....
            },
            # add config
            "resources_per_trial": {
                "cpu": 1,
                "gpu": 1    # modify gpu number
            },
            "checkpoint_freq": 5,
            "checkpoint_at_end": True,
        }
    })

None of the three modifications above has any effect; the GPU is still not used.

2. In addition, you suggest installing only TensorFlow-GPU. The official TF documentation states that the tensorflow-gpu package has been abandoned; the latest TF2 detects the GPU automatically (if the CUDA driver is installed correctly). In my environment, CUDA and TF both work without any problems.

3. The Ray version used by MAD-ARL is 0.8.4 (this version is too old). The GPU may go unused because of an incorrect Ray configuration. I tried to upgrade to Ray 0.9.0, but no such version is available.
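The ValueError quoted earlier in this thread says the trial's resources were set by `default_resource_request()`, which suggests that the GPU count is derived from keys inside "config" rather than from resources_per_trial. The sketch below is a simplified model of that derivation, not RLlib's actual code: the function name `model_resource_request` is hypothetical, the key names (`num_gpus`, `num_workers`, `num_cpus_for_driver`, `num_cpus_per_worker`, `num_gpus_per_worker`) are RLlib config keys, and the default values are assumptions chosen to match the error message. All of it should be checked against the installed Ray version.

```python
def model_resource_request(config):
    """Simplified model (an assumption, not RLlib's code) of how a trainer
    turns its config into the resources Tune reserves for one trial."""
    num_workers = config.get("num_workers", 2)
    return {
        "cpu": config.get("num_cpus_for_driver", 1),
        "gpu": config.get("num_gpus", 0),
        "extra_cpu": config.get("num_cpus_per_worker", 1) * num_workers,
        "extra_gpu": config.get("num_gpus_per_worker", 0) * num_workers,
    }


# With no GPU keys in "config", the request matches the ValueError above:
# cpu=1, gpu=0, extra_cpu=2, extra_gpu=0.
print(model_resource_request({}))

# Hypothetical change: ask for the GPU inside "config" rather than through
# resources_per_trial.
print(model_resource_request({"num_gpus": 1}))
```

Under this reading, `ray.init(num_gpus=...)` only declares how many GPUs the cluster has (the "1" in "0/1 GPUs"), while the number a trial requests would come from `num_gpus` in the trainer config.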
---

In summary, I will no longer delve into MAD-ARL. I will use the latest `ray` and `macad-gym` in my research work. Of course, I will still refer to your work, and thank you for your help. If there is nothing to add, I will close this issue later.