facebookresearch / BenchMARL

A collection of MARL benchmarks based on TorchRL
https://benchmarl.readthedocs.io/
MIT License

MADDPG Config #86

Closed · florin-pop closed this issue 3 months ago

florin-pop commented 3 months ago

Thank you for the amazing work you've put into VMAS and BenchMARL.

I tried and failed to reproduce the results from https://github.com/proroklab/VectorizedMultiAgentSimulator/issues/62 and I am not sure if I am missing some important piece of configuration.

When I run python benchmarl/run.py task=pettingzoo/simple_reference algorithm=mappo, the logs appear to indicate that training works: the reward increases until it reaches a plateau, and the videos show the agents moving.

However, when I try to use MADDPG by running python benchmarl/run.py task=pettingzoo/simple_reference algorithm=maddpg, the training process proceeds, but the agents do not explore the environment or communicate. This is not specific to this task; it happens for simple_spread as well. If I use MADDPG with VMAS/navigation, it works as expected. I experimented with different versions of PettingZoo and BenchMARL, but it didn't seem to make a difference, so I think I may be missing something.

My config is the default:

algorithm_config:
  desc: null
  value:
    delay_value: true
    loss_function: l2
    share_param_critic: true
    use_tanh_mapping: true
algorithm_name:
  desc: null
  value: maddpg
continuous_actions:
  desc: null
  value: true
environment_name:
  desc: null
  value: pettingzoo
experiment_config:
  desc: null
  value:
    adam_eps: 1.0e-06
    checkpoint_interval: 300000.0
    clip_grad_norm: true
    clip_grad_val: 5.0
    create_json: true
    evaluation: true
    evaluation_deterministic_actions: true
    evaluation_episodes: 200
    evaluation_interval: 60000
    exploration_anneal_frames: 1000000
    exploration_eps_end: 0.01
    exploration_eps_init: 0.8
    gamma: 0.9
    hard_target_update_frequency: 5
    loggers:
    - wandb
    lr: 5.0e-05
    max_n_frames: 10000000
    max_n_iters: null
    off_policy_collected_frames_per_batch: 6000
    off_policy_init_random_frames: 0
    off_policy_memory_size: 1000000
    off_policy_n_envs_per_worker: 60
    off_policy_n_optimizer_steps: 1000
    off_policy_train_batch_size: 128
    on_policy_collected_frames_per_batch: 60000
    on_policy_minibatch_size: 4096
    on_policy_n_envs_per_worker: 600
    on_policy_n_minibatch_iters: 45
    polyak_tau: 0.005
    prefer_continuous_actions: true
    render: true
    restore_file: null
    sampling_device: cuda
    save_folder: artifacts
    share_policy_params: false
    soft_target_update: true
    train_device: cuda
model_config:
  desc: null
  value:
    activation_class: torch.nn.modules.activation.Tanh
    activation_kwargs: null
    layer_class: torch.nn.modules.linear.Linear
    norm_class: null
    norm_kwargs: null
    num_cells:
    - 256
    - 256
model_name:
  desc: null
  value: mlp
on_policy:
  desc: null
  value: false
seed:
  desc: null
  value: 0
task_config:
  desc: null
  value:
    continuous_actions: true
    local_ratio: 0.5
    max_cycles: 100
    task: simple_reference_v3
task_name:
  desc: null
  value: simple_reference
matteobettini commented 3 months ago

Hello!

This is likely because it needs some hyperparameter tuning. The VMAS parameters have been tuned for MADDPG in the fine_tuned folder, and that is why you see VMAS working. In particular, here are some points:

florin-pop commented 3 months ago

Thank you for the lightning fast response!

I would definitely prefer to use VMAS/MPE over PettingZoo/MPE because of its speed. The reason I started with PettingZoo is that I wanted a baseline as a starting point for our experiments, and my understanding from reading https://github.com/proroklab/VectorizedMultiAgentSimulator/issues/62 was that you used the VMAS fine-tuned parameters for training agents in PettingZoo as well, so I copied those parameters.

Setting evaluation_deterministic_actions: False solved the problem of not seeing the agents explore the PettingZoo environment. Is this not needed for VMAS/MPE with MADDPG? I noticed that it's True by default.

matteobettini commented 3 months ago

> I would definitely prefer to use VMAS/MPE over PettingZoo/MPE because of its speed. The reason I started with PettingZoo is that I wanted a baseline as a starting point for our experiments, and my understanding from reading proroklab/VectorizedMultiAgentSimulator#62 was that you used the VMAS fine-tuned parameters for training agents in PettingZoo as well, so I copied those parameters.

The VMAS hyperparameters will not work out of the box for PettingZoo; they might require some adjustments. I think in that post I used VMAS.

Also, it is crucial that parameter sharing is OFF here: from your config, you definitely want share_param_critic: false in the algorithm config.
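
(For reference, this can be set straight from the command line with a Hydra-style override, e.g. python benchmarl/run.py task=pettingzoo/simple_reference algorithm=maddpg algorithm.share_param_critic=False experiment.share_policy_params=False, following the override syntax used elsewhere in BenchMARL; these are just the two settings discussed in this thread, not a tuned configuration.)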

> Setting evaluation_deterministic_actions: False solved the problem of not seeing the agents explore the PettingZoo environment. Is this not needed for VMAS/MPE with MADDPG? I noticed that it's True by default.

You generally do not want your agents to explore during evaluation, which is why it is True by default. I suggested turning it off because you asked to see them explore. Note that this has no impact at all on training.
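
For reference, a minimal sketch of how the two settings above could be applied through the Python API (class and field names are taken from the BenchMARL README and the config dump earlier in this thread; treat it as an illustration rather than the exact setup used here):

```python
# Sketch: MADDPG on PettingZoo simple_reference with parameter sharing disabled.
from benchmarl.algorithms import MaddpgConfig
from benchmarl.environments import PettingZooTask
from benchmarl.experiment import Experiment, ExperimentConfig
from benchmarl.models.mlp import MlpConfig

experiment_config = ExperimentConfig.get_from_yaml()
experiment_config.share_policy_params = False  # no parameter sharing across agent policies
# Only affects what you see in the evaluation rollouts/videos, not training:
experiment_config.evaluation_deterministic_actions = False

algorithm_config = MaddpgConfig.get_from_yaml()
algorithm_config.share_param_critic = False  # one critic per agent instead of a shared one

experiment = Experiment(
    task=PettingZooTask.SIMPLE_REFERENCE.get_from_yaml(),
    algorithm_config=algorithm_config,
    model_config=MlpConfig.get_from_yaml(),
    critic_model_config=MlpConfig.get_from_yaml(),
    seed=0,
    config=experiment_config,
)
experiment.run()
```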

florin-pop commented 3 months ago

Thanks Matteo, the agents are learning the task using MADDPG in the VMAS/MPE environment with the following config:


Loaded config:

experiment:
  sampling_device: cuda
  train_device: cuda
  share_policy_params: false
  prefer_continuous_actions: true
  gamma: 0.9
  lr: 5.0e-05
  adam_eps: 1.0e-06
  clip_grad_norm: true
  clip_grad_val: 5.0
  soft_target_update: true
  polyak_tau: 0.005
  hard_target_update_frequency: 5
  exploration_eps_init: 0.8
  exploration_eps_end: 0.01
  exploration_anneal_frames: 1000000
  max_n_iters: null
  max_n_frames: 10000000
  on_policy_collected_frames_per_batch: 60000
  on_policy_n_envs_per_worker: 600
  on_policy_n_minibatch_iters: 45
  on_policy_minibatch_size: 4096
  off_policy_collected_frames_per_batch: 6000
  off_policy_n_envs_per_worker: 60
  off_policy_n_optimizer_steps: 1000
  off_policy_train_batch_size: 128
  off_policy_memory_size: 1000000
  off_policy_init_random_frames: 0
  evaluation: true
  render: true
  evaluation_interval: 120000
  evaluation_episodes: 200
  evaluation_deterministic_actions: true
  loggers:
  - wandb
  create_json: true
  save_folder: artifacts
  restore_file: null
  checkpoint_interval: 300000.0
algorithm:
  share_param_critic: true
  loss_function: l2
  delay_value: true
  use_tanh_mapping: true
task:
  max_steps: 100
model:
  name: mlp
  num_cells:
  - 256
  - 256
  layer_class: torch.nn.Linear
  activation_class: torch.nn.Tanh
  activation_kwargs: null
  norm_class: null
  norm_kwargs: null
critic_model:
  name: mlp
  num_cells:
  - 256
  - 256
  layer_class: torch.nn.Linear
  activation_class: torch.nn.Tanh
  activation_kwargs: null
  norm_class: null
  norm_kwargs: null
seed: 0
maddpg_config:
  share_param_critic: false
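
For reference, a config like this should also be reachable from the command line with Hydra-style overrides, e.g. python benchmarl/run.py task=vmas/<task_name> algorithm=maddpg algorithm.share_param_critic=False experiment.share_policy_params=False experiment.evaluation_interval=120000 (the VMAS task key is not shown in this dump; the available names should correspond to the YAML files under benchmarl/conf/task/vmas/).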

https://github.com/facebookresearch/BenchMARL/assets/13623704/dd37abc6-3505-410f-80d9-34840dcb6fa0

The issue can be marked as resolved.
matteobettini commented 3 months ago

Lovely! Feel free to reach out in case you face more issues!