facebookresearch / BenchMARL

A collection of MARL benchmarks based on TorchRL
https://benchmarl.readthedocs.io/
MIT License

> Benchmarl automatically makes a video. #106

Closed armansouri9 closed 3 weeks ago

armansouri9 commented 4 weeks ago
          > Benchmarl automatically makes a video.

In particular you might want to set these parameters

https://github.com/facebookresearch/BenchMARL/blob/a9309159d6d46d099bd3d395ef1c80a5227b007e/benchmarl/conf/experiment/base_experiment.yaml#L78-L80

and loggers=[wandb]

Hello, good day. I made the changes, but the video is not being created during execution. Please advise. Thank you.

Originally posted by @armansouri9 in https://github.com/facebookresearch/BenchMARL/issues/104#issuecomment-2214131776
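For reference, the same settings can also be applied from Python. The sketch below assumes the Experiment/ExperimentConfig API shown in the BenchMARL README, where ExperimentConfig mirrors the keys of base_experiment.yaml; treat it as a starting point rather than the exact fix for this issue.

```python
from benchmarl.algorithms import MappoConfig
from benchmarl.environments import VmasTask
from benchmarl.experiment import Experiment, ExperimentConfig
from benchmarl.models.mlp import MlpConfig

# Load the defaults from base_experiment.yaml and turn on evaluation + rendering.
experiment_config = ExperimentConfig.get_from_yaml()
experiment_config.evaluation = True      # run evaluation rollouts
experiment_config.render = True          # render them so a video can be logged
experiment_config.loggers = ["wandb"]    # the video shows up on wandb as eval/video
# Optionally lower the evaluation interval (default 120_000 frames; it must be a
# multiple of the collected frames per batch) so videos appear sooner.
experiment_config.evaluation_interval = 6_000

experiment = Experiment(
    task=VmasTask.BALANCE.get_from_yaml(),
    algorithm_config=MappoConfig.get_from_yaml(),
    model_config=MlpConfig.get_from_yaml(),
    critic_model_config=MlpConfig.get_from_yaml(),
    seed=0,
    config=experiment_config,
)
experiment.run()
```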

matteobettini commented 4 weeks ago

Hello,

Can you send more details about what environment you are using? Is the script running fine? Are other evaluation metrics reported?

majid5776 commented 3 weeks ago

Hi, I have the same issue. When I run this script it doesn't make a video with wandb: python benchmarl/run.py -m task=vmas/discovery algorithm=mappo experiment.max_n_iters=11 experiment.on_policy_collected_frames_per_batch=100 experiment.checkpoint_interval=100 "experiment.loggers=[wandb]" model.activation_class="torch.nn.ReLU". Thank you.

matteobettini commented 3 weeks ago

Could you send the full config printed by hydra when the script starts?

The important thing to note here is to fully read and understand all the evaluation parameters in the experiment config.

Also, I think the value is NaN because your number of collected frames is less than max_steps (the episode length).

matteobettini commented 3 weeks ago

Note that the default evaluation interval is 120_000 frames, so if you collect 100 frames per batch you will only be served a video after 1,200 iterations.
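To make the arithmetic explicit (a small sketch; the variable names are illustrative, the values come from the command and the default config):

```python
# How many iterations pass before the first evaluation (and hence the first video).
evaluation_interval = 120_000        # default in base_experiment.yaml (in collected frames)
collected_frames_per_batch = 100     # experiment.on_policy_collected_frames_per_batch from the command

iters_until_first_eval = evaluation_interval // collected_frames_per_batch
print(iters_until_first_eval)        # 1200, far beyond max_n_iters=11, so no video is ever logged
```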

majid5776 commented 3 weeks ago

These are my configs: part 1, part 2 (attached screenshots)

matteobettini commented 3 weeks ago

> Note that the default evaluation interval is 120_000 frames, so if you collect 100 frames per batch you will only be served a video after 1,200 iterations.

I can confirm that this is the reason why no evaluation is being run.

Also, the reason for the NaNs is that you are collecting 100 frames per batch with 10 workers, which means you are getting 10 frames per worker at each iteration. Since max_steps is 100, you should see non-NaN rewards after roughly 10 iterations.
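Worked out with the same numbers (again a sketch with illustrative names):

```python
# Why the reported episode reward is NaN at first: no episode has terminated yet,
# so there is no completed-episode reward to log.
collected_frames_per_batch = 100   # frames collected per iteration
n_envs_per_worker = 10             # vectorized envs used for collection
max_steps = 100                    # episode length of the task

frames_per_env_per_iter = collected_frames_per_batch // n_envs_per_worker   # 10
iters_until_first_episode_ends = max_steps // frames_per_env_per_iter       # ~10 iterations
```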

armansouri9 commented 3 weeks ago

This is my config:

base_experiment.yaml:

```yaml
defaults:

# The device for collection (e.g. cuda)
sampling_device: "cpu"
# The device for training (e.g. cuda)
train_device: "cpu"
# The device for the replay buffer of off-policy algorithms (e.g. cuda)
buffer_device: "cpu"

# Whether to share the parameters of the policy within agent groups
share_policy_params: True
# If an algorithm and an env support both continuous and discrete actions, what should be preferred
prefer_continuous_actions: True
# If False collection is done using a collector (under no grad). If True, collection is done with gradients.
collect_with_grad: False

# Discount factor
gamma: 0.9
# Learning rate
lr: 0.00005
# The epsilon parameter of the adam optimizer
adam_eps: 0.000001
# Clips grad norm if true and clips grad value if false
clip_grad_norm: True
# The value for the clipping, if null no clipping
clip_grad_val: 5

# Whether to use soft or hard target updates
soft_target_update: True
# If soft_target_update is True, this is its polyak_tau
polyak_tau: 0.005
# If soft_target_update is False, this is the frequency of the hard target updates in terms of n_optimizer_steps
hard_target_update_frequency: 5

# When an exploration wrapper is used. This is its initial epsilon for annealing
exploration_eps_init: 0.8
# When an exploration wrapper is used. This is its final epsilon after annealing
exploration_eps_end: 0.01
# Number of frames for annealing of exploration strategy in deterministic policy algorithms
# If null it will default to max_n_frames / 3
exploration_anneal_frames: null

# The maximum number of experiment iterations before the experiment terminates, exclusive with max_n_frames
max_n_iters: 10
# Number of collected frames before ending, exclusive with max_n_iters
max_n_frames: 3_000_000

# Number of frames collected at each experiment iteration
on_policy_collected_frames_per_batch: 6000
# Number of environments used for collection
# If the environment is vectorized, this will be the number of batched environments.
# Otherwise batching will be simulated and each env will be run sequentially.
on_policy_n_envs_per_worker: 10
# This is the number of times collected_frames_per_batch will be split into minibatches and trained
on_policy_n_minibatch_iters: 45
# In on-policy algorithms the train_batch_size will be equal to the on_policy_collected_frames_per_batch
# and it will be split into minibatches with this number of frames for training
on_policy_minibatch_size: 400

# Number of frames collected at each experiment iteration
off_policy_collected_frames_per_batch: 6000
# Number of environments used for collection
# If the environment is vectorized, this will be the number of batched environments.
# Otherwise batching will be simulated and each env will be run sequentially.
off_policy_n_envs_per_worker: 10
# This is the number of times off_policy_train_batch_size will be sampled from the buffer and trained over.
off_policy_n_optimizer_steps: 1000
# Number of frames used for each off_policy_n_optimizer_steps when training off-policy algorithms
off_policy_train_batch_size: 128
# Maximum number of frames to keep in replay buffer memory for off-policy algorithms
off_policy_memory_size: 1_000_000
# Number of random action frames to prefill the replay buffer with
off_policy_init_random_frames: 0

evaluation: True
# Whether to render the evaluation (if rendering is available)
render: True
# Frequency of evaluation in terms of collected frames (this should be a multiple of on/off_policy_collected_frames_per_batch)
evaluation_interval: 120_000
# Number of episodes that evaluation is run on
evaluation_episodes: 2
# If True, when stochastic policies are evaluated, their mode is taken, otherwise, if False, they are sampled
evaluation_deterministic_actions: True

# List of loggers to use, options are: wandb, csv, tensorboard, mflow
loggers: [wandb]
# Create a json folder as part of the output in the format of marl-eval
create_json: True

# Absolute path to the folder where the experiment will log.
# If null, this will default to the hydra output dir (if using hydra) or to the current folder when the script is run (if not).
save_folder: null
# Absolute path to a checkpoint file where the experiment was saved. If null the experiment is started fresh.
restore_file: null
# Interval for experiment saving in terms of collected frames (this should be a multiple of on/off_policy_collected_frames_per_batch).
# Set it to 0 to disable checkpointing
checkpoint_interval: 0
# Whether to checkpoint when the experiment is done
checkpoint_at_end: True
# How many checkpoints to keep. As new checkpoints are taken, temporally older checkpoints are deleted to keep this number of
# checkpoints. The checkpoint at the end is included in this number. Set to null to keep all checkpoints.
keep_checkpoints_num: 3
```


My command:

python benchmarl/run.py algorithm=mappo task=vmas/balance "experiment.loggers=[wandb]"


But the video is not created.

matteobettini commented 3 weeks ago

In your case, after 3 iterations you should see panels named eval on wandb. One of them is eval/video.

Do you see them?

majid5776 commented 3 weeks ago

I ran a simple script to test making a video on balance with the default configuration: python run.py -m task=vmas/balance algorithm=mappo, but after 19 iterations I got this error: part 1, part 2 (attached screenshots)

armansouri9 commented 3 weeks ago

(attached screenshot)

> In your case, after 3 iterations you should see panels named eval on wandb. One of them is eval/video.
>
> Do you see them?

No, such a folder is not visible.

matteobettini commented 3 weeks ago

> In your case, after 3 iterations you should see panels named eval on wandb. One of them is eval/video. Do you see them?
>
> No, such a folder is not visible.

It is not a folder, it is a panel on the wandb interface: https://wandb.ai/home

If you want a local video you should use experiment.loggers=[csv].
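A sketch of the local-video variant (same assumptions about the Python API as in the earlier snippet; with the csv logger the rendered evaluation video is saved under the experiment's output folder rather than uploaded to wandb):

```python
from benchmarl.experiment import ExperimentConfig

# Same run, but logging locally: the evaluation video ends up on disk under the
# experiment output folder (the hydra output dir unless save_folder is set).
experiment_config = ExperimentConfig.get_from_yaml()
experiment_config.loggers = ["csv"]
experiment_config.evaluation = True
experiment_config.render = True
# ...build the Experiment exactly as before and call experiment.run()
```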

matteobettini commented 3 weeks ago

> I ran a simple script to test making a video on balance with the default configuration: python run.py -m task=vmas/balance algorithm=mappo, but after 19 iterations I got this error: part 1, part 2 (attached screenshots)

It seems wandb does not have the privileges to symlink the JSON files.

armansouri9 commented 3 weeks ago

Thank you for your follow-up. After 10 episodes, it didn't work.

command: python run.py algorithm=mappo task=vmas/balance

result: (attached screenshot)

config:

(The same base_experiment.yaml as posted above, with the only change being loggers: [csv].)

matteobettini commented 3 weeks ago

Change evaluation_interval to 6000
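The arithmetic behind this suggestion, using the numbers from the posted config (illustrative sketch):

```python
# With evaluation_interval=120_000 and 6_000 frames per batch, evaluation would
# only trigger after 20 iterations, but the run stops at max_n_iters=10.
max_n_iters = 10
on_policy_collected_frames_per_batch = 6_000
total_frames = max_n_iters * on_policy_collected_frames_per_batch   # 60_000 < 120_000 -> never evaluated

# Setting the interval to one batch makes evaluation (and the video) run every iteration.
evaluation_interval = 6_000
iters_between_evals = evaluation_interval // on_policy_collected_frames_per_batch  # 1
```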

armansouri9 commented 3 weeks ago

Done, thanks for your tips.