kim-mskw commented 1 month ago

Implement the best practices from the multi-agent RL community and Stable-Baselines3 in our algorithm. Further, analyse the similarities between the PettingZoo multi-agent implementation and the current RL implementation of Assume. (https://towardsdatascience.com/multi-agent-deep-reinforcement-learning-in-15-lines-of-code-using-pettingzoo-e0b963c0820b)

mthede commented 6 days ago

Quick Review in Code Bases and Literature

Early Stopping

2. Standard Protocol for Cooperative MARL

Meta-analysis of evaluation methodologies in cooperative MARL, with recommendations for a standardised performance evaluation protocol. Published at NeurIPS 2022. Summary of protocol here.

--> Fixed number of training timesteps and episodes, i.e. no early stopping.

3. BenchMARL (Meta Research)

A framework proposed to address the fragmented community standards and reproducibility issues highlighted by the analysis above; it also includes some competitive environments. Published in the Journal of Machine Learning Research (2024).

BenchMARL is a Multi-Agent Reinforcement Learning (MARL) training library created to enable reproducibility and benchmarking across different MARL algorithms and environments. Its mission is to present a standardized interface that allows easy integration of new algorithms and environments to provide a fair comparison with existing solutions.

--> Early stopping not implemented, but callbacks can be customized.

4. PettingZoo

--> Not implemented.

Summary

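For comparability, and in line with community standards, early stopping is disabled by default (the required number of steps exceeds the number of validation runs, so it can never trigger):
early_stopping_steps = training_episodes / validation_episodes_interval + 1
early_stopping_threshold = 0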

Our current implementation of early stopping is closest to the "no model improvement" callback from SB3 (StopTrainingOnNoModelImprovement). However, suitable default values for steps and threshold are unclear. They should be chosen rather conservatively because the environment and the training are unstable, but early stopping can be useful for experimentation or under time/computational constraints. We would therefore like to keep these options.

Future development: early stopping could be restructured and generalized as a callback.
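A minimal sketch of the summarised behaviour, already written in callback form. The names `EarlyStoppingCallback`, `on_evaluation`, `default_patience` and `train` are hypothetical and not part of Assume or SB3. The stopping rule (stop after `patience` consecutive evaluations without an improvement larger than `threshold`) is an assumption loosely modelled on SB3's `StopTrainingOnNoModelImprovement`, and integer division is assumed for the default.

```python
class EarlyStoppingCallback:
    """Hypothetical callback: stop training when the validation reward has not
    improved by more than `threshold` for `patience` consecutive evaluations."""

    def __init__(self, patience: int, threshold: float = 0.0) -> None:
        self.patience = patience
        self.threshold = threshold
        self.best_reward = float("-inf")
        self.evals_without_improvement = 0

    def on_evaluation(self, reward: float) -> bool:
        """Register one validation result; return False to request a stop."""
        if reward > self.best_reward + self.threshold:
            self.best_reward = reward
            self.evals_without_improvement = 0
        else:
            self.evals_without_improvement += 1
        return self.evals_without_improvement < self.patience


def default_patience(training_episodes: int, validation_episodes_interval: int) -> int:
    """Default from the summary: one more step than there are validation runs,
    so the criterion can never fire (integer division assumed)."""
    return training_episodes // validation_episodes_interval + 1


def train(validation_rewards: list, callbacks: list) -> int:
    """Toy training loop: returns how many evaluations ran before stopping."""
    for n_evals, reward in enumerate(validation_rewards, start=1):
        # Call every callback (no short-circuit), then stop if any asks to.
        keep_going = [cb.on_evaluation(reward) for cb in callbacks]
        if not all(keep_going):
            return n_evals
    return len(validation_rewards)


# Default: 100 episodes, validation every 5 -> 20 evaluations, patience 21.
print(train([1.0] * 20, [EarlyStoppingCallback(default_patience(100, 5))]))  # 20

# Experimental setting: stop after 3 evaluations without improvement.
print(train([1.0, 2.0, 2.0, 2.0, 2.0], [EarlyStoppingCallback(patience=3)]))  # 5
```

Other hypothetical callbacks (e.g. logging or a noise-decay check) could be appended to the same `callbacks` list without touching the loop.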

mthede commented 6 days ago

Action Noise (Decay)

1. Action Noise in Off-Policy Deep Reinforcement Learning: Impact on Exploration and Performance

Published by Jakob Hollenstein et al. (2022)

▶️ Recommendation (among other recommendations on noise type, noise scale and impact factor):

In general ▷ use a scheduler [...] Finally we recommend a scheduled reduction of the action noise impact factor β over the training progress to improve robustness to the action noise configuration.

However, the type of scheduler did not seem to matter: linear and logistic schedulers performed similarly.
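To make the recommendation concrete, a small sketch of the two schedule shapes, applied to the impact factor β as a function of training progress p ∈ [0, 1]. The exact parameterisation used by Hollenstein et al. is not reproduced here; `steepness` and `midpoint` of the logistic curve are illustrative assumptions.

```python
import math


def linear_beta(progress: float, beta_0: float = 1.0) -> float:
    """Linear reduction of the noise impact factor from beta_0 (start) to 0 (end)."""
    progress = min(max(progress, 0.0), 1.0)
    return beta_0 * (1.0 - progress)


def logistic_beta(progress: float, beta_0: float = 1.0,
                  steepness: float = 10.0, midpoint: float = 0.5) -> float:
    """S-shaped reduction: close to beta_0 early, close to 0 late, beta_0/2 at `midpoint`."""
    progress = min(max(progress, 0.0), 1.0)
    return beta_0 / (1.0 + math.exp(steepness * (progress - midpoint)))


for p in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(f"progress={p:.2f}  linear={linear_beta(p):.3f}  logistic={logistic_beta(p):.3f}")
```

Both curves start near β₀ and end near 0; given the similar performance reported, the linear one has fewer knobs to tune.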

2.1 Parameter Space Noise for Exploration

Published by Matthias Plappert et al. (2018)

“add noise directly to the agent’s parameters” instead of adding noise to the actions
“adapting the scale of the parameter space noise over time” with a simple heuristic for the time-varying scale
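A minimal NumPy sketch of the adaptive scaling heuristic, assuming a toy linear-tanh actor in place of a real policy network: the weights are perturbed with Gaussian noise of scale σ, the root-mean-square distance between perturbed and unperturbed actions is measured on a batch of states, and σ is multiplied by α if that distance is at most a target δ and divided by α otherwise (α = 1.01 as in the paper). All names are hypothetical.

```python
import numpy as np


def actor(weights: np.ndarray, states: np.ndarray) -> np.ndarray:
    """Toy deterministic actor: one linear layer + tanh, actions in [-1, 1]."""
    return np.tanh(states @ weights)


def action_distance(weights, perturbed_weights, states) -> float:
    """Root-mean-square difference between unperturbed and perturbed actions."""
    diff = actor(weights, states) - actor(perturbed_weights, states)
    return float(np.sqrt(np.mean(diff ** 2)))


class AdaptiveParameterNoise:
    """Hypothetical helper: Gaussian parameter noise with adaptive scale sigma."""

    def __init__(self, sigma: float = 0.1, target_distance: float = 0.2, alpha: float = 1.01):
        self.sigma = sigma                      # current std of the weight noise
        self.target_distance = target_distance  # delta: desired action-space distance
        self.alpha = alpha                      # multiplicative adaptation factor

    def perturb(self, weights: np.ndarray, rng: np.random.Generator) -> np.ndarray:
        return weights + self.sigma * rng.standard_normal(weights.shape)

    def adapt(self, distance: float) -> None:
        # Too little effect on the actions -> more noise; too much -> less noise.
        if distance <= self.target_distance:
            self.sigma *= self.alpha
        else:
            self.sigma /= self.alpha


rng = np.random.default_rng(0)
weights = rng.standard_normal((4, 2))   # 4 state features -> 2 actions
states = rng.standard_normal((64, 4))   # batch of states, e.g. from a replay buffer
noise = AdaptiveParameterNoise(sigma=0.01, target_distance=0.2)
for _ in range(200):                    # e.g. once per rollout
    perturbed = noise.perturb(weights, rng)  # act with `perturbed` during the rollout
    noise.adapt(action_distance(weights, perturbed, states))
print(f"adapted sigma: {noise.sigma:.4f}")
```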

2.2 OpenAI Baselines

Implementation of Parameter Space Noise for DDPG (image from the OpenAI blog article). ▶️ Considerable overhead, and the OpenAI Baselines repository is no longer actively developed.

3. NoisyNet: Noisy Networks for Exploration

Published by Fortunato et al. (2019), later followed by a US patent from DeepMind (2019, 2024).

Stochastic network layers for exploration.

▶️ Would require a manual implementation. It appears to be used in other, more recent publications.
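To give an idea of the manual implementation effort, a forward-pass-only NumPy sketch of a factorised-Gaussian noisy linear layer, y = (μ_w + σ_w ⊙ ε_w) x + μ_b + σ_b ⊙ ε_b, with ε_w = f(ε_out) f(ε_in)ᵀ, f(x) = sgn(x)·√|x| and the initialisation described in the paper. In a real agent μ and σ would be learnable parameters of a PyTorch module trained by backpropagation; that part is omitted, and the class name is hypothetical.

```python
import numpy as np


def _scale_noise(x: np.ndarray) -> np.ndarray:
    """f(x) = sgn(x) * sqrt(|x|), applied to the factorised noise vectors."""
    return np.sign(x) * np.sqrt(np.abs(x))


class NoisyLinear:
    """Forward-pass-only sketch of a factorised-Gaussian noisy linear layer."""

    def __init__(self, in_features: int, out_features: int, sigma_0: float = 0.5, rng=None):
        self.rng = rng if rng is not None else np.random.default_rng()
        bound = 1.0 / np.sqrt(in_features)
        # Learnable in a real implementation: means and noise scales.
        self.mu_w = self.rng.uniform(-bound, bound, (out_features, in_features))
        self.mu_b = self.rng.uniform(-bound, bound, out_features)
        self.sigma_w = np.full((out_features, in_features), sigma_0 * bound)
        self.sigma_b = np.full(out_features, sigma_0 * bound)
        self.reset_noise()

    def reset_noise(self) -> None:
        """Draw new noise: eps_w = f(eps_out) f(eps_in)^T, eps_b = f(eps_out)."""
        eps_in = _scale_noise(self.rng.standard_normal(self.mu_w.shape[1]))
        eps_out = _scale_noise(self.rng.standard_normal(self.mu_w.shape[0]))
        self.eps_w = np.outer(eps_out, eps_in)
        self.eps_b = eps_out

    def __call__(self, x: np.ndarray, noisy: bool = True) -> np.ndarray:
        if noisy:
            w = self.mu_w + self.sigma_w * self.eps_w
            b = self.mu_b + self.sigma_b * self.eps_b
        else:  # deterministic output with the mean parameters (a common choice)
            w, b = self.mu_w, self.mu_b
        return x @ w.T + b


layer = NoisyLinear(4, 2, rng=np.random.default_rng(0))
x = np.ones((1, 4))
a1 = layer(x)
layer.reset_noise()  # e.g. once per episode or per update
a2 = layer(x)
print(a1, a2, layer(x, noisy=False))
```

Exploration then comes from the layer itself, so no separate action-noise schedule is needed; the learned σ takes over that role.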

4. PettingZoo

Action noise decay in tutorials with MARL libraries.

4.1 AgileRL

e.g. MATD3: exponential decay “manually implemented” in the tutorial; AgileRL itself offers no scheduler
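A sketch of such a manual exponential decay, assuming the noise standard deviation is multiplied by a constant factor after each episode and clipped at a floor. The function name and all values are hypothetical and not taken from the AgileRL tutorial.

```python
def decay_noise(noise_std: float, decay: float = 0.995, min_std: float = 0.01) -> float:
    """One exponential decay step with a lower bound (hypothetical values)."""
    return max(min_std, noise_std * decay)


noise_std = 0.3
for episode in range(1000):
    # ... run the episode with exploration noise of scale noise_std, then learn ...
    noise_std = decay_noise(noise_std)
print(noise_std)  # 0.01 (the floor is reached well before episode 1000)
```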

4.2.1 Ray (PettingZoo x DQN):

EpsilonGreedy with annealing

4.2.2 Ray (in total 17 possible choices of exploration with 5 different general schedulers):

e.g. GaussianNoise with PiecewiseSchedule (connected linear schedules).
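A dependency-free sketch of a piecewise-linear schedule in the spirit of Ray's `PiecewiseSchedule`, which linearly interpolates between (timestep, value) endpoints. This is not Ray's implementation: here the first/last value is simply held outside the covered range, whereas Ray uses a configurable outside value. Epsilon annealing for `EpsilonGreedy` follows the same pattern with a single linear segment. The function name and endpoints are hypothetical.

```python
def piecewise_linear(t: float, endpoints: list) -> float:
    """Linearly interpolate between sorted (timestep, value) endpoints;
    hold the first/last value outside the covered range."""
    if t <= endpoints[0][0]:
        return endpoints[0][1]
    for (t0, v0), (t1, v1) in zip(endpoints, endpoints[1:]):
        if t0 <= t < t1:
            return v0 + (t - t0) / (t1 - t0) * (v1 - v0)
    return endpoints[-1][1]


# Hypothetical noise-scale schedule: fast drop early, slow refinement later.
schedule = [(0, 0.3), (10_000, 0.1), (50_000, 0.01)]
for t in (0, 5_000, 10_000, 30_000, 100_000):
    print(t, round(piecewise_linear(t, schedule), 4))
```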

Summary

NoisyNet and Parameter Space Noise are interesting, but it is unclear whether the implementation effort would be justified. In general, a decaying schedule for the action noise should be implemented to improve performance.

▶️ For now, a simple (e.g. linear) but effective schedule for the action noise scale is preferred.

If early stopping is enabled, a warning should be issued: since the schedule is defined over a fixed number of timesteps/episodes, stopping early means the noise decay was not completed, and results might improve with the full schedule.
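A minimal sketch of that warning, assuming a linear decay defined over `total_episodes` and a hypothetical helper called once training ends; it uses Python's standard `warnings` module. Function names and default values are illustrative only.

```python
import warnings


def linear_noise_scale(episode: int, total_episodes: int,
                       initial: float = 0.5, final: float = 0.0) -> float:
    """Linear decay from `initial` to `final` over the planned number of episodes."""
    progress = min(episode / total_episodes, 1.0)
    return initial + progress * (final - initial)


def warn_if_noise_decay_incomplete(last_episode: int, total_episodes: int) -> bool:
    """Hypothetical check after training; returns True if a warning was issued."""
    if last_episode >= total_episodes:
        return False
    warnings.warn(
        f"Early stopping ended training after {last_episode}/{total_episodes} episodes; "
        f"the action noise scale was still {linear_noise_scale(last_episode, total_episodes):.3f} "
        "instead of its final value, so results may improve without early stopping.",
        RuntimeWarning,
    )
    return True


warn_if_noise_decay_incomplete(last_episode=40, total_episodes=100)  # issues a warning
```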

mthede commented 4 days ago

Learning Rate Decay

Side note: we currently use the Adam optimizer, which automatically adapts the effective learning rate of each parameter. However, additional scheduling of the learning rate may still improve performance, as discussed here.

1.1 Stable-Baselines3

General schedule function that can be used for learning rate decay
Default: constant
Implementations: TD3 --> constant; DQN --> linear decay (of the exploration rate ε; the DQN learning rate itself defaults to constant)
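In SB3, a schedule is a callable that receives `progress_remaining`, which goes from 1 at the start to 0 at the end of training, and returns the current value. A dependency-free sketch of the two schedules mentioned above, following the linear-schedule example from the SB3 documentation; the function names are illustrative.

```python
from typing import Callable

Schedule = Callable[[float], float]  # progress_remaining (1 -> 0) -> value


def constant_schedule(value: float) -> Schedule:
    """What SB3 effectively does when a plain float is given: same value throughout."""
    return lambda progress_remaining: value


def linear_schedule(initial_value: float) -> Schedule:
    """Linear decay from initial_value (start) to 0 (end)."""
    def func(progress_remaining: float) -> float:
        return progress_remaining * initial_value
    return func


total_timesteps = 1_000
lr = linear_schedule(1e-3)
for step in (0, 500, 999):
    progress_remaining = 1.0 - step / total_timesteps
    print(step, lr(progress_remaining))
```

Such a callable can be passed to SB3 instead of a float, e.g. `TD3("MlpPolicy", env, learning_rate=linear_schedule(1e-3))`.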

1.2 SB3 Zoo (Training Framework)

Linear and constant schedules used
Choice of scheduler included in hyperparameter tuning, e.g. for PPO

2. Ray RL Library Scheduler

General scheduling capabilities
See comment on action noise decay

3. PyTorch LR Scheduler

15 different schedulers available
"Learning rate scheduling should be applied after optimizer’s update"

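A minimal sketch of the recommended call order with one of PyTorch's built-in schedulers: `LinearLR` decays the learning rate of an Adam optimizer linearly to 10 % of its initial value over 10 updates. The toy model, data and values are placeholders.

```python
import torch

torch.manual_seed(0)
model = torch.nn.Linear(4, 1)  # placeholder network
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
# Decay the learning rate linearly to 10 % of its initial value over 10 updates.
scheduler = torch.optim.lr_scheduler.LinearLR(
    optimizer, start_factor=1.0, end_factor=0.1, total_iters=10
)
x, y = torch.randn(32, 4), torch.randn(32, 1)  # placeholder batch

for update in range(10):
    optimizer.zero_grad()
    loss = torch.nn.functional.mse_loss(model(x), y)
    loss.backward()
    optimizer.step()  # 1) optimizer update first ...
    scheduler.step()  # 2) ... then advance the learning rate schedule

print(scheduler.get_last_lr())  # close to 1e-4
```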
4. Learning to Learn Learning-Rate Schedules

Series of conference papers on learning learning rates. Latest publication on a GreedyLR scheduler that ...

... outperforms several state-of-the-art schedulers in terms of accuracy, speed, and convergence. Furthermore, our method is easy to implement, computationally efficient, and requires minimal hyperparameter tuning.

This seems promising but unfortunately no code is provided. It is based on PyTorch's ReduceLROnPlateau scheduler.
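For reference, a sketch of PyTorch's `ReduceLROnPlateau` driven by a validation reward. With `mode="max"` the learning rate is multiplied by `factor` once the reward has not improved for more than `patience` evaluations. This is the built-in scheduler GreedyLR builds on, not GreedyLR itself; the constant rewards are placeholders simulating a plateau.

```python
import torch

model = torch.nn.Linear(4, 1)  # placeholder network
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
# mode="max": a higher validation reward counts as an improvement.
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="max", factor=0.5, patience=2
)

lrs = []
for validation_reward in [1.0, 1.0, 1.0, 1.0, 1.0]:  # placeholder: a plateau
    # ... training updates between validations would go here ...
    scheduler.step(validation_reward)  # pass the monitored metric
    lrs.append(optimizer.param_groups[0]["lr"])
print(lrs)  # learning rate halves after the 4th evaluation
```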

Summary

Learning rate scheduling can be beneficial (also with the Adam optimizer) and can reduce runtimes. Current developments show great potential, but an implementation would need to be derived from the publication. PyTorch offers many schedulers out of the box, but only for the learning rate.

The SB3 implementation provides a general scheduler that can be used for both learning rate and action noise decay. ▶️ As we want to implement it for the action noise decay anyway, it can be used for the learning rate as well. Whether to switch to PyTorch's built-in LR schedulers can be discussed in the future.
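A sketch of using one schedule definition for both quantities, assuming an SB3-style callable over `progress_remaining` and writing the scheduled learning rate into `optimizer.param_groups` by hand (similar to what SB3 does internally). Names, start/end values and the episode loop are hypothetical.

```python
import torch


def linear_schedule(start: float, end: float):
    """SB3-style schedule: maps progress_remaining (1 -> 0) to a value (start -> end)."""
    def func(progress_remaining: float) -> float:
        return end + progress_remaining * (start - end)
    return func


model = torch.nn.Linear(4, 1)  # placeholder network
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
lr_schedule = linear_schedule(start=1e-3, end=1e-4)    # hypothetical values
noise_schedule = linear_schedule(start=0.3, end=0.05)  # hypothetical values

total_episodes = 100
for episode in range(total_episodes):
    progress_remaining = 1.0 - episode / total_episodes
    # Same schedule mechanism for both quantities:
    for group in optimizer.param_groups:
        group["lr"] = lr_schedule(progress_remaining)
    noise_scale = noise_schedule(progress_remaining)
    # ... act with exploration noise scaled by noise_scale, then run optimizer updates ...

print(optimizer.param_groups[0]["lr"], noise_scale)
```

Keeping the schedule independent of PyTorch means the same function can drive the noise scale, which PyTorch's LR schedulers cannot.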

assume-framework / assume

Early Stopping, Learning rate and noise decay #398
