EdanToledo/Stoix

Distributed Single-Agent Reinforcement Learning End-to-End in JAX

**_stoic - a person who can endure pain or hardship without showing their feelings or complaining._**

Welcome to Stoix! 🏛️

Stoix provides simplified code for quickly iterating on ideas in single-agent reinforcement learning. It offers implementations of popular single-agent RL algorithms in JAX that can be parallelised across devices with JAX's pmap. All implementations are fully compiled with JAX's jit, which makes training and environment execution very fast, but this requires environments written in JAX. For environments not written in JAX, Stoix offers Sebulba systems (see below). Algorithms and their default hyperparameters have not been tuned for any specific environment; they are intended as a starting point for research and for initial baselines.

To join us in these efforts, please feel free to reach out, raise issues or read our contribution guidelines (or just star 🌟 to stay up to date with the latest developments)!

Stoix is written entirely in JAX, which gives substantial speed improvements compared to other popular libraries. We currently provide native support for the Jumanji environment API and wrappers for popular RL environments.

System Design Paradigms

Stoix offers two primary system design paradigms (Podracer Architectures) to cater to different research and deployment needs:

Anakin: Traditional Stoix implementations are fully end-to-end compiled with JAX, focusing on speed and simplicity with native JAX environments. This design paradigm is ideal for setups where all components, including environments, can be optimized using JAX, leveraging the full power of JAX's pmap and jit. For an illustration of the Anakin architecture, see this figure from the Mava technical report.
Sebulba: The Sebulba system introduces flexibility by allowing different devices to be assigned specifically for learning and acting. In this setup, acting devices serve as inference servers for multiple parallel environments, which can be written in any framework, not just JAX. This enables Stoix to be used with a broader range of environments while still benefiting from JAX's speed. For an illustration of the Sebulba architecture, see this animation from the InstaDeep Sebulba implementation.

Not all algorithms have both Anakin and Sebulba implementations, but effort has gone into making the two variants as similar as possible to allow easy conversion.
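As a rough illustration of the Anakin idea, the sketch below compiles a whole rollout with jit-style tracing, vectorises it over a batch of environments with vmap, and replicates it over local devices with pmap. The toy environment, the linear policy, and all function names are hypothetical and are not taken from the Stoix code base.

```python
import jax
import jax.numpy as jnp


def env_step(state, action):
    """Hypothetical toy JAX environment: a point on a line, rewarded for staying near 0."""
    next_state = state + action
    reward = -jnp.abs(next_state)
    return next_state, reward


def rollout(params, init_state, num_steps=8):
    """Unroll a linear policy for a fixed number of steps using lax.scan."""

    def step(state, _):
        action = -params * state
        next_state, reward = env_step(state, action)
        return next_state, reward

    _, rewards = jax.lax.scan(step, init_state, None, length=num_steps)
    return rewards.sum()


# vmap over the environments on one device, then pmap over the local devices.
batched_rollout = jax.vmap(rollout, in_axes=(None, 0))
distributed_rollout = jax.pmap(batched_rollout, in_axes=(None, 0))

num_devices = jax.local_device_count()
num_envs = 4
init_states = jnp.ones((num_devices, num_envs))
returns = distributed_rollout(jnp.float32(0.5), init_states)
print(returns.shape)  # (num_devices, num_envs)
```

Because the environment step is itself a JAX function, the policy and the environment are traced and compiled together, which is what makes this pattern possible only for JAX-native environments.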

Code Philosophy 🧘

The current code in Stoix was initially largely taken and subsequently adapted from Mava. As Mava develops, Stoix will hopefully adopt their optimisations that are relevant for single-agent RL. Like Mava, Stoix is not designed to be a highly modular library and is not meant to be imported. Our repository focuses on simplicity and clarity in its implementations while utilising the advantages offered by JAX such as pmap and vmap, making it an excellent resource for researchers and practitioners to build upon. Stoix follows a similar design philosophy to CleanRL and PureJaxRL, where we allow for some code duplication to enable readability, easy reuse, and fast adaptation. A notable difference between Stoix and other single-file libraries is that Stoix makes use of abstraction where relevant. It is not intended to be purely educational; research utility is the primary focus. In particular, abstraction is currently used for network architectures, environments, logging, and evaluation.

Overview 🦜

Stoix TLDR

Algorithms: Stoix offers easily hackable, single-file implementations of popular algorithms in pure JAX. You can vectorize algorithm training on a single device using vmap as well as distribute training across multiple devices with pmap (or both). Multi-host support (i.e., vmap/pmap over multiple devices and machines) is coming soon! All implementations include checkpointing to save and resume parameters and training runs.
System Designs: Choose between Anakin systems for fully JAX-optimized workflows or Sebulba systems for flexibility with non-JAX environments.
Hydra Config System: Leverage the Hydra configuration system for efficient and consistent management of experiments, network architectures, and environments. Hydra facilitates the easy addition of new hyperparameters and supports multi-runs and Optuna hyperparameter optimization. No more need to create large bash scripts to run a series of experiments with differing hyperparameters, network architectures or environments.
Advanced Logging: Stoix features advanced and configurable logging, ready for output to the terminal, TensorBoard, and other ML tracking dashboards (WandB and Neptune). It also supports logging experiments in JSON format ready for statistical tests and generating RLiable plots (see the plotting notebook). This enables statistically confident comparisons of algorithms natively.
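To make the Sebulba option in the list above more concrete, here is a minimal sketch of the actor/learner split: actor threads step a non-JAX environment, call a jitted policy for inference, and push rollouts into a bounded queue that a learner consumes. The environment class, the policy, and the queue size are all hypothetical stand-ins and do not reflect how Stoix implements its Sebulba systems.

```python
import queue
import threading

import jax
import jax.numpy as jnp
import numpy as np


class CountingEnv:
    """Hypothetical non-JAX environment written in plain NumPy."""

    def __init__(self):
        self.t = 0

    def reset(self):
        self.t = 0
        return np.zeros(2, dtype=np.float32)

    def step(self, action):
        self.t += 1
        obs = np.full(2, self.t, dtype=np.float32)
        done = self.t >= 5  # fixed episode length of 5 steps
        return obs, float(action), done


@jax.jit
def policy(params, obs):
    # Jitted inference: greedy action from a linear layer.
    return jnp.argmax(obs @ params)


def actor(params, rollout_queue, num_episodes):
    """Run episodes in a Python environment and send reward sequences to the learner."""
    env = CountingEnv()
    for _ in range(num_episodes):
        obs, done, rewards = env.reset(), False, []
        while not done:
            action = int(policy(params, obs))
            obs, reward, done = env.step(action)
            rewards.append(reward)
        rollout_queue.put(np.asarray(rewards, dtype=np.float32))


params = jnp.ones((2, 3))
rollout_queue = queue.Queue(maxsize=4)  # bounded queue acts as a simple pipeline
threads = [
    threading.Thread(target=actor, args=(params, rollout_queue, 2)) for _ in range(2)
]
for thread in threads:
    thread.start()

# Learner side: consume the four rollouts produced by the two actor threads.
rollouts = [rollout_queue.get() for _ in range(4)]
for thread in threads:
    thread.join()
print(len(rollouts), rollouts[0].shape)
```

In a real system the learner would run gradient updates on its own devices and send fresh parameters back to the actors; that part is omitted here to keep the sketch short.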

Stoix currently offers the following building blocks for Single-Agent RL research:

Implementations of Algorithms 🥑

Deep Q-Network (DQN) - Paper
Double DQN (DDQN) - Paper
Dueling DQN - Paper
Categorical DQN (C51) - Paper
Munchausen DQN (M-DQN) - Paper
Quantile Regression DQN (QR-DQN) - Paper
DQN with Regularized Q-learning (DQN-Reg) - Paper
Rainbow - Paper
REINFORCE With Baseline - Paper
Deep Deterministic Policy Gradient (DDPG) - Paper
Twin Delayed DDPG (TD3) - Paper
Distributed Distributional DDPG (D4PG) - Paper
Soft Actor-Critic (SAC) - Paper
Proximal Policy Optimization (PPO) - Paper
Discovered Policy Optimization (DPO) - Paper
Maximum a Posteriori Policy Optimisation (MPO) - Paper
On-Policy Maximum a Posteriori Policy Optimisation (V-MPO) - Paper
Advantage-Weighted Regression (AWR) - Paper
AlphaZero - Paper
MuZero - Paper
Sampled Alpha/Mu-Zero - Paper

Environment Wrappers 🍬

Stoix offers wrappers for:

JAX environments: Gymnax, Jumanji, Brax, XMinigrid, Craftax, POPJym, Navix and even JAXMarl (although using Centralised Controllers).
Non-JAX environments: Envpool and Gymnasium.

Statistically Robust Evaluation 🧪

Stoix natively supports logging to JSON files that adhere to the standard suggested by Gorsane et al. (2022). This enables easy downstream experiment plotting and aggregation using the tools found in the MARL-eval library.

Performance and Speed 🚀

As the code in Stoix (at the time of creation) was in essence a port of Mava, for further speed comparisons we point to their repo. Additionally, we refer to the PureJaxRL blog post here where the speed benefits of end-to-end JAX systems are discussed. Lastly, we point to the Podracer architectures paper here where these ideas were first discussed and benchmarked.

Below we provide some plots illustrating that Stoix performs on par with PureJaxRL, with the added benefit that the code is already set up for pmap distribution over devices, alongside the other features provided (algorithm implementations, logging, config system, etc).

[Figures: ppo, dqn]

We have also included a plot of the training time for 5e5 steps of PPO as the number of environments is scaled. PureJaxRL does not pmap and thus runs on a single device.

[Figure: env_scaling]

Lastly, please keep in mind for practical use that current networks and hyperparameters for algorithms have not been tuned.

Installation 🎬

At the moment Stoix is not meant to be installed as a library, but rather to be used as a research tool.

You can use Stoix by cloning the repo and pip installing as follows:

git clone https://github.com/EdanToledo/Stoix.git
cd Stoix
pip install -e .

We have tested Stoix on Python 3.10. Note that because the installation of JAX differs depending on your hardware accelerator, we advise users to explicitly install the correct JAX version (see the official installation guide).
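After installing JAX for your accelerator, a quick check such as the one below can confirm which backend and how many local devices JAX actually sees. This is only a sanity check using standard JAX calls, not part of the Stoix setup itself.

```python
import jax

# Backend selected by JAX ("cpu", "gpu" or "tpu").
backend = jax.default_backend()
# Number of devices available to pmap on this host.
num_devices = jax.local_device_count()

print(f"backend: {backend}")
print(f"local devices: {num_devices}")
print(jax.devices())
```

If the backend reads "cpu" on a machine with a GPU or TPU, the accelerator-specific JAX build is most likely not installed.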

Quickstart ⚡

To get started with training your first Stoix system, simply run one of the system files, for example:

For an Anakin system:

python stoix/systems/ppo/anakin/ff_ppo.py

or for a Sebulba system:

python stoix/systems/ppo/sebulba/ff_ppo.py arch=sebulba env=envpool/pong network=visual_resnet

Stoix makes use of Hydra for config management. In order to see our default system configs please see the stoix/configs/ directory. A benefit of Hydra is that configs can either be set in config yaml files or overwritten from the terminal on the fly. For an example of running a system on the CartPole environment and changing any hyperparameters, the above code can simply be adapted as follows:

python stoix/systems/ppo/anakin/ff_ppo.py env=gymnax/cartpole system.rollout_length=32 system.decay_learning_rates=True

Additionally, certain variants such as Dueling DQN are selected through the network architecture, while the underlying algorithm stays the same. For example, if you wanted to run Dueling DQN you would simply do:

python stoix/systems/q_learning/ff_dqn.py network=mlp_dueling_dqn

or if you wanted to do dueling C51, you could do:

python stoix/systems/q_learning/ff_c51.py network=mlp_dueling_c51
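For context on why a network swap is enough here: the dueling architecture, as described in the Dueling DQN paper, only changes how the network head produces Q-values, combining a state-value stream V and an advantage stream A as Q = V + A - mean(A). The sketch below shows that aggregation in JAX; the function name is hypothetical and this is not Stoix's own network code.

```python
import jax.numpy as jnp


def dueling_q_values(value, advantages):
    """Combine a state value (shape [..., 1]) with per-action advantages (shape [..., A])."""
    # Subtracting the mean advantage makes the V/A decomposition identifiable.
    return value + advantages - advantages.mean(axis=-1, keepdims=True)


value = jnp.array([[1.0]])
advantages = jnp.array([[1.0, 2.0, 3.0]])
q_values = dueling_q_values(value, advantages)
greedy_action = jnp.argmax(q_values, axis=-1)
print(q_values, greedy_action)
```

The Q-learning loss and the action selection downstream are unchanged, since they only consume the resulting Q-values.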

Important Considerations

If your environment does not have a timestep limit or is not guaranteed to end through some game mechanic, evaluation can appear to hang forever and stall training, when in fact your agent is simply so good or so bad that the episode never finishes. Keep this in mind if you are seeing this behaviour. One solution is to add a time step limit or potentially action masking.
Due to the way Stoix is set up, you are not guaranteed to run for exactly the number of timesteps you set. A warning at the beginning of a run reports the actual number of timesteps that will be run. This value will always be less than or equal to the specified sample budget. To run the exact number of transitions, ensure that the number of timesteps is divisible by rollout length * total_num_envs, and additionally ensure that the number of evaluations spaced throughout training evenly divides the number of updates to be performed. To see the exact calculation, see the file total_timestep_checker.py. This will give an indication of how the actual number of timesteps is calculated and how you can easily set it up to run the exact amount you desire. It's relatively trivial to do so, but it is important to keep in mind.
Optimising the performance and speed of Sebulba systems can be a little tricky, as you need to balance the pipeline size, the number of actor threads, etc., so keep this in mind when applying an algorithm to a new problem.
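The timestep consideration above can be sketched with a few lines of integer arithmetic. This is an approximation based only on the divisibility conditions described in this section, not a copy of total_timestep_checker.py, and the function and argument names are hypothetical.

```python
def actual_timesteps(total_timesteps, rollout_length, total_num_envs, num_evaluation):
    """Estimate how many timesteps would really be run for a given sample budget."""
    # Each update consumes one rollout from every environment.
    steps_per_update = rollout_length * total_num_envs
    num_updates = total_timesteps // steps_per_update
    # Updates are split evenly between evaluations; any remainder is dropped.
    updates_per_eval = num_updates // num_evaluation
    return updates_per_eval * num_evaluation * steps_per_update


# Budget not divisible by rollout_length * total_num_envs: fewer steps are run.
print(actual_timesteps(1_000_000, 32, 1024, 10))  # 983040
# Evaluations that do not divide the number of updates lose further steps.
print(actual_timesteps(1_000_000, 32, 1024, 7))   # 917504
# A budget that satisfies both conditions is run exactly.
print(actual_timesteps(655_360, 32, 1024, 10))    # 655360
```

Under these assumptions the result is never larger than the requested budget, which matches the behaviour described above.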

Contributing 🤝

Please read our contributing docs for details on how to submit pull requests, our Contributor License Agreement and community guidelines.

Roadmap 🛤️

We plan to iteratively expand Stoix in the following increments:

🌴 Support for more environments as they become available.
🔁 More robust recurrent systems.
- [ ] Add recurrent variants of all systems
- [ ] Allow easy interchangeability of recurrent cells/architecture via config
📊 Benchmarks on more environments.
- [ ] Create leaderboard of algorithms
🦾 More algorithm implementations:
- [ ] Muesli - Paper
- [ ] DreamerV3 - Paper
- [ ] R2D2 - Paper
🎮 Self-play 2-player Systems for board games.

Please do follow along as we develop this next phase!

Citing Stoix 📚

If you use Stoix in your work, please cite us:

@misc{toledo2024stoix,
    title={Stoix: Distributed Single-Agent Reinforcement Learning End-to-End in JAX},
    doi = {10.5281/zenodo.10916257},
    author={Edan Toledo},
    month = apr,
    year = {2024},
    url = {https://github.com/EdanToledo/Stoix},
}

Acknowledgements 🙏

We would like to thank the authors and developers of Mava as this was essentially a port of their repo at the time of creation. This helped set up a lot of the infrastructure of logging, evaluation and other utilities.