A collection of Reinforcement Learning agents
Install with:

```bash
pip install --user git+https://github.com/eleurent/rl-agents
```

Most experiments can be started by moving to the `scripts` directory (`cd scripts`) and running `python experiments.py`:
```
Usage:
  experiments evaluate <environment> <agent> (--train|--test)
                                             [--episodes <count>]
                                             [--seed <str>]
                                             [--analyze]
  experiments benchmark <benchmark> (--train|--test)
                                    [--processes <count>]
                                    [--episodes <count>]
                                    [--seed <str>]
  experiments -h | --help

Options:
  -h --help              Show this screen.
  --analyze              Automatically analyze the experiment results.
  --episodes <count>     Number of episodes [default: 5].
  --processes <count>    Number of running processes [default: 4].
  --seed <str>           Seed the environments and agents.
  --train                Train the agent.
  --test                 Test the agent.
```
The `evaluate` command evaluates a given agent on a given environment. For instance,
```bash
# Train a DQN agent on the CartPole-v0 environment
$ python3 experiments.py evaluate configs/CartPoleEnv/env.json configs/CartPoleEnv/DQNAgent.json --train --episodes=200
```
Every agent interacts with the environment following a standard interface:
```python
action = agent.act(state)
next_state, reward, done, info = env.step(action)
agent.record(state, action, reward, next_state, done, info)
```
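For reference, a minimal evaluation loop built on this interface could look like the sketch below; it assumes `env` and `agent` have already been instantiated from their configurations, and follows the four-tuple `step` API shown above.

```python
def run_episode(env, agent):
    """Minimal sketch of one episode using the agent/environment interface above."""
    state = env.reset()
    done, total_reward = False, 0.0
    while not done:
        action = agent.act(state)                                      # pick an action
        next_state, reward, done, info = env.step(action)              # step the environment
        agent.record(state, action, reward, next_state, done, info)    # let the agent learn
        state, total_reward = next_state, total_reward + reward
    return total_reward
```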
The environments are described by their gym `id`, and a module to import for registration:
```json
{
    "id": "CartPole-v0",
    "import_module": "gym"
}
```
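For illustration only, such a configuration could be turned into an environment roughly as follows; the helper name `make_env` is hypothetical and not part of rl-agents.

```python
import importlib
import json

import gym

def make_env(config_path):
    """Hypothetical loader: import the registration module, then build the env from its id."""
    with open(config_path) as f:
        config = json.load(f)
    if "import_module" in config:
        importlib.import_module(config["import_module"])  # triggers environment registration
    return gym.make(config["id"])
```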
And the agents are described by their class and a configuration dictionary:
```json
{
    "__class__": "<class 'rl_agents.agents.deep_q_network.pytorch.DQNAgent'>",
    "model": {
        "type": "MultiLayerPerceptron",
        "layers": [512, 512]
    },
    "gamma": 0.99,
    "n_steps": 1,
    "batch_size": 32,
    "memory_capacity": 50000,
    "target_update": 1,
    "exploration": {
        "method": "EpsilonGreedy",
        "tau": 50000,
        "temperature": 1.0,
        "final_temperature": 0.1
    }
}
```
If keys are missing from these configurations, the values from `agent.default_config()` will be used instead.
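As an illustration of that behaviour, a simple recursive merge of a partial configuration into the defaults could look like this sketch; the actual merging code in rl-agents may differ.

```python
def with_defaults(config, defaults):
    """Hypothetical sketch: fill keys missing from `config` with values from `defaults`."""
    merged = dict(defaults)
    for key, value in config.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = with_defaults(value, merged[key])  # merge nested dictionaries
        else:
            merged[key] = value  # explicit value overrides the default
    return merged
```

For instance, an agent configuration that only sets `"gamma": 0.8` would keep every other field of `agent.default_config()` unchanged.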
Finally, a batch of experiments can be scheduled in a benchmark. All experiments are then executed in parallel on several processes.
```bash
# Run a benchmark of several agents interacting with environments
$ python3 experiments.py benchmark cartpole_benchmark.json --test --processes=4
```
A benchmark configuration file contains a list of environment configurations and a list of agent configurations:

```json
{
    "environments": ["envs/cartpole.json"],
    "agents": ["agents/dqn.json", "agents/mcts.json"]
}
```
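Conceptually, the benchmark dispatches the cross product of environment and agent configurations to a pool of worker processes, along the lines of this sketch; it is not the project's actual code, and `run_evaluation` is a hypothetical stand-in for the `evaluate` command.

```python
from itertools import product
from multiprocessing import Pool

def run_evaluation(job):
    env_config, agent_config = job
    # ... run the equivalent of the `evaluate` command on this (environment, agent) pair ...
    return env_config, agent_config

def run_benchmark(benchmark, processes=4):
    """Evaluate every agent on every environment, in parallel."""
    jobs = list(product(benchmark["environments"], benchmark["agents"]))
    with Pool(processes=processes) as pool:
        return pool.map(run_evaluation, jobs)
```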
There are several tools available to monitor the agent performances:

* Run metadata: the environment and agent configurations used for the run are saved to a `metadata.*.json` file.
* Episode statistics: the main statistics of each run are logged to an `episode_batch.*.stats.json` file. They can be automatically visualised by running `scripts/analyze.py` (or inspected manually, as sketched below).
* Logging: messages are saved to a `logging.*.log` file. Add the `--verbose` option to `scripts/experiments.py` to save with log level DEBUG.
* Tensorboard: the run directory can be visualised with `tensorboard --logdir <path-to-runs-dir>`.
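For a quick manual look at a run, the statistics file can be loaded directly; the snippet below assumes it is a JSON dictionary containing an `episode_rewards` list, which is an assumption about the file format rather than a documented schema.

```python
import json

import matplotlib.pyplot as plt

with open("out/run_directory/episode_batch.0.stats.json") as f:  # illustrative path
    stats = json.load(f)

plt.plot(stats["episode_rewards"])  # assumed key name
plt.xlabel("episode")
plt.ylabel("episode return")
plt.show()
```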
The following agents are currently implemented:
### `VI` Value Iteration

Performs a Value Iteration to compute the state-action value function, and acts greedily with respect to it.

Only compatible with finite-mdp environments, or environments that handle an `env.to_finite_mdp()` conversion method.
Reference: Dynamic Programming, Bellman R., Princeton University Press (1957).
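As a reminder of the underlying computation, a bare-bones value iteration over a finite MDP with transition matrix `P` and reward matrix `R` might look like the generic sketch below (not the project's implementation).

```python
import numpy as np

def value_iteration(P, R, gamma=0.99, iterations=100):
    """Generic sketch: P has shape (S, A, S), R has shape (S, A)."""
    Q = np.zeros_like(R)
    for _ in range(iterations):
        V = Q.max(axis=1)          # greedy state values
        Q = R + gamma * P @ V      # Bellman optimality backup
    return Q                       # act greedily: policy(s) = Q[s].argmax()
```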
### `CEM` Cross-Entropy Method

A sampling-based planning algorithm, in which sequences of actions are drawn from a prior Gaussian distribution. This distribution is iteratively bootstrapped by minimizing its cross-entropy to a target distribution approximated by the top-k candidates.
Only compatible with continuous action spaces. The environment is used as an oracle dynamics and reward model.
Reference: A Tutorial on the Cross-Entropy Method, De Boer P-T., Kroese D.P., Mannor S. and Rubinstein R.Y. (2005).
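For intuition, a generic cross-entropy planner over action sequences could look like the following sketch, which assumes a one-dimensional continuous action space and that the environment can be deep-copied to serve as a simulator; it is not the exact implementation used here.

```python
import copy

import numpy as np

def cem_plan(env, horizon=10, iterations=5, population=100, top_k=10):
    """Generic CEM sketch for a 1D continuous action space."""
    mean, std = np.zeros(horizon), np.ones(horizon)
    for _ in range(iterations):
        candidates = np.random.normal(mean, std, size=(population, horizon))
        returns = []
        for actions in candidates:
            sim, total = copy.deepcopy(env), 0.0        # the environment acts as the model
            for action in actions:
                _, reward, done, _ = sim.step([action])
                total += reward
                if done:
                    break
            returns.append(total)
        elite = candidates[np.argsort(returns)[-top_k:]]     # top-k candidates
        mean, std = elite.mean(axis=0), elite.std(axis=0)    # refit the Gaussian prior
    return mean[0]                                           # first action of the refined plan
```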
### `MCTS` Monte-Carlo Tree Search

A world transition model is leveraged for trajectory search. A look-ahead tree is expanded so as to explore the trajectory space and quickly focus around the most promising moves.
References:
### `UCT` Upper Confidence bounds applied to Trees

The tree is traversed by iteratively applying an optimistic selection rule at each depth, and the value at the leaves is estimated by sampling. Empirical evidence shows that this popular algorithm performs well in many applications, but it has been proved theoretically to achieve a much worse (doubly-exponential) performance than uniform planning on some problems.
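The optimistic selection rule is typically the UCB1 score applied to each child node; the sketch below is a generic version, and the exact bounds and constants used in this repository may differ.

```python
import numpy as np

def ucb_select(children, c=1.414):
    """Generic UCB1 rule: assumes node objects exposing a visit `count` (>= 1) and mean `value`."""
    total_visits = sum(child.count for child in children)
    scores = [child.value + c * np.sqrt(np.log(total_visits) / child.count)
              for child in children]
    return children[int(np.argmax(scores))]   # most promising child under optimism
```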
References:
### `OPD` Optimistic Planning for Deterministic systems

This algorithm is tailored for systems with deterministic dynamics and rewards. It exploits the reward structure to achieve a polynomial regret rate, and behaves efficiently in numerical experiments with dense rewards.
Reference: Optimistic Planning for Deterministic Systems, Hren J., Munos R. (2008).
### `OLOP` Open Loop Optimistic Planning

References:
* Blazing the trails before beating the path: Sample-efficient Monte-Carlo planning, Grill J. B., Valko M., Munos R. (2017).
* Scale-free adaptive planning for deterministic dynamics & discounted rewards, Bartlett P., Gabillon V., Healey J., Valko M. (2019).
### `RVI` Robust Value Iteration

A list of possible finite-mdp models is provided in the agent configuration. The MDP ambiguity set is constrained to be rectangular: different models can be selected at every transition. The corresponding robust state-action value is computed so as to maximize the worst-case total reward.
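The robust backup simply inserts a minimum over the model ambiguity set into the usual Bellman update; a generic sketch over a list of `(P, R)` models (not the project's exact code) follows.

```python
import numpy as np

def robust_value_iteration(models, gamma=0.99, iterations=100):
    """Generic sketch: models is a list of (P, R) pairs, P of shape (S, A, S), R of shape (S, A)."""
    Q = np.zeros_like(models[0][1])
    for _ in range(iterations):
        V = Q.max(axis=1)
        # worst case over the rectangular ambiguity set, chosen independently at every transition
        Q = np.min([R + gamma * P @ V for P, R in models], axis=0)
    return Q
```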
References:
### `DROP` Discrete Robust Optimistic Planning

The MDP ambiguity set is assumed to be finite, and is constructed from a list of modifiers to the true environment. The corresponding robust value is approximately computed by Deterministic Optimistic Planning so as to maximize the worst-case total reward.
References:
### `IRP` Interval-based Robust Planning

We assume that the MDP is a parametrized dynamical system, whose parameter is uncertain and lies in a continuous ambiguity set. We use interval prediction to compute the set of states that can be reached at any time t, given that uncertainty, and leverage it to evaluate and improve a robust policy.

If the system is Linear Parameter-Varying (LPV) with polytopic uncertainty, a fast and stable interval predictor can be designed. Otherwise, sampling-based approaches can be used instead, with an increased computational load.
References:
### `DQN` Deep Q-Network

A neural-network model is used to estimate the state-action value function and produce a greedy optimal policy.
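As a reminder, a vanilla DQN update regresses the value network onto bootstrapped one-step targets computed with a target network; the PyTorch sketch below is a generic version, not necessarily the exact variant implemented in this repository.

```python
import torch
import torch.nn.functional as F

def dqn_loss(value_net, target_net, batch, gamma=0.99):
    """batch: (states, actions, rewards, next_states, dones) tensors; actions int64, dones 0/1 floats."""
    states, actions, rewards, next_states, dones = batch
    q = value_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)    # Q(s, a)
    with torch.no_grad():
        next_q = target_net(next_states).max(dim=1).values              # max_a' Q_target(s', a')
        targets = rewards + gamma * next_q * (1 - dones)                # no bootstrap at terminal states
    return F.smooth_l1_loss(q, targets)
```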
Implemented variants:
References:
### `FTQ` Fitted-Q

A Q-function model is trained by performing each step of Value Iteration as a supervised learning procedure applied to a batch of transitions covering most of the state-action space.

Reference: Tree-Based Batch Mode Reinforcement Learning, Ernst D. et al. (2005).
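Schematically, each fitted-Q iteration regresses a function approximator onto one-step Bellman targets computed over the whole batch of transitions; the sketch below uses a generic scikit-learn style regressor and is not the project's implementation.

```python
import numpy as np

def fitted_q_iteration(regressor, batch, n_actions, gamma=0.99, iterations=50):
    """batch: (states, actions, rewards, next_states, dones) arrays, with discrete actions."""
    states, actions, rewards, next_states, dones = batch
    inputs = np.column_stack([states, actions])
    targets = rewards.copy()
    for _ in range(iterations):
        regressor.fit(inputs, targets)          # one supervised step of Value Iteration
        # recompute the targets with the current Q-function model
        next_q = np.column_stack([
            regressor.predict(np.column_stack([next_states, np.full(len(next_states), a)]))
            for a in range(n_actions)])
        targets = rewards + gamma * (1 - dones) * next_q.max(axis=1)
    return regressor
```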
### `BFTQ` Budgeted Fitted-Q

An adaptation of `FTQ` to the budgeted setting: we maximise the expected reward r of a policy π under the constraint that an expected cost c remains under a given budget β. The policy π(a | s, β) is conditioned on this cost budget β, which can be changed online. To that end, the Q-function model is trained to predict both the expected reward Qr and the expected cost Qc of the optimal constrained policy π.

This agent can only be used with environments that provide a cost signal in their `info` field:
```python
>>> obs, reward, done, info = env.step(action)
>>> info
{'cost': 1.0}
```
Reference: Budgeted Reinforcement Learning in Continuous State Space, Carrara N., Leurent E., Laroche R., Urvoy T., Maillard O-A., Pietquin O. (2019).
If you use this project in your work, please consider citing it with:
```bibtex
@misc{rl-agents,
  author = {Leurent, Edouard},
  title = {rl-agents: Implementations of Reinforcement Learning algorithms},
  year = {2018},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/eleurent/rl-agents}},
}
```