Adversarial environment

The goal is to implement Algorithm 1, which alternates between two phases: first the main agent trains while acting against the adversary, then the adversary trains while acting against the main agent. Stable Baselines3 already implements many policy optimizers. To reuse its training code, let's provide a Gym environment that lets both agents act in a single step and exposes a separate interface for the main agent and for the adversary.

Start by modifying the CartPole environment, since it is essentially the same task as the InvertedPendulum environment used in the original paper.
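As a rough illustration of the plan above, here is a minimal, dependency-free sketch of a cart-pole whose `step` accepts an action from both agents, plus a per-agent view that hides the opponent. Everything here is an assumption made for illustration, not the repo's actual code: the names `TwoAgentCartPole`, `AgentView`, and `adversary_force_mag` are hypothetical, the adversary is modelled as a bounded extra force on the cart, and the dynamics are the classic cart-pole equations with Euler integration. A real version would presumably subclass the Gym CartPole environment instead.

```python
import math
import random


class TwoAgentCartPole:
    """Toy cart-pole in which a main agent and an adversary act every step."""

    GRAVITY, CART_MASS, POLE_MASS = 9.8, 1.0, 0.1
    POLE_HALF_LENGTH, FORCE_MAG, DT = 0.5, 10.0, 0.02
    X_LIMIT, THETA_LIMIT = 2.4, 12 * math.pi / 180

    def __init__(self, adversary_force_mag=1.0, seed=None):
        # adversary_force_mag is an assumed knob: the maximum disturbance force.
        self.adversary_force_mag = adversary_force_mag
        self.rng = random.Random(seed)
        self.state = None

    def reset(self):
        self.state = tuple(self.rng.uniform(-0.05, 0.05) for _ in range(4))
        return self.state

    def step(self, main_action, adversary_action=0.0):
        """main_action is 0 (left) or 1 (right); adversary_action is a float
        clipped to [-1, 1]. The default 0.0 makes the adversary a no-op."""
        x, x_dot, theta, theta_dot = self.state
        force = self.FORCE_MAG if main_action == 1 else -self.FORCE_MAG
        # Assumption: the adversary adds a bounded extra force on the cart.
        force += self.adversary_force_mag * max(-1.0, min(1.0, adversary_action))

        total_mass = self.CART_MASS + self.POLE_MASS
        pole_mass_length = self.POLE_MASS * self.POLE_HALF_LENGTH
        cos_t, sin_t = math.cos(theta), math.sin(theta)
        temp = (force + pole_mass_length * theta_dot ** 2 * sin_t) / total_mass
        theta_acc = (self.GRAVITY * sin_t - cos_t * temp) / (
            self.POLE_HALF_LENGTH
            * (4.0 / 3.0 - self.POLE_MASS * cos_t ** 2 / total_mass)
        )
        x_acc = temp - pole_mass_length * theta_acc * cos_t / total_mass

        # Explicit Euler integration of the four state variables.
        self.state = (
            x + self.DT * x_dot,
            x_dot + self.DT * x_acc,
            theta + self.DT * theta_dot,
            theta_dot + self.DT * theta_acc,
        )
        done = (abs(self.state[0]) > self.X_LIMIT
                or abs(self.state[2]) > self.THETA_LIMIT)
        # The reward is reported from the main agent's point of view.
        return self.state, 1.0, done, {}


class AgentView:
    """Single-agent interface: the opponent's policy is fixed inside the view."""

    def __init__(self, env, opponent_policy, role):
        assert role in ("main", "adversary")
        self.env, self.opponent_policy, self.role = env, opponent_policy, role
        self.last_obs = None

    def reset(self):
        self.last_obs = self.env.reset()
        return self.last_obs

    def step(self, action):
        # The frozen opponent picks its action from the same observation.
        opponent_action = self.opponent_policy(self.last_obs)
        if self.role == "main":
            obs, reward, done, info = self.env.step(action, opponent_action)
        else:
            obs, reward, done, info = self.env.step(opponent_action, action)
            reward = -reward  # zero-sum: the adversary gets the negated reward
        self.last_obs = obs
        return obs, reward, done, info
```

The view is what a single-agent training loop would see: `AgentView(env, adversary_policy, role="main")` for the main agent's phase and `AgentView(env, main_policy, role="adversary")` for the adversary's phase, both sharing one underlying environment.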

[x] Interface that allows both agents to act in each step. It's OK if the adversarial agent is a no-op for now.
[x] Modify CartPole code to apply adversarial action
[x] Implement the training described in Algorithm 1. First train the main agent while the adversary acts for N_u iterations, then train the adversary while the main agent acts for N_v iterations.
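The alternating schedule in the last item can be sketched as below. This is only an outline under stated assumptions: `train_rarl`, `train_main`, and `train_adversary` are hypothetical names, each callable is assumed to run one policy-optimization iteration for its agent while the other agent stays frozen, and the outer `n_iter` loop is assumed to wrap the two phases. With Stable Baselines3, each callable would presumably wrap a call to the model's `learn(total_timesteps=...)` on that agent's view of the environment.

```python
def train_rarl(train_main, train_adversary, n_iter, n_u, n_v):
    """Alternate the two training phases and return the order they ran in.

    train_main / train_adversary are hypothetical zero-argument callables
    that each perform one training iteration for the corresponding agent.
    """
    schedule = []
    for _ in range(n_iter):
        # Phase 1: the main agent learns while the adversary only acts.
        for _ in range(n_u):
            train_main()
            schedule.append("main")
        # Phase 2: the adversary learns while the main agent only acts.
        for _ in range(n_v):
            train_adversary()
            schedule.append("adversary")
    return schedule


# Usage with stub trainers that only count how often they are called.
calls = {"main": 0, "adversary": 0}


def stub_main():
    calls["main"] += 1


def stub_adversary():
    calls["adversary"] += 1


schedule = train_rarl(stub_main, stub_adversary, n_iter=2, n_u=3, n_v=1)
```

Passing the trainers in as callables keeps the schedule independent of the optimizer, so the same loop works whether the agents are trained with PPO, TRPO, or a stub used for testing.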
