billtubbs / gym-CartPole-bt-v0

A modified version of the cart-pole OpenAI Gym environment for testing different control policies

Add option for stochastic disturbances to the state #6

Open billtubbs opened 3 years ago

billtubbs commented 3 years ago

I.e.

d(k) = C(z)/D(z) e(k)

where C(z) and D(z) are polynomials that can be specified and e(k) is a random noise input at step k. The resulting disturbance d(k) is added to the angle state:

x3(k) = x3(k) + d(k)
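
A minimal sketch of how such a filtered disturbance could be generated one step at a time with scipy.signal.lfilter, assuming C(z) and D(z) are supplied as coefficient arrays (the coefficients, noise scale and helper function below are illustrative, not code from the repository):

import numpy as np
from scipy.signal import lfilter

# Illustrative polynomial coefficients (assumed, not from the repo)
C = np.array([1.0, 0.5])   # numerator C(z)
D = np.array([1.0, -0.9])  # denominator D(z)

rng = np.random.default_rng(0)
zi = np.zeros(max(len(C), len(D)) - 1)  # filter state carried between time steps

def next_disturbance(sigma=0.01):
    """Draw one white-noise sample e(k) and return d(k) = C(z)/D(z) e(k)."""
    global zi
    e = rng.normal(scale=sigma, size=1)
    d, zi = lfilter(C, D, e, zi=zi)
    return d[0]

# In the environment's step() the disturbance would then be added to the angle state:
# x[2] += next_disturbance()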

mg64ve commented 2 years ago

Hello @billtubbs, interesting project. Is this feature already added or still in the planning phase?

billtubbs commented 2 years ago

Hi @mg64ve. Thanks. I haven't added this feature yet. Currently there are only options for normal (Gaussian) random disturbances of the states. See the ReadMe for the versions of the environment that are currently available. It also describes some other developments I am working on.

I recently added a partially-observable version. When I get some time I'm going to write a blog post about the performance of reinforcement learning algorithms on this control problem.

Can you comment more on what you would be interested in seeing?

mg64ve commented 2 years ago

I have a passion for ML/RL and I am studying Ray/RLlib. I noticed that all the examples in RLlib are quite simple and do not include noise. That is how I came across your OpenAI Gym environment and saw that it already contains random noise. I would like to know more about how you plan to add stochastic noise. I am also very interested in the results of your tests with RL; could you share something before writing the blog post?

billtubbs commented 2 years ago

These environments are compatible with ray/rllib. You can try them yourself! You just need to register them in ray first as follows:

import gym
from gym_CartPole_BT.envs.cartpole_bt_env import CartPoleBTEnv

from ray import tune
from ray.rllib.agents.ppo import PPOTrainer
from ray.tune.registry import register_env

def env_creator(env_config):
    return CartPoleBTEnv(env_config)  # return an env instance

env_name = 'CartPole-BT-dL-v0'
register_env(env_name, env_creator)

# choose environment parameters
env_config = {
    'description': "Basic cart-pendulum system with low random disturbance",
    'disturbances': 'low'
}

analysis = tune.run(
    PPOTrainer, 
    stop={"training_iteration": 100}, 
    config={"env": env_name, "env_config": env_config, "framework": "torch"}
)

I've only done a few simulations so far with default parameters. I would be interested to see if you can train a good RL agent on any of these environments. For me, it is taking a lot of training, but I haven't done any hyperparameter tuning yet.

mg64ve commented 2 years ago

Honestly @billtubbs, I have already done some tests today with PPO+LSTM but I haven't got good results, so I need to understand the problem better. With no random disturbances and no variance in the initial state, how does this gym environment differ from the traditional OpenAI cart-pole? It is almost night here; I will continue my tests tomorrow.

billtubbs commented 2 years ago

The differences between this and OpenAI gym's cart pole are described on the ReadMe page.

The main differences are:

  1. In this environment, the horizontal position of the cart must also be controlled (i.e. kept close to zero)
  2. The action space is continuous and constrained (I'm not sure how RL algorithms handle this).
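
For a quick look at these differences, something like the following should work, assuming the package registers its environments on import, as custom Gym environments usually do:

import gym
import gym_CartPole_BT  # assumed to register the CartPole-BT-* environments with Gym

env = gym.make('CartPole-BT-dL-v0')
print(env.action_space)       # a bounded, continuous Box, unlike CartPole-v1's Discrete(2)
print(env.observation_space)  # state observation: cart position, cart velocity, pole angle, angular velocity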

mg64ve commented 2 years ago

I believe this environment is unstable by definition, @billtubbs. Basically, if I run your test_run_lqr.py I get the following:

Initializing environment 'CartPole-BT-dL-v0'...
  k      x  theta     u reward cum_reward
------------------------------------------
  1:  0.000 3.142  -0.0   0.00        0.0
  2:  0.000 3.142   0.2   0.00        0.0
  3:  0.000 3.142  -0.6   0.00        0.0
  4:  0.000 3.142  -0.1   0.00        0.0
  5: -0.000 3.142   0.1   0.00        0.0
  6: -0.000 3.141   0.1   0.00        0.0
  7: -0.001 3.141   0.4   0.00        0.0
  8: -0.001 3.141  -0.2   0.00        0.0
  9: -0.001 3.141  -0.7   0.00        0.0
 10: -0.001 3.141   0.7   0.00        0.0
 11: -0.001 3.141  -0.1   0.00        0.0
 12: -0.001 3.141   0.2   0.00        0.0
 13: -0.002 3.141   0.1   0.00        0.0
 14: -0.002 3.141   0.0   0.00        0.0
 15: -0.002 3.141   0.5   0.00        0.0
 16: -0.001 3.141  -0.5   0.00        0.0
 17: -0.001 3.141   0.0   0.00        0.0
 18: -0.001 3.141   0.1   0.00        0.0
 19: -0.001 3.141   0.2   0.00        0.0
 20: -0.001 3.141   0.3   0.00        0.0
 21: -0.001 3.142   0.3   0.00        0.0
 22: -0.000 3.142  -0.5   0.00        0.0
 23: -0.000 3.142  -0.1   0.00        0.0

As you can see, the agent only needs to apply very small forces in the right direction to keep the system stable. But if that does not happen, i.e. if the agent does not know what force to apply, then the environment loses its stability and large forces are needed to re-stabilize it. That means the algorithm has very little chance to learn the policy, because what it learns is not how to keep the environment stable but how to recover it from a highly unstable state. I have tried the following script, and also some reward reshaping to keep the reward between 0 and 1 (a sketch of such a wrapper appears at the end of this comment).

import argparse
import os
import time
import numpy as np
import gym
from gym_CartPole_BT.envs.cartpole_bt_env import CartPoleBTEnv

from ray.rllib.utils.test_utils import check_learning_achieved
from ray.rllib.agents.ppo import PPOTrainer
from ray.tune.registry import register_env

parser = argparse.ArgumentParser()
parser.add_argument(
    "--run",
    type=str,
    default="PPO",
    help="The RLlib-registered algorithm to use.")
parser.add_argument("--render", action="store_true", default=False)
parser.add_argument("--show", action="store_true", default=True)
parser.add_argument("--num-cpus", type=int, default=0)
parser.add_argument(
    "--framework",
    choices=["tf", "tf2", "tfe", "torch"],
    default="torch",
    help="The DL framework specifier.")
parser.add_argument("--eager-tracing", action="store_true")
parser.add_argument("--use-prev-action", action="store_true", default=True)
parser.add_argument("--use-prev-reward", action="store_true", default=True)
parser.add_argument(
    "--as-test",
    action="store_true",
    help="Whether this script should be run as a test: --stop-reward must "
    "be achieved within --stop-timesteps AND --stop-iters.")
parser.add_argument(
    "--stop-iters",
    type=int,
    # default=10000,
    default=100,
    help="Number of iterations to train.")
parser.add_argument(
    "--stop-timesteps",
    type=int,
    # default=40000000,
    default=4000000,
    help="Number of timesteps to train.")
parser.add_argument(
    "--stop-reward",
    type=float,
    default=600.0,
    help="Reward at which we stop training.")

# env_config = {'disturbances': 'low'}
env_config = {}

def gym_cartpole_bt(env_config):
    return CartPoleBTEnv(env_config=env_config)

if __name__ == "__main__":
    import ray
    from ray import tune

    args = parser.parse_args()

    ray.init(num_cpus=args.num_cpus or None)

    register_env("gym_cartpole_bt_low", gym_cartpole_bt)

    configs = {
        "PPO": {
            "num_sgd_iter": 5,
            "clip_param": 0.3,
            "vf_clip_param": 10.0,

        },
    }

    config = dict(
        configs[args.run],
        **{
            "env": "gym_cartpole_bt_low",
            # Use GPUs iff `RLLIB_NUM_GPUS` env var set to > 0.
            "num_gpus": int(os.environ.get("RLLIB_NUM_GPUS", "0")),
            "model": {
                "use_lstm": True,
                "lstm_cell_size": 128,
                "max_seq_len": 5,
                "fcnet_hiddens": [128,128],
                "post_fcnet_hiddens": [128,128],
                "lstm_use_prev_action": args.use_prev_action,
                "lstm_use_prev_reward": args.use_prev_reward,
            },
            "framework": args.framework,
            # Run with tracing enabled for tfe/tf2?
            "eager_tracing": args.eager_tracing,
        })

    stop = {
        "training_iteration": args.stop_iters,
        "timesteps_total": args.stop_timesteps,
        "episode_reward_mean": args.stop_reward,
    }

    results = tune.run(args.run, config=config, stop=stop, verbose=2, checkpoint_at_end=True)

    checkpoints = results.get_trial_checkpoints_paths(
        trial=results.get_best_trial("episode_reward_mean", mode="max"),
        metric="episode_reward_mean")

    checkpoint_path = checkpoints[0][0]
    trainer = PPOTrainer(config)
    trainer.restore(checkpoint_path)

    # Inference loop.
    env = gym_cartpole_bt(env_config=env_config)
    obs = env.reset()
    # range(2) b/c h- and c-states of the LSTM.
    lstm_cell_size = 128
    init_state = state = [
            np.zeros([lstm_cell_size], np.float32) for _ in range(2)
    ]

    # Run manual inference loop for n episodes.
    for _ in range(10):
        episode_reward = 0.0
        reward = 0.0
        done = False
        obs = env.reset()
        state = init_state
        prev_a = 0
        prev_r = 0.0

        while not done:
            a, state_out, _ = trainer.compute_single_action(obs, state, prev_action=prev_a, prev_reward=prev_r)
            # a, state_out, _ = trainer.compute_single_action(obs, state)
            obs, reward, done, _ = env.step(a)
            episode_reward += reward
            prev_a = a
            prev_r = reward
            state = state_out
            if args.render:
                env.render()
            if args.show:
                print(f"{env.time_step:3d}: {a[0]:5.1f} {reward:6.2f} {episode_reward:10.1f}")

        print(f"Episode reward={episode_reward}")

    if args.as_test:
        check_learning_achieved(results, args.stop_reward)
    ray.shutdown()

The following is an example of a replay:

[replay plot]

You can see that the action varies a lot. I need to think more about this problem.
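
For the reward reshaping mentioned above, here is a minimal sketch of a wrapper that squashes the raw (non-positive) reward into the range [0, 1]; the class name and scaling constant are assumptions, not code from either repository:

import numpy as np
import gym

class UnitRewardWrapper(gym.RewardWrapper):
    """Hypothetical wrapper: map rewards in (-inf, 0] into (0, 1]."""

    def __init__(self, env, scale=0.1):
        super().__init__(env)
        self.scale = scale  # assumed constant controlling how quickly penalties saturate

    def reward(self, reward):
        return float(np.exp(self.scale * reward))

# Usage: env = UnitRewardWrapper(CartPoleBTEnv(env_config=env_config))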

billtubbs commented 2 years ago

Very nice! Thanks for running these simulations and your interest in this.

I agree, this highlights the problem of trying to control an unstable system such as an inverted pendulum. Without an existing controller in place to stabilize the system, it is very difficult to collect data in the region of the state space that matters for stabilization, even when the system is reset to the desired point at the start of each episode. This is why I think learning the cart-pole control problem is a lot harder than some people think.

One of my goals is to collect state information during the RL training process to highlight this point. What would be the best way to do this? Is there a way to log such data by the agent or would I have to add it to the environment?

mg64ve commented 2 years ago

Does this instability depend on the physical attributes of the environment? If so, we could think about training this environment across a range of physical attribute values, gradually increasing them. This is just an idea; I am not sure if it works. As for logging, you can do it in two ways. The first is to change the environment code and add a log-to-file feature. The second is to use the RLlib callback feature. The first is more feasible. By the way, I am not sure if you know that RLlib has a potential PPO loss function bug, which I am not sure has been fixed:

https://github.com/ray-project/ray/issues/19291

It would be interesting to try another RL framework:

https://github.com/thu-ml/tianshou
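
A rough sketch of the RLlib callback approach mentioned above, for collecting the states visited during training; the exact callback signatures vary between Ray versions, so treat this as an assumption rather than tested code:

import numpy as np
from ray.rllib.agents.callbacks import DefaultCallbacks

class StateLoggingCallbacks(DefaultCallbacks):
    """Collect the raw observations seen during each training episode."""

    def on_episode_step(self, *, worker, base_env, episode, **kwargs):
        # Record the most recent observation of the (single) agent
        episode.user_data.setdefault("observations", []).append(
            episode.last_observation_for())

    def on_episode_end(self, *, worker, base_env, policies, episode, **kwargs):
        # Dump the collected trajectory to disk, e.g. as a .npy file
        np.save(f"episode_{episode.episode_id}_obs.npy",
                np.array(episode.user_data["observations"]))

# Enabled by adding {"callbacks": StateLoggingCallbacks} to the trainer config.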

billtubbs commented 2 years ago

The system is inherently unstable because the pendulum is inverted. If you run the system in open loop (without control), the errors grow exponentially, as you can see when you run this test script, which simulates the environment with the action set to zero:

python test_run.py -r -e CartPole-BT-dL-v0
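
Equivalently, stepping the environment open loop by hand looks roughly like this (again assuming the package registers its environments on import):

import numpy as np
import gym
import gym_CartPole_BT  # assumed to register the CartPole-BT-* environments

env = gym.make('CartPole-BT-dL-v0')
obs = env.reset()
done = False
while not done:
    # zero force: the pendulum falls and the angle error grows roughly exponentially
    obs, reward, done, info = env.step(np.zeros(env.action_space.shape))
    print(obs[2])  # pole angle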

None of the environment parameters will change this much, but obviously, if you increase the length of the pendulum, the errors would grow at a slower rate giving the agent more time to correct. Another thing that might make it a bit easier is to widen the constraints on the magnitude of the control action. But you can't get away from the fact that it is an unstable system. That is why it is used as a benchmark in control engineering studies.

What I find interesting is that, in this environment, there is no problem of delayed rewards. In fact, the agent receives the true, accurate reward (the negative squared difference between the state and the target state) at every timestep. Despite this, and the fact that the optimal controller is a very simple linear state-feedback controller, the RL algorithms take a long time to solve it. I wonder if it's because they are designed for much more difficult problems (delayed rewards, very high-dimensional / stochastic / discontinuous state spaces, etc.) and these capabilities actually make them less able to solve simple control problems quickly.
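
For reference, the simple linear state-feedback controller mentioned above is essentially an LQR, similar to what test_run_lqr.py uses. Here is a minimal sketch of how such a gain can be computed with SciPy; the A and B matrices are placeholders, not the environment's actual linearization:

import numpy as np
from scipy.linalg import solve_continuous_are

# Placeholder linearized dynamics x_dot = A x + B u about the upright equilibrium
A = np.array([[0., 1.,  0., 0.],
              [0., 0., -1., 0.],
              [0., 0.,  0., 1.],
              [0., 0., 11., 0.]])
B = np.array([[0.], [1.], [0.], [-1.]])

Q = np.eye(4)  # state penalty
R = np.eye(1)  # control penalty

P = solve_continuous_are(A, B, Q, R)   # solve the continuous-time Riccati equation
K = np.linalg.solve(R, B.T @ P)        # optimal gain, giving u = -K (x - x_target)

def lqr_policy(x, x_target=np.array([0., 0., np.pi, 0.])):
    """Linear state-feedback law around the target state."""
    return (-K @ (np.asarray(x) - x_target))[0]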

Thanks for informing me about the RLlib bug. That's interesting. I previously tested the PPO and SAC agents from Stable Baselines and got similar results. I see there is now a Stable Baselines3, so it would be good to compare results with both.

From your experience, how much parameter optimization is needed with these environments? Is it acceptable to run them with the defaults, or could some hyperparameter tuning (e.g. the learning rate) greatly improve the performance?
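
A learning-rate sweep, for example, could be expressed with Ray Tune's grid_search on top of the earlier tune.run example; the values below are arbitrary placeholders:

from ray import tune

config = {
    "env": "gym_cartpole_bt_low",
    "framework": "torch",
    "lr": tune.grid_search([1e-5, 5e-5, 1e-4, 5e-4]),  # arbitrary candidate learning rates
}
analysis = tune.run("PPO", config=config, stop={"training_iteration": 100})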

mg64ve commented 2 years ago

> The main differences are:
>
>   1. In this environment, the horizontal position of the cart must also be controlled (i.e. kept close to zero)
>   2. The action space is continuous and constrained (I'm not sure how RL algorithms handle this).

Also, the reward method is different. If you look at the OpenAI Gym environment, it gives a reward of 1.0 for each step taken because it does not have a position constraint. Honestly, I can't tell you whether parameter optimization can solve the issue. In my opinion it should help improve performance, but only if applied to an already well-formed problem. I would first ease some of the constraints and try to solve an easier problem, then apply optimization and see if it improves performance, and then gradually re-apply the constraints or parameters that make the environment more unstable.