Open billtubbs opened 3 years ago
Hello @billtubbs, interesting project. Has this feature already been added, or is it still in the planning phase?
Hi @mg64ve. Thanks. I haven't added this feature yet. Currently there are only options for normal random disturbances of the states. See the ReadMe for the current versions of the environment that are installed. It also describes some other developments I am working on.
I recently added a partially-observable version. When I get some time I'm going to write a blog post about the performance of reinforcement learning algorithms on this control problem.
Can you comment more on what you would be interested in seeing?
I have a passion for ML/RL and I am studying ray/rllib. I noticed that all the examples in rllib are quite simple and do not include noise. So I came across your openai/gym environment and saw that it already contains random noise. I would like to know more about how you plan to add stochastic noise. I am also very interested in the results of your tests with RL. Would you share something before writing the blog post?
These environments are compatible with ray/rllib. You can try them yourself! You just need to register them in ray first as follows:
import gym
from gym_CartPole_BT.envs.cartpole_bt_env import CartPoleBTEnv
from ray import tune
from ray.rllib.agents.ppo import PPOTrainer
from ray.tune.registry import register_env

def env_creator(env_config):
    return CartPoleBTEnv(env_config)  # return an env instance

env_name = 'CartPole-BT-dL-v0'
register_env(env_name, env_creator)

# choose environment parameters
env_config = {
    'description': "Basic cart-pendulum system with low random disturbance",
    'disturbances': 'low'
}

analysis = tune.run(
    PPOTrainer,
    stop={"training_iteration": 100},
    config={"env": env_name, "env_config": env_config, "framework": "torch"}
)
I've only done a few simulations so far with default parameters. I would be interested to see if you can train a good RL agent on any of these environments. For me, it is taking a lot of training but I haven't done any hyper parameter tuning yet.
Honestly @billtubbs, I have already done some tests today with PPO+LSTM but I haven't got good results, so I need to understand more. With no random disturbance and no variance in the initial state, how does this gym environment differ from the traditional openai cartpole? It is almost night here, so I will continue my tests tomorrow.
The differences between this and OpenAI gym's cart pole are described on the ReadMe page.
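The main differences are:
In this environment, the horizontal position of the cart must also be controlled (i.e. kept close to zero)
The action space is continuous and constrained (I'm not sure how RL algorithms handle this).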
I believe this environment is very unstable by definition, @billtubbs. Basically, if I run your test_run_lqr.py I get the following:
Initializing environment 'CartPole-BT-dL-v0'...
k x theta u reward cum_reward
------------------------------------------
1: 0.000 3.142 -0.0 0.00 0.0
2: 0.000 3.142 0.2 0.00 0.0
3: 0.000 3.142 -0.6 0.00 0.0
4: 0.000 3.142 -0.1 0.00 0.0
5: -0.000 3.142 0.1 0.00 0.0
6: -0.000 3.141 0.1 0.00 0.0
7: -0.001 3.141 0.4 0.00 0.0
8: -0.001 3.141 -0.2 0.00 0.0
9: -0.001 3.141 -0.7 0.00 0.0
10: -0.001 3.141 0.7 0.00 0.0
11: -0.001 3.141 -0.1 0.00 0.0
12: -0.001 3.141 0.2 0.00 0.0
13: -0.002 3.141 0.1 0.00 0.0
14: -0.002 3.141 0.0 0.00 0.0
15: -0.002 3.141 0.5 0.00 0.0
16: -0.001 3.141 -0.5 0.00 0.0
17: -0.001 3.141 0.0 0.00 0.0
18: -0.001 3.141 0.1 0.00 0.0
19: -0.001 3.141 0.2 0.00 0.0
20: -0.001 3.141 0.3 0.00 0.0
21: -0.001 3.142 0.3 0.00 0.0
22: -0.000 3.142 -0.5 0.00 0.0
23: -0.000 3.142 -0.1 0.00 0.0
As you can see, the agent needs to apply a very small force in the right direction to keep the system stable. But if this does not happen, that is, if the agent does not know what force to apply, then the environment loses its stability and a large force must be applied to stabilize it again. That means the algorithm has very few chances to learn the policy, because what it learns is not how to keep the env stable but how to try to recover the env from a highly unstable state. I have tried the following script, and also some reward reshaping that keeps the reward between 0 and 1.
import argparse
import os
import time
import numpy as np
import gym
from gym_CartPole_BT.envs.cartpole_bt_env import CartPoleBTEnv
from ray.rllib.utils.test_utils import check_learning_achieved
from ray.rllib.agents.ppo import PPOTrainer
from ray.tune.registry import register_env

parser = argparse.ArgumentParser()
parser.add_argument(
    "--run",
    type=str,
    default="PPO",
    help="The RLlib-registered algorithm to use.")
parser.add_argument("--render", action="store_true", default=False)
parser.add_argument("--show", action="store_true", default=True)
parser.add_argument("--num-cpus", type=int, default=0)
parser.add_argument(
    "--framework",
    choices=["tf", "tf2", "tfe", "torch"],
    default="torch",
    help="The DL framework specifier.")
parser.add_argument("--eager-tracing", action="store_true")
parser.add_argument("--use-prev-action", action="store_true", default=True)
parser.add_argument("--use-prev-reward", action="store_true", default=True)
parser.add_argument(
    "--as-test",
    action="store_true",
    help="Whether this script should be run as a test: --stop-reward must "
    "be achieved within --stop-timesteps AND --stop-iters.")
parser.add_argument(
    "--stop-iters",
    type=int,
    # default=10000,
    default=100,
    help="Number of iterations to train.")
parser.add_argument(
    "--stop-timesteps",
    type=int,
    # default=40000000,
    default=4000000,
    help="Number of timesteps to train.")
parser.add_argument(
    "--stop-reward",
    type=float,
    default=600.0,
    help="Reward at which we stop training.")

# env_config = {'disturbances': 'low'}
env_config = {}

def gym_cartpole_bt(env_config):
    return CartPoleBTEnv(env_config=env_config)

if __name__ == "__main__":
    import ray
    from ray import tune

    args = parser.parse_args()
    ray.init(num_cpus=args.num_cpus or None)
    register_env("gym_cartpole_bt_low", gym_cartpole_bt)

    configs = {
        "PPO": {
            "num_sgd_iter": 5,
            "clip_param": 0.3,
            "vf_clip_param": 10.0,
        },
    }

    config = dict(
        configs[args.run],
        **{
            "env": "gym_cartpole_bt_low",
            # Use GPUs iff `RLLIB_NUM_GPUS` env var set to > 0.
            "num_gpus": int(os.environ.get("RLLIB_NUM_GPUS", "0")),
            "model": {
                "use_lstm": True,
                "lstm_cell_size": 128,
                "max_seq_len": 5,
                "fcnet_hiddens": [128, 128],
                "post_fcnet_hiddens": [128, 128],
                "lstm_use_prev_action": args.use_prev_action,
                "lstm_use_prev_reward": args.use_prev_reward,
            },
            "framework": args.framework,
            # Run with tracing enabled for tfe/tf2?
            "eager_tracing": args.eager_tracing,
        })

    stop = {
        "training_iteration": args.stop_iters,
        "timesteps_total": args.stop_timesteps,
        "episode_reward_mean": args.stop_reward,
    }

    results = tune.run(args.run, config=config, stop=stop, verbose=2, checkpoint_at_end=True)

    checkpoints = results.get_trial_checkpoints_paths(
        trial=results.get_best_trial("episode_reward_mean", mode="max"),
        metric="episode_reward_mean")
    checkpoint_path = checkpoints[0][0]
    trainer = PPOTrainer(config)
    trainer.restore(checkpoint_path)

    # Inference loop.
    env = gym_cartpole_bt(env_config=env_config)
    obs = env.reset()

    # range(2) b/c h- and c-states of the LSTM.
    lstm_cell_size = 128
    init_state = state = [
        np.zeros([lstm_cell_size], np.float32) for _ in range(2)
    ]

    # Run manual inference loop for n episodes.
    for _ in range(10):
        episode_reward = 0.0
        reward = 0.0
        done = False
        obs = env.reset()
        state = init_state
        prev_a = 0
        prev_r = 0.0
        while not done:
            a, state_out, _ = trainer.compute_single_action(obs, state, prev_action=prev_a, prev_reward=prev_r)
            # a, state_out, _ = trainer.compute_single_action(obs, state)
            obs, reward, done, _ = env.step(a)
            episode_reward += reward
            prev_a = a
            prev_r = reward
            state = state_out
            if args.render:
                env.render()
            if args.show:
                print(f"{env.time_step:3d}: {a[0]:5.1f} {reward:6.2f} {episode_reward:10.1f}")
        print(f"Episode reward={episode_reward}")

    if args.as_test:
        check_learning_achieved(results, args.stop_reward)

    ray.shutdown()
The following is an example of replay:
You can see the action varies a lot. I need to think more on this problem.
Very nice! Thanks for running these simulations and your interest in this.
I agree, this highlights the problem of trying to control an unstable system such as an inverted pendulum. Without some existing controller in place to stabilize the system, it is very difficult to collect data in the region of the state-space of interest for stabilizing the system, even when the system is reset to the desired point each episode. This is why I think learning the cart-pole control problem is a lot more difficult than some people think.
One of my goals is to collect state information during the RL training process to highlight this point. What would be the best way to do this? Is there a way to log such data by the agent or would I have to add it to the environment?
Does this instability depend on physical attributes of the environment? If so, we could think about training on this environment with gradually increasing values of those physical attributes. This is just an idea; I am not sure it would work. As for logging, you can do it in two ways. The first is to change the environment code and add a log-to-file feature. The second is to use the RLLIB callbacks feature. The first is more feasible. By the way, I am not sure if you know that RLLIB has a potential PPO loss function bug, and I am not sure whether it has been fixed:
https://github.com/ray-project/ray/issues/19291
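On the logging question, here is a minimal sketch of the first approach (logging on the environment side) that avoids editing the environment itself by wrapping it. It assumes the old gym API used elsewhere in this thread, where reset() returns obs and step() returns (obs, reward, done, info). StateLogger and DummyEnv are hypothetical names made up for illustration; DummyEnv only stands in for a real environment so the snippet runs on its own.

```python
import csv


class StateLogger:
    """Wraps an env with the old gym API and records every transition."""

    def __init__(self, env):
        self.env = env
        self.rows = []        # (episode, step, obs, action, reward) per step
        self.episode = 0
        self.step_count = 0

    def reset(self):
        self.episode += 1
        self.step_count = 0
        return self.env.reset()

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        self.step_count += 1
        self.rows.append(
            (self.episode, self.step_count, list(obs), action, reward))
        return obs, reward, done, info

    def save(self, path):
        """Write all recorded transitions to a CSV file."""
        with open(path, "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(["episode", "step", "obs", "action", "reward"])
            writer.writerows(self.rows)


class DummyEnv:
    """Hypothetical stand-in env, used only to exercise the logger."""

    def reset(self):
        self.x = 0.0
        return [self.x]

    def step(self, action):
        self.x += action
        return [self.x], -self.x ** 2, abs(self.x) > 1.0, {}


env = StateLogger(DummyEnv())
env.reset()
done = False
while not done:
    _, _, done, _ = env.step(0.5)
print(len(env.rows), "transitions logged")
```

To use something like this with RLLIB, the wrapping would happen inside the env_creator function passed to register_env. In practice it would probably need to subclass gym.Wrapper so that observation_space and action_space are forwarded to the trainer.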
It would be interesting to try another RL framework:
The system is inherently unstable because the pendulum is inverted. If you run the system in open loop (without control), the errors grow exponentially as you can see when you run this test script which simulates the environment with actions = 0:
python test_run.py -r -e CartPole-BT-dL-v0
None of the environment parameters will change this much, but obviously, if you increase the length of the pendulum, the errors would grow at a slower rate giving the agent more time to correct. Another thing that might make it a bit easier is to widen the constraints on the magnitude of the control action. But you can't get away from the fact that it is an unstable system. That is why it is used as a benchmark in control engineering studies.
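To illustrate the point about pendulum length, here is a rough sketch using a much simpler model than this environment: a point-mass inverted pendulum on a fixed pivot, linearized about the upright position, so that the angle error obeys theta'' = (g/L) * theta. This is an assumption made for illustration only, not the cart-pole dynamics implemented in the environment. Released from rest, the error then grows as theta0 * cosh(sqrt(g/L) * t).

```python
import math

G = 9.81  # gravitational acceleration, m/s^2


def open_loop_angle_error(theta0, length, t):
    """Angle error at time t for a linearized point-mass inverted pendulum
    released from rest with initial error theta0 (no control applied)."""
    rate = math.sqrt(G / length)   # unstable growth rate, 1/s
    return theta0 * math.cosh(rate * t)


def doubling_time(length):
    """Asymptotic time for the error to double: ln(2) / sqrt(g / L)."""
    return math.log(2.0) / math.sqrt(G / length)


for length in (0.5, 1.0, 2.0):
    err = open_loop_angle_error(0.001, length, 2.0)
    print(f"L={length:3.1f} m  error after 2 s: {err:8.4f} rad  "
          f"doubling time: {doubling_time(length):5.3f} s")
```

In this simplified model the growth rate scales with 1/sqrt(L), so a longer pendulum slows the divergence but the rate stays positive for any finite length.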
What I find interesting is that in this environment there is no problem of delayed rewards. In fact, the agent gets the true, accurate reward (the negative of the squared difference between the state and the target state) each timestep. Despite this, and the fact that the optimal controller is a very simple linear state feedback controller, the RL algorithms take a long time to solve it. I wonder if it's because they are designed for much more difficult problems (delayed rewards, very high dimensional / stochastic / discontinuous state spaces etc.) and these capabilities actually make them less able to solve simple control problems quickly.
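For readers who want to see the two ingredients mentioned here side by side, this is a minimal sketch of an unweighted quadratic reward and a clipped linear state feedback law u = -K (x - x_goal). The state ordering [x, x_dot, theta, theta_dot], the gain values in K and the limit u_max are all assumptions chosen for illustration. They are not taken from the environment's code and K is not an LQR solution.

```python
import math


def quadratic_reward(state, goal):
    """Negative squared distance between state and goal (unweighted)."""
    return -sum((s - g) ** 2 for s, g in zip(state, goal))


def linear_feedback(state, goal, gain, u_max):
    """u = -K (x - x_goal), clipped to the range [-u_max, u_max]."""
    u = -sum(k * (s - g) for k, s, g in zip(gain, state, goal))
    return max(-u_max, min(u_max, u))


# Assumed ordering: [x, x_dot, theta, theta_dot], upright at theta = pi
goal = [0.0, 0.0, math.pi, 0.0]
K = [-1.0, -2.0, 30.0, 6.0]      # hypothetical gain, for illustration only
state = [0.1, 0.0, 3.10, 0.0]    # slightly off target

print("reward:", quadratic_reward(state, goal))
print("action:", linear_feedback(state, goal, K, u_max=200.0))
```

The reward is zero only at the goal and is available at every timestep, which is the "no delayed rewards" property described above.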
Thanks for informing me about the RLLIB bug. That's interesting. I previously tested the PPO and SAC agents from stablebaselines and got similar results. I see there is now a stablebaselines3 so it would be good to compare results with both.
From your experience, how much parameter optimization is needed with these environments? Is it acceptable to run them with the defaults or could some hyper-parameter tuning greatly improve the performance (i.e. learning rate)?
The main differences are:
In this environment, the horizontal position of the cart must also be controlled (i.e. kept close to zero)
The action space is continuous and constrained (I'm not sure how RL algorithms handle this).
Also the reward method is different. If you look at the OpenAI gym environment, it gives a reward of 1.0 for each step taken, because it does not have a position constraint. Honestly, I can't tell you whether parameter optimization can solve the issue. In my opinion it should help performance, but only if you apply it to an already well-formed problem. I would try to ease some of the constraints and solve an easier problem first, then apply optimization and see if this improves performance, and then add back the constraints or parameters that make the environment more unstable.
I.e.
d(k) = (C(z)/D(z)) e(k)
where C(z) and D(z) are polynomials that can be specified and e(k) is the current random disturbance variable added to the angle state.
x3(k) = x3(k) + d(k)
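A minimal sketch of how such a filtered disturbance could be generated, assuming C and D are given as coefficient lists in powers of z^-1 (the backward shift) with the leading coefficient of D equal to 1. Under that assumption d(k) = C/D e(k) becomes the difference equation d(k) = c0 e(k) + c1 e(k-1) + ... - d1 d(k-1) - d2 d(k-2) - ... The function name make_disturbance_filter and the example coefficients are hypothetical.

```python
import random


def make_disturbance_filter(c, d):
    """Return a function mapping e(k) to d(k) for d(k) = C(z)/D(z) e(k).

    c, d: coefficients of C and D in powers of z^-1; d[0] must be 1.
    """
    if d[0] != 1.0:
        raise ValueError("D must be monic (d[0] == 1)")
    e_hist = [0.0] * len(c)          # e(k), e(k-1), ...
    d_hist = [0.0] * (len(d) - 1)    # d(k-1), d(k-2), ...

    def step(e_k):
        # Shift the input history and apply the numerator C
        e_hist.insert(0, e_k)
        e_hist.pop()
        d_k = sum(ci * ei for ci, ei in zip(c, e_hist))
        # Subtract the feedback terms from the denominator D
        d_k -= sum(dj * dp for dj, dp in zip(d[1:], d_hist))
        if d_hist:
            d_hist.insert(0, d_k)
            d_hist.pop()
        return d_k

    return step


# Example: C = 1, D = 1 - 0.9 z^-1, i.e. d(k) = 0.9 d(k-1) + e(k)
rng = random.Random(0)
colored = make_disturbance_filter([1.0], [1.0, -0.9])
theta = 3.1416
for k in range(5):
    e_k = rng.gauss(0.0, 0.01)       # white noise e(k)
    theta = theta + colored(e_k)     # x3(k) = x3(k) + d(k)
    print(f"{k}: theta = {theta:.4f}")
```

With C = D = 1 this reduces to the plain white-noise disturbance, so the existing behaviour would be a special case.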