DLR-RM / stable-baselines3

PyTorch version of Stable Baselines, reliable implementations of reinforcement learning algorithms.
https://stable-baselines3.readthedocs.io
MIT License

[Question] Algorithm from Stable-Baselines 3 does not seem to learn #1826

Closed: PBerit closed this issue 7 months ago

PBerit commented 7 months ago

❓ Question

Hi all,

I built a simple custom environment with stable-baselines 3 and gymnasium following this tutorial: Shower_Environment. There is just one state variable, the temperature of a shower, which can be influenced by the action. The action has 3 options: 0 --> reduce temperature by 1, 1 --> keep temperature, 2 --> increase temperature. Additionally, random noise is added to the temperature state via self.state = self.state + random.randint(-1,1). The reward calculation is pretty simple: when the temperature is between 37 and 39, the agent gets 1 point, otherwise -1 point.

Here is the whole code:

from gymnasium import Env
from gymnasium.spaces import Discrete, Box

import numpy as np
import random

from stable_baselines3 import A2C

class ShowerEnv (Env):
    def __init__(self):
        #Define action space
        self.action_space = Discrete(3)

        #Define state space: Temperature range of the water temperature
        self.observation_space = Box (low=np.array([0]), high=np.array([100]))

        #Set starting water temperature
        self.state = 38 + random.randint(-1, 1)

        #Set the shower length in number of time slots
        self.shower_length = 60

    def step(self, action):

        #Apply action
        #Action can have the 3 discrete values (0, 1, 2)
        # When action is 0 --> temperature reduced by 1, When action is 1 --> temperature is kept, When action is 2 --> temperature is increased
        self.state = self.state + action - 1

        #Calculate reward
        if self.state >=37 and self.state <=39:
            reward = 1
        else:
            reward = -1

        #Apply random noise to temperature
        self.state = self.state + random.randint(-1,1)

        #Set placeholder for the info
        info = {}

        #Count down time of shower
        self.shower_length = self.shower_length - 1

        #Check if shower is finished
        terminated = False
        truncated  = False

        if self.shower_length <=0:
            terminated = True
            truncated  = True

        #Define observation
        observation = self.state

        return observation, reward, terminated, truncated , info

    def reset(self, *, seed=None, options=None):

        super().reset(seed=seed, options=options)

        self.state = 38
        self.shower_length = 60
        observation = self.state

        info = {}

        return observation, info

    def render(self):
        pass

#Create the environment
env = ShowerEnv()

#Test the environment and interact with it

for episode in range (1, 6):
    terminated = False
    state = env.reset()
    cumulative_reward = 0
    while not terminated:
        action= random.choice([0,1, 2])
        state, reward, terminated, truncated , info  = env.step(action)
        cumulative_reward = cumulative_reward + reward
        #print(f"action: {action}")
        #print(f"state: {state}")
        #print(f"reward: {reward}")
        #print(f"terminated: {terminated}")
        #print("")

    print(f"Reward Random Agent -  Episode {episode} = {cumulative_reward}")

#Define the model
model = A2C('MlpPolicy', env, verbose=1, learning_rate=0.0003,ent_coef=0.01)

# train and save the model
number_of_time_steps_training = 100000
model.learn(total_timesteps=number_of_time_steps_training)
model.save('Trained_RL_Models/Shower_Agent')

#Test the trained model
loaded_agent = A2C.load("Trained_RL_Models/Shower_Agent" )

cumulative_reward = 0
for episode in range (1, 6):
    terminated = False
    vec_env = model.get_env()
    observation = vec_env.reset()
    cumulative_reward = 0
    while not terminated:
        # Take an action
        action, _ = loaded_agent.predict(observation, deterministic=False)
        # Observe the resulting state and the reward
        state, reward, terminated, truncated , info = env.step(action)
        cumulative_reward = cumulative_reward + reward
    print(f"Reward Trained Agent -  Episode {episode} = {cumulative_reward}")

I compared the agent trained with A2C from stable-baselines 3 (100,000 steps) to just randomly choosing actions, and the results are equally bad:

Reward Random Agent -  Episode 1 = -58
Reward Random Agent -  Episode 2 = -46
Reward Random Agent -  Episode 3 = -40
Reward Random Agent -  Episode 4 = 16
Reward Random Agent -  Episode 5 = -30
---
Reward Trained Agent -  Episode 1 = -36
Reward Trained Agent -  Episode 2 = -48
Reward Trained Agent -  Episode 3 = -6
Reward Trained Agent -  Episode 4 = -10
Reward Trained Agent -  Episode 5 = -50

So it seems that the agent is not learning anything with A2C in my example, which makes me assume that there is something wrong with the way I apply the stable-baselines 3 algorithm. Can you think of a reason why this is happening?


araffin commented 7 months ago

Hello, make sure to have a look at https://stable-baselines3.readthedocs.io/en/master/guide/rl_tips.html (especially the two videos).

Some remarks:

Apart from that, we clearly state that we don't do technical support (in the readme, in the issue template, ...), so I will close this issue.

PBerit commented 7 months ago

@araffin: Thanks for your answer,

here are my comments on your remarks:

I also tried to use another reward system, so the problem is really clear and you can clearly see how the agent gets its reward. Still, the stable-baselines 3 algorithms don't seem to learn anything.

araffin commented 7 months ago

make sure to have a look at https://stable-baselines3.readthedocs.io/en/master/guide/rl_tips.html (especially the two videos).

from this: normalize or at least use VecNormalize wrapper for PPO/A2C.
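
A minimal sketch of how the VecNormalize wrapper could be applied to this environment (assuming the ShowerEnv class defined above; the exact hyperparameters are only illustrative):

from stable_baselines3 import A2C
from stable_baselines3.common.vec_env import DummyVecEnv, VecNormalize

# Wrap the custom env in a vectorized env, then normalize observations and rewards
vec_env = DummyVecEnv([lambda: ShowerEnv()])
vec_env = VecNormalize(vec_env, norm_obs=True, norm_reward=True)

model = A2C('MlpPolicy', vec_env, verbose=1)
model.learn(total_timesteps=100_000)

# The normalization statistics have to be saved alongside the model for later evaluation
vec_env.save('vec_normalize.pkl')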

PBerit commented 7 months ago

"from this: normalize or at least use VecNormalize wrapper for PPO/A2C." --> I normalized it now by dividing by 100 (max value). So the observation space is now between 0 and 1. This did not help to get better results. In the tutorial they are using another library for solving the problem and the results are quite good. I think there is something wrong with the connection between the gymnasium environment and the stable-baselines 3 algorithms. The problem is really simple and it is obvious what the agent should do (there are only 3 discrete actions possible). Still, the results uisng stable-baselined 3 are extremely bad (even significantly worse than random guessing). I tried different reward systems but I always get really bad results.

araffin commented 7 months ago

"did you try other algorithms like PPO?" -

I used DQN with:

from stable_baselines3 import DQN
model = DQN("MlpPolicy", env, verbose=1).learn(100_000, progress_bar=True)

and it reaches a mean reward around 60 at test time.
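
A sketch of how such a test-time mean reward could be measured (assuming the standard evaluate_policy helper from stable-baselines 3):

from stable_baselines3.common.evaluation import evaluate_policy

mean_reward, std_reward = evaluate_policy(model, env, n_eval_episodes=20)
print(f"mean reward: {mean_reward:.1f} +/- {std_reward:.1f}")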

Same with PPO:


from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.evaluation import evaluate_policy

vec_env = make_vec_env(ShowerEnv, n_envs=4)
model = PPO("MlpPolicy", vec_env, n_epochs=4, verbose=1)
model.learn(200_000, progress_bar=True)
mean_reward, std_reward = evaluate_policy(model, env)

with correct truncation and normalization (dividing by 37). I also had to fix the shape of the observation; it was failing the env checker (please read the documentation).
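
A minimal sketch of what those fixes might look like (the class name FixedShowerEnv is only illustrative; the assumptions are that the observation is returned as a float32 numpy array of shape (1,) divided by 37, and that the time limit sets truncated rather than terminated):

import random

import numpy as np
from gymnasium import Env
from gymnasium.spaces import Discrete, Box
from stable_baselines3.common.env_checker import check_env

class FixedShowerEnv(Env):
    def __init__(self):
        self.action_space = Discrete(3)
        # Explicit shape and dtype; the temperature is divided by 37 before being returned
        self.observation_space = Box(low=0.0, high=100.0 / 37.0, shape=(1,), dtype=np.float32)
        self.state = 38
        self.shower_length = 60

    def _get_obs(self):
        return np.array([self.state / 37.0], dtype=np.float32)

    def step(self, action):
        self.state = self.state + action - 1
        reward = 1 if 37 <= self.state <= 39 else -1
        self.state = self.state + random.randint(-1, 1)
        self.shower_length -= 1
        # The episode only ends because of the time limit: truncation, not termination
        terminated = False
        truncated = self.shower_length <= 0
        return self._get_obs(), reward, terminated, truncated, {}

    def reset(self, *, seed=None, options=None):
        super().reset(seed=seed, options=options)
        self.state = 38
        self.shower_length = 60
        return self._get_obs(), {}

# Validate the environment against the Gymnasium API used by SB3
check_env(FixedShowerEnv(), warn=True)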

PBerit commented 7 months ago

@araffin: Thanks for your answer,

I now also use a numpy array for the observation, and the environment checker does not complain any more. Furthermore, I now divide by 37. I trained several models using different algorithms from stable-baselines 3, but the results are still always bad: I get a cumulative mean reward of about -20. So I still have the strong feeling that the agent does not learn, as this problem is very easy and there are only 3 actions to choose from.

One thing I noticed using DQN is that during training, the console output of stable-baselines 3 for the rollout parameter "ep_rew_mean" very slowly increases to about 57.5. However, the end result using the trained agent is still very bad, at about -30. When looking at the rollout parameter "ep_rew_mean" for PPO, the improvement is extremely slow, leading to an ep_rew_mean of about 3 after 200,000 steps (which is way too many for this simple problem; in the tutorial they use 50,000 steps and get very good results).

What do you mean by "correct truncation"? In this example I don't think there is a difference between truncation and termination: after 60 time slots the episode simply terminates and a new one starts.
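
For reference, in the Gymnasium API the two flags mean different things: terminated signals that a real terminal state of the MDP was reached, while truncated signals that the episode was cut off externally, e.g. by a time limit; the distinction matters because the value of the last state can still be bootstrapped when an episode is merely truncated. A minimal sketch of the convention for a pure time limit (the helper name time_limit_flags is only illustrative):

def time_limit_flags(shower_length: int) -> tuple[bool, bool]:
    # A time limit is not a terminal state of the MDP: signal truncation, not termination
    terminated = False
    truncated = shower_length <= 0
    return terminated, truncated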