Closed: PBerit closed this issue 7 months ago.
Hello, make sure to have a look at https://stable-baselines3.readthedocs.io/en/master/guide/rl_tips.html (especially the two videos).
Some remarks:
Apart from that, we clearly state that we don't do technical support (in the readme, in the issue template, ...), so I will close this issue.
@araffin : Thanks for your answer, here are my comments on your remarks:
I also tried to use another reward system. So the problem is really clear and you can clearly see how the agent gets rewarded. Still, the stable-baselines 3 algorithms don't seem to learn anything.
make sure to have a look at https://stable-baselines3.readthedocs.io/en/master/guide/rl_tips.html (especially the two videos).
from this: normalize or at least use VecNormalize wrapper for PPO/A2C.
"from this: normalize or at least use VecNormalize wrapper for PPO/A2C." --> I normalized it now by dividing by 100 (max value). So the observation space is now between 0 and 1. This did not help to get better results. In the tutorial they are using another library for solving the problem and the results are quite good. I think there is something wrong with the connection between the gymnasium environment and the stable-baselines 3 algorithms. The problem is really simple and it is obvious what the agent should do (there are only 3 discrete actions possible). Still, the results uisng stable-baselined 3 are extremely bad (even significantly worse than random guessing). I tried different reward systems but I always get really bad results.
"did you try other algorithms like PPO?" -
I used DQN with:
from stable_baselines3 import DQN
model = DQN("MlpPolicy", env, verbose=1).learn(100_000, progress_bar=True)
and it reaches a mean reward around 60 at test time.
Same with PPO:
from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.evaluation import evaluate_policy

vec_env = make_vec_env(ShowerEnv, n_envs=4)
model = PPO("MlpPolicy", vec_env, n_epochs=4, verbose=1)
model.learn(200_000, progress_bar=True)
mean_reward, std_reward = evaluate_policy(model, env)
with correct truncation and normalization (dividing by 37); I also had to fix the shape of the observation, which was failing the env checker (please read the documentation).
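A sketch of that kind of check (check_env is SB3's environment checker; ShowerEnv is the environment from the question below, and the space shown in the comment is an assumption):

import numpy as np
from stable_baselines3.common.env_checker import check_env

env = ShowerEnv()
check_env(env)  # warns/raises if observations do not match the declared observation space

# With e.g. observation_space = Box(low=0, high=100, shape=(1,), dtype=np.float32),
# reset() and step() have to return observations of matching shape and dtype:
obs = np.array([38.0], dtype=np.float32)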
@araffin : thanks for your answer,
I also now use a numpy array for the observation and the environment checker does not complain any more. Further, I divide by 37 now. I trained different models using different algorithms from stable-baselines 3. Still, the results are always bad: I get a cumulative mean reward of about -20. So I still have the strong feeling that the agent does not learn, as this problem is very easy and there are only 3 actions to choose from.
One thing I noticed using DQN is that during training, the console output of stable-baselines 3 for the rollout metric "ep_rew_mean" increases very slowly to about 57.5. However, the end result using the trained agent is still very bad, at about -30. When looking at "ep_rew_mean" for PPO, the improvement is extremely slow, which leads to an ep_rew_mean of about 3 after 200,000 steps (which is way too many for this simple problem; in the tutorial they use 50,000 steps for very good results).
What do you mean by "correct truncation"? In this example I don't think there is a difference between truncation and termination. Just after 60 timeslot the episode terminates and a new ones starts.
❓ Question
Hi all,
I built a simple custom environment with stable-baselines 3 and gymnasium from this tutorial: Shower_Environment. There is just one state variable, which is the temperature of a shower, and it can be influenced by the action. The action has 3 options: 0 --> reduce temperature by 1, 1 --> keep temperature, 2 --> increase temperature. There is further random noise added to the temperature state:
self.state = self.state + random.randint(-1,1)
The reward calculation is pretty simple: when the temperature is between 37 and 39, the agent gets 1 point, otherwise -1 point. Here is the whole code:
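(A minimal sketch of an environment along these lines, reconstructed from the description above; the initial temperature, bounds, and attribute names are assumptions, not the original code.)

import random
import numpy as np
import gymnasium as gym
from gymnasium import spaces

class ShowerEnv(gym.Env):
    # 0 --> decrease temperature by 1, 1 --> keep it, 2 --> increase it by 1
    def __init__(self):
        self.action_space = spaces.Discrete(3)
        self.observation_space = spaces.Box(low=0.0, high=100.0, shape=(1,), dtype=np.float32)
        self.state = 38.0
        self.shower_length = 60

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        self.state = 38 + random.randint(-3, 3)
        self.shower_length = 60
        return np.array([self.state], dtype=np.float32), {}

    def step(self, action):
        # apply the action (-1, 0 or +1) plus random noise to the temperature
        self.state += action - 1 + random.randint(-1, 1)
        self.shower_length -= 1
        # +1 while the temperature is in the comfortable band, -1 otherwise
        reward = 1 if 37 <= self.state <= 39 else -1
        terminated = False
        truncated = self.shower_length <= 0  # episode is cut off after 60 steps
        return np.array([self.state], dtype=np.float32), reward, terminated, truncated, {}

The normalization and observation-shape fixes discussed in the comments above would be applied on top of this.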
I compared the agent trained with A2C from stable-baselines for 100,000 steps against just randomly choosing actions, and the results are equally bad:
So it seems that the agent is not learning anything using A2C in my example, so I assume there is something wrong with the way I apply the stable-baselines 3 algorithm. Can you think of a reason why this is happening?
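For reference, a sketch of how such a comparison can be run (ShowerEnv as sketched above; the number of evaluation episodes is arbitrary):

import numpy as np
from stable_baselines3 import A2C
from stable_baselines3.common.evaluation import evaluate_policy

env = ShowerEnv()
model = A2C("MlpPolicy", env, verbose=1).learn(100_000, progress_bar=True)

# Trained agent
mean_reward, std_reward = evaluate_policy(model, env, n_eval_episodes=20)
print(f"A2C: {mean_reward:.1f} +/- {std_reward:.1f}")

# Random baseline
returns = []
for _ in range(20):
    obs, _ = env.reset()
    done, episode_return = False, 0
    while not done:
        obs, reward, terminated, truncated, _ = env.step(env.action_space.sample())
        episode_return += reward
        done = terminated or truncated
    returns.append(episode_return)
print(f"Random: {np.mean(returns):.1f} +/- {np.std(returns):.1f}")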