DLR-RM / stable-baselines3

PyTorch version of Stable Baselines, reliable implementations of reinforcement learning algorithms.
https://stable-baselines3.readthedocs.io
MIT License

What does the output of model.learn mean? #1934

Closed LeZhengThu closed 1 month ago

LeZhengThu commented 1 month ago

❓ Question

I'm using gymnasium version 0.29.1 and stable_baselines3 version 2.3.2. I'm working with a custom env and found that model.learn doesn't learn anything. So I tried following the simple examples with the 'CartPole-v1' env, but it still doesn't seem to work. Below is the code.

import time
import gymnasium as gym  # version 0.29.1
from stable_baselines3 import PPO    # version 2.3.2
from stable_baselines3.common.env_util import make_vec_env

# Parallel environments
vec_env = make_vec_env("CartPole-v1", n_envs=4)

start_time = time.time()
model = PPO("MlpPolicy", vec_env, verbose=1)
model.learn(total_timesteps=20)
print("--- %s seconds ---" % (time.time() - start_time))
# 7.58s

start_time = time.time()
model = PPO("MlpPolicy", vec_env, verbose=1)
model.learn(total_timesteps=200)
print("--- %s seconds ---" % (time.time() - start_time))
# 6.56s

start_time = time.time()
model = PPO("MlpPolicy", vec_env, verbose=1)
model.learn(total_timesteps=2000)
print("--- %s seconds ---" % (time.time() - start_time))
# 4.38s

I recorded the execution time, and there is not much difference between training for 20, 200, or 2000 iterations. This doesn't make any sense to me. In addition, I don't know how to interpret the output shown below; I can't tell if it's telling me anything.

[screenshot of the training logger output]


araffin commented 1 month ago

Hello, total_timesteps=2000 is not the number of iterations but the minimum total number of steps taken in the env (you can see iterations=1 in the logger). I would recommend taking a look at the RL Zoo and the tuned hyperparameters for PPO on CartPole; you need to let it train longer (at least 20_000 steps to get behavior better than random). You should also learn more about PPO (we have links to resources in our doc).
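
[Editor's note: a minimal sketch of the suggested fix. The 50_000-step budget is an illustrative choice, not a tuned value; the key point is that with n_envs=4 and PPO's default n_steps=2048, a single iteration already collects 4 * 2048 = 8192 env steps, which is why the three timing runs above take roughly the same wall time.]

import gymnasium as gym
from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.evaluation import evaluate_policy

vec_env = make_vec_env("CartPole-v1", n_envs=4)
model = PPO("MlpPolicy", vec_env, verbose=1)

# learn() always completes whole rollouts, so even total_timesteps=20
# triggers one full iteration of n_envs * n_steps = 4 * 2048 = 8192 steps.
model.learn(total_timesteps=50_000)

# Sanity check: CartPole-v1 is considered solved around a mean reward of 475+.
mean_reward, std_reward = evaluate_policy(model, vec_env, n_eval_episodes=10)
print(f"mean reward: {mean_reward:.1f} +/- {std_reward:.1f}")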

LeZhengThu commented 1 month ago

@araffin Hello, thanks for sharing. I set total_timesteps to small numbers to check the functionality of my env. I also get your point now: when I set total_timesteps to 10000, the training works correctly. In addition, I wonder what the difference is between the parameters n_steps and n_epochs in PPO? And is this link https://stable-baselines3.readthedocs.io/en/master/modules/ppo.html the correct resource to learn PPO?
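
[Editor's note: the thread doesn't spell this out, so for context, per the SB3 PPO documentation: n_steps is the rollout length, i.e. how many env steps are collected per environment before each update, while n_epochs is how many optimization passes PPO makes over that collected batch. A sketch with the documented defaults written out explicitly:]

from stable_baselines3 import PPO

model = PPO(
    "MlpPolicy",
    "CartPole-v1",
    n_steps=2048,   # env steps collected per environment per rollout
    n_epochs=10,    # optimization passes over each collected rollout batch
    batch_size=64,  # minibatch size used within each epoch
    verbose=1,
)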

araffin commented 1 month ago

I set total_timesteps to small numbers to check the functionality of my env.

You have check_env() for that (it's also documented).
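
[Editor's note: a minimal sketch of that check; check_env lives in stable_baselines3.common.env_checker, and CartPole here is only a placeholder for the custom env.]

import gymnasium as gym
from stable_baselines3.common.env_checker import check_env

env = gym.make("CartPole-v1")  # placeholder: use your custom env instance here

# Warns or raises if the env deviates from the Gymnasium API
# (space types, reset/step signatures, observation dtypes, ...).
check_env(env, warn=True)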

And is this link https://stable-baselines3.readthedocs.io/en/master/modules/ppo.html the correct resource to learn PPO?

Yes, and also https://stable-baselines3.readthedocs.io/en/master/guide/rl.html