DLR-RM / stable-baselines3

PyTorch version of Stable Baselines, reliable implementations of reinforcement learning algorithms.
https://stable-baselines3.readthedocs.io
MIT License

SAC model does not log metrics on tensorboard #1554

Closed stelladk closed 1 year ago

stelladk commented 1 year ago

🐛 Bug

When using the tensorboard integration with SAC, no data are written to the events file. The model trains without problems and the metrics are correctly stored in the model's self.logger.name_to_value dictionary. However, the events.out.tfevents file produced for tensorboard does not contain any of that data; I have checked its size and it is very small, approximately 88 bytes. The same issue does not happen when using the PPO algorithm with the exact same code and configuration: it produces an events.out.tfevents file of 5649 bytes as expected, and the metrics are shown on tensorboard.

I have seen a similar issue here: https://github.com/DLR-RM/stable-baselines3/issues/1419 but it uses an older version. I have found a workaround for now: I use a callback and log the train metrics manually from the self.logger.name_to_value dictionary (a sketch of the idea is below). Still, this is a very strange issue. I am using a custom gymnasium environment and an alpha version of stable-baselines3 to be compatible with gymnasium. Thank you for maintaining this library!
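For reference, the workaround is roughly along these lines (a minimal sketch of the idea only; the callback name, the SummaryWriter usage and the dump frequency are illustrative, not the exact code):

from torch.utils.tensorboard import SummaryWriter
from stable_baselines3.common.callbacks import BaseCallback

class ManualTBCallback(BaseCallback):
    """Periodically copy whatever SB3 has recorded in logger.name_to_value
    to a SummaryWriter, bypassing the built-in dump mechanism."""

    def __init__(self, log_dir, every=1_000, verbose=0):
        super().__init__(verbose)
        self.writer = SummaryWriter(log_dir)
        self.every = every

    def _on_step(self) -> bool:
        if self.n_calls % self.every == 0:
            for key, value in self.model.logger.name_to_value.items():
                self.writer.add_scalar(key, value, self.num_timesteps)
        return True

    def _on_training_end(self) -> None:
        self.writer.close()

It is passed to learn() via the callback argument, e.g. model.learn(total_timesteps=timesteps, callback=ManualTBCallback(log_dir + "/manual")).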

Code example

import gymnasium as gym
from stable_baselines3 import SAC
from stable_baselines3.common.monitor import Monitor

def train_SAC_policy(env_name, timesteps=1_000_000):
    log_dir = f"models/SAC_{env_name}_t{timesteps}"
    env = gym.make(env_name, render_mode=None)
    env = Monitor(env, log_dir+"/train")

    model = SAC("MlpPolicy", env, verbose=1, tensorboard_log=log_dir)

    model.learn(total_timesteps=timesteps, tb_log_name="SAC", progress_bar=True)
    model.save(log_dir+"/model")

Relevant log output / Error message

No error occurring

Env Checker Result:
stable_baselines3/common/env_checker.py:422: UserWarning: We recommend you to use a symmetric and normalized Box action space (range=[-1, 1]) cf.

System Info

Python 3.8.16
Stable-Baselines3 2.0.0a13 (installed via pip install sb3_contrib)
Gymnasium 0.28.1
PyTorch 2.0.1+cu117
Tensorflow 2.12.0
Tensorboard 2.12.3
Numpy 1.24.3

Conda environment; libraries installed with conda 23.3.1 and pip 23.1.2.

Checklist

ronTohar1 commented 1 year ago

The same problem actually happened to me, just with the TD3/PPO models. My specs are similar: sb3 installed with pip install "stable_baselines3[extra]>=2.0.0a9", and a custom environment that is very simple: a multi-agent maze using only numpy.

Logging to tensorboard with A2C for example worked fine.

My environment uses a Box action space and a Dictionary observation space.

qgallouedec commented 1 year ago

I can't reproduce what you describe. I'm using exactly the same code (I've just added train_SAC_policy("Pendulum-v1", 10_000)) and I get correct tensorboard logging

[Screenshot: TensorBoard curves for the SAC run, 2023-06-15 at 10:17]

and

% ls -alt models/SAC_Pendulum-v1_t10000/SAC_1
total 16
drwxr-xr-x  5 quentingallouedec  staff   160 Jun 15 10:14 ..
-rw-r--r--  1 quentingallouedec  staff  5404 Jun 15 10:14 events.out.tfevents.1686816821.MacBook-Pro-de-Quentin.local.8647.0
drwxr-xr-x  3 quentingallouedec  staff    96 Jun 15 10:13 .

System info

stelladk commented 1 year ago

It has to do with the custom environment then? Is there a function we need to override for tensorboard integration with custom envs? I also noticed we have different Python versions. @ronTohar1 what Python version are you using?

qgallouedec commented 1 year ago

I also noticed we have different Python versions. @ronTohar1 what Python version are you using?

I've just tried with Python 3.8; it works as well.

Is there a function we need to override for tensorboard integration with custom envs?

No. Have you checked your custom env with the env checker? Does it work with Pendulum in your setting?
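For completeness, running the env checker directly looks like this (Pendulum-v1 is only a stand-in here; replace it with the custom env):

import gymnasium as gym
from stable_baselines3.common.env_checker import check_env

env = gym.make("Pendulum-v1")  # stand-in for the custom environment
check_env(env, warn=True)      # warns/raises if the spaces, reset() or step() look wrong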

araffin commented 1 year ago

My guess is that it is related to your custom env and that you also don't have any info in the terminal. The reason is that SAC logs things every 4 episodes by default, whereas PPO/A2C log every n steps. A solution is to force logging using a callback (see the documentation and the sketch below).

EDIT: my guess is that you have very long episodes
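A minimal sketch of such a forced-logging callback (the class name and dump frequency are illustrative, not taken from the documentation):

from stable_baselines3.common.callbacks import BaseCallback

class DumpLogsCallback(BaseCallback):
    """Flush the SB3 logger every dump_every env steps instead of waiting for
    an episode boundary (SAC only dumps every log_interval=4 episodes by default)."""

    def __init__(self, dump_every=1_000, verbose=0):
        super().__init__(verbose)
        self.dump_every = dump_every

    def _on_step(self) -> bool:
        if self.n_calls % self.dump_every == 0:
            self.logger.dump(step=self.num_timesteps)
        return True

Usage: model.learn(total_timesteps=timesteps, callback=DumpLogsCallback(), tb_log_name="SAC").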

ReHoss commented 1 year ago

@araffin, OK, I see. I guessed it was related to the datatype of the tensors, but in fact I bet it is related to the done value in step().

It works fine on my custom env, as I defined a proper done.

@stelladk Could you check whether done is well defined then, i.e. set to true at some point?

Then, if that is the reason, it would be nice to have the env checker verify that done is well implemented. But I guess this is not possible?

Thanks,

ronTohar1 commented 1 year ago

I am using Python 3.11.4, btw. What do you mean by a well-defined done? Do you mean that terminated or truncated eventually becomes true? Also, what bothers me is how A2C can log fine while PPO doesn't log at all. I can send a link to my repository with the code and the environment; it is very simple and the results will be easy to reproduce (the env is just an n x n board with agents and goals).

stelladk commented 1 year ago

Indeed, the environment does not change the terminated or truncated flags. However, I also noticed that Pendulum-v1 on Gymnasium does not change them either (return self._get_obs(), -costs, False, False, {}), and it also does not work in my case. The environment MountainCarContinuous-v0 has an actual break condition and logs normally on tensorboard.

ronTohar1 commented 1 year ago

Just to make it clear: my termination and truncation values are well defined and returned from the step function as expected, true when finished and false otherwise.

araffin commented 1 year ago

Indeed, the environment does not change the terminated or truncated flags. However, I also noticed that Pendulum-v1 on Gymnasium does not change them either (return self._get_obs(), -costs, False, False, {}), and it also does not work in my case. The environment MountainCarContinuous-v0 has an actual break condition and logs normally on tensorboard.

Pendulum, like most gym envs, has a time out (TimeLimit wrapper); it's defined when registering the env.
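For illustration, here is where that time limit comes from and how a custom env can get one (MyCustomEnv is a hypothetical placeholder):

import gymnasium as gym
from gymnasium.wrappers import TimeLimit

# Pendulum-v1 is registered with max_episode_steps=200, so gym.make() wraps it in
# TimeLimit and truncated becomes True after 200 steps, even though the env's own
# step() always returns False for both flags.
env = gym.make("Pendulum-v1")
print(env.spec.max_episode_steps)  # 200

# A custom env that never ends on its own can get the same behaviour:
# env = TimeLimit(MyCustomEnv(), max_episode_steps=200)
# or, at registration time:
# gym.register(id="MyMaze-v0", entry_point=MyCustomEnv, max_episode_steps=200)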

ronTohar1 commented 1 year ago

Well, I found what caused my problem: not seeing any logs on tensorboard nor on screen (stdout). I was calling PPO's learn like this:

agent.learn(total_timesteps=num_steps, log_interval=100, tb_log_name=name)

This worked fine with A2C and I saw all the logs. PPO didn't log, as I said, until I changed the code to:

agent.learn(total_timesteps=num_steps, log_interval=1, tb_log_name=name)

or removed the log_interval completely.

I don't know why, but I guess that PPO logs every 2048 steps?

So this helped me.

ReHoss commented 1 year ago

Well, I found what caused my problem: not seeing any logs on tensorboard nor on screen (stdout). I was calling PPO's learn like this:

agent.learn(total_timesteps=num_steps, log_interval=100, tb_log_name=name)

This worked fine with A2C and I saw all the logs. PPO didn't log, as I said, until I changed the code to:

agent.learn(total_timesteps=num_steps, log_interval=1, tb_log_name=name)

or removed the log_interval completely.

I don't know why, but I guess that PPO logs every 2048 steps?

So this helped me.

Removing log_interval just falls back to the default, log_interval=1.

https://github.com/DLR-RM/stable-baselines3/blob/d68ff2e17f2f823e6f48d9eb9cee28ca563a2554/stable_baselines3/common/on_policy_algorithm.py#L258-L281

The link above points to the code of the loop concerned. Why would 100 fail in your case? Which value did you set for total_timesteps? Could you check whether you get through the if condition at line 268?

ronTohar1 commented 1 year ago

Sorry for the late response. The value for total_timesteps was 100_000. What I think the problem is: if I put log_interval=1, the first output I get reports 2048 timesteps at the first logging. I don't know why, but I think it tries to log after 2048 * log_interval steps, because similarly, when log_interval=2, the same happens, just with 4096 steps at the first output logged onto the screen.

As you asked, I did check, and I do get through the if condition at line 268 as you would expect. The only occurrence of the number 2048 I could see in the debugger is the n_rollout_steps variable, which is, I guess, why I only see output after 2048 steps. I am not sure this is exactly what happens, but my guess is that self.collect_rollouts(...) simply does 2048 steps because of the n_rollout_steps variable, and that is why one iteration takes 2048 steps.
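That guess lines up with the on-policy loop linked above: the iteration counter increments once per collect_rollouts() call of n_steps environment steps (2048 by default for PPO), and the logger is only dumped when iteration % log_interval == 0. A quick sanity check with the values from this thread (a sketch, not SB3 code):

n_steps = 2048            # PPO's default rollout length (n_rollout_steps)
total_timesteps = 100_000

for log_interval in (1, 2, 100):
    first_dump = n_steps * log_interval  # step count of the first logged iteration
    reached = "reached" if first_dump <= total_timesteps else "never reached"
    print(f"log_interval={log_interval}: first dump at step {first_dump} ({reached})")

# log_interval=1   -> first dump at   2,048 steps (matches the first screen output)
# log_interval=2   -> first dump at   4,096 steps
# log_interval=100 -> first dump at 204,800 steps, past total_timesteps=100_000,
#                     so nothing is ever written to tensorboard or stdout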