DLR-RM / stable-baselines3

PyTorch version of Stable Baselines, reliable implementations of reinforcement learning algorithms.
https://stable-baselines3.readthedocs.io
MIT License

[Feature Request] Add capability to log on a step-based interval in OffPolicyAlgorithm #1708


tobiabir commented 1 year ago

πŸš€ Feature

At the time of writing, the logging interval is controlled by the log_interval argument of the learn method, which only accepts integers. In OnPolicyAlgorithm it is the number of rounds (environment interaction + training steps) between logs, while in OffPolicyAlgorithm it is the number of episodes between logs.

Since episodes in general do not have a fixed length, or may not end at all, logging on an episode basis is not always practical (see Motivation). Could we add the capability to log on a step-based interval?
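
For concreteness, here is how the same log_interval value behaves today in the two cases (interval values are arbitrary):

from stable_baselines3 import PPO, SAC

# On-policy: log_interval counts rounds (rollout collection + training)
model = PPO("MlpPolicy", "CartPole-v1")
model.learn(total_timesteps=10_000, log_interval=4)  # log every 4 rounds

# Off-policy: log_interval counts completed episodes
model = SAC("MlpPolicy", "Pendulum-v1")
model.learn(total_timesteps=10_000, log_interval=4)  # log every 4 episodes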

Motivation

The main motivation is experiment tracking. It is good practice to run experiments multiple times with different random seeds and display training plots with confidence intervals or min and max. If you want to plot against environment steps (e.g. when you are interested in sample complexity), you can't really do that properly if the logs are not on a step basis. With variable episode lengths the logs of different runs are not aligned, and it is difficult to compute the confidence intervals (e.g. what should the value at step x be if one run logged at step x - 3 and another at step x + 5?). So it would be great if this could be added.

Note also that the documentation currently states for all algorithms that log_interval is "The number of episodes before logging.", which is not true for on-policy algorithms. Issue #725 is related to this.

Pitch

I propose we change the OffPolicyAlgorithm case to match the OnPolicyAlgorithm case. Logging on a step basis can then be achieved by using a step-based train_freq (see the sketch below). This has the additional benefit of more consistency between on-policy and off-policy algorithms.
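
To illustrate the proposal from the user side (hypothetical behavior under the proposed change, not what SB3 currently does):

from stable_baselines3 import SAC

# train_freq already accepts a step-based tuple. Under the proposal,
# log_interval would count training rounds, so the combination below
# would produce one log every 1000 environment steps (values arbitrary).
model = SAC("MlpPolicy", "Pendulum-v1", train_freq=(1000, "step"))
model.learn(total_timesteps=20_000, log_interval=1)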

While we are at it, I also propose to move logging to after training in both OnPolicyAlgorithm and OffPolicyAlgorithm. That way all the information of one round ends up in the same log. As it stands, the information from the last training steps is not logged.

Alternatives

The least invasive (to SB3) alternative is to modify the environments of interest to have a fixed episode length (see the wrapper sketch below). However, this might not be practical for all environments and seems ugly.
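
For example, a time-limit wrapper can cap the episode length; note that this only yields a truly fixed length if the environment never terminates earlier on its own:

import gymnasium as gym
from gymnasium.wrappers import TimeLimit

# Truncate every episode at 1000 steps (value arbitrary). Episodes only
# have a fixed length if the env never terminates early by itself.
env = TimeLimit(gym.make("Pendulum-v1"), max_episode_steps=1000)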

Another alternative would be to allow log_interval to be a tuple (e.g. (4, "episodes")) like train_freq. This also seems ugly.

Update: a custom callback works. See the discussion below.


araffin commented 1 year ago

Hello, have you considered callbacks as an alternative? (See the docs and the section on TensorBoard.) They should allow you to log every n steps or every k iterations.

tobiabir commented 1 year ago

Hi.

Thank you for the quick answer and for that pointer!

Indeed, with a custom callback, logging on a step basis can be done. I quickly tested this on SAC with the following callback.

import stable_baselines3 as sb3

class TensorboardCallback(sb3.common.callbacks.BaseCallback):
    def __init__(self, log_interval):
        super().__init__()
        self.log_interval = log_interval

    def _on_step(self) -> bool:
        # num_timesteps counts env.step() calls (times n_envs for vectorized envs)
        if self.model.num_timesteps % self.log_interval == 0:
            self.model._dump_logs()  # private SB3 helper, fine for a quick test
        return True  # returning False would abort training
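
For reference, it can be attached like this (the interval value is arbitrary):

model = sb3.SAC("MlpPolicy", "Pendulum-v1")
model.learn(total_timesteps=20_000, callback=TensorboardCallback(log_interval=1000))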

Some minor drawbacks remain with this approach, e.g. logging can end up happening more often than intended.

Would you still consider some of the proposed changes? I think they could clean things up and improve consistency. At the very least, I think the documentation of log_interval should be fixed, as it is currently incorrect for on-policy algorithms.

araffin commented 1 year ago

Adding a callback for step-based logging to the collection provided by SB3 would be a good addition, I think. Logging more often shouldn't be a problem (or it is solvable as you describe).

tobiabir commented 1 year ago

Could you elaborate on your vision for this? What set of features should this callback support? Should it be a minimal callback for step-based logging only, or a more general logging callback supporting all sorts of logging applications?

araffin commented 1 year ago

Could you elaborate on your vision for this?

Have a simple LogEveryNSteps callback that calls self.logger.dump() every n steps (n calls to env.step(); with multiple envs each call may correspond to more than one timestep, as shown in the doc).
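
A minimal sketch of such a callback, based on the description above (only the LogEveryNSteps name comes from this thread; the counter logic and parameter name are assumptions):

from stable_baselines3.common.callbacks import BaseCallback

class LogEveryNSteps(BaseCallback):
    def __init__(self, n_steps: int):
        super().__init__()
        self.n_steps = n_steps
        self.last_dump_step = 0

    def _on_step(self) -> bool:
        # With multiple envs, num_timesteps advances by n_envs per
        # env.step() call, so compare against the last dump instead
        # of relying on an exact modulo match.
        if self.model.num_timesteps - self.last_dump_step >= self.n_steps:
            self.logger.dump(step=self.model.num_timesteps)
            self.last_dump_step = self.model.num_timesteps
        return True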

general logging callback supporting all sorts of logging applications?

We leave general purpose/custom callbacks to the user.