DLR-RM / stable-baselines3

PyTorch version of Stable Baselines, reliable implementations of reinforcement learning algorithms.
https://stable-baselines3.readthedocs.io
MIT License

[Question] Custom action space with PPO #1046

Closed tfederico closed 2 years ago

tfederico commented 2 years ago

Question

Hello,

Is it possible to create a custom action space to use with PPO? From what I read in the documentation, only a limited set of Space classes is allowed. But that means I have to stick to sampling actions from a normal distribution (in the case of Box). I would like to test a different distribution for my custom env.

Additional context

In fact, I tried to create my own by subclassing gym.spaces.Space, but I got an error because it is not allowed.

I guess a workaround could be deriving from gym.spaces.Box and overriding its methods (e.g., sample()). However, when I tried it, the overridden method never seems to be called; the original one is used instead. Am I missing something here?

Many thanks in advance for your help!


araffin commented 2 years ago

Hello,

But that means that I have to adhere to sampling actions with a normal distribution (in the case of Box). I would like to test a different distribution for my custom env.

You mean using a beta distribution for the agent for instance? See https://github.com/DLR-RM/stable-baselines3/issues/955

The action space will still be the same (continuous action space, aka Box) but the distribution will be different. For that, you need to fork SB3 and update distributions.py.

Or are you talking about randomly sampling from the environment? (which is different from the probability distribution used by the agent).
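To illustrate the difference, a minimal sketch (Pendulum-v1 is just an example env here):

import gym
from stable_baselines3 import PPO

env = gym.make("Pendulum-v1")
model = PPO("MlpPolicy", env)

# Environment-side sampling: drawn from the action space itself,
# the agent (and its probability distribution) is not involved at all
random_action = env.action_space.sample()

# Agent-side sampling: drawn from the policy's probability distribution
# (a diagonal Gaussian by default for a Box action space)
obs = env.observation_space.sample()
agent_action, _ = model.predict(obs, deterministic=False)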

tfederico commented 2 years ago

So from what I gathered (please correct me if I am mistaken, I probably am :) ), Box defines a continuous action space, but when you call sample() the sampled action follows a normal distribution, based on the code.

If I override the sample() method, I should be able to change the distribution as well, right? I guess it's a cheeky way of changing the sampling distribution without actually touching SB3's distribution classes.

In the end, I would like to sample my actions from a truncated Gaussian rather than a full Gaussian.

tfederico commented 2 years ago

Something like

import gym
import numpy as np
from scipy.stats import truncnorm


class CheatingBox(gym.spaces.Box):
    def __init__(self, low, high, shape=None, dtype=np.float32):
        super().__init__(low=low, high=high, shape=shape, dtype=dtype)

    def sample(self):
        # NOTE: the sample size is hard-coded here for my env
        return truncnorm.rvs(self.low, self.high, size=18)
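One thing worth noting about scipy.stats.truncnorm (in case a non-standard mean/std is ever needed): its a and b arguments are expressed in standard deviations relative to loc and scale, so passing self.low/self.high directly only truncates at the raw bounds when loc=0 and scale=1. A rough sketch of the conversion (the helper name is just for illustration):

import numpy as np
from scipy.stats import truncnorm

def truncated_gaussian(low, high, loc=0.0, scale=1.0, size=None):
    # truncnorm expects the bounds in standardized units:
    # a = (low - loc) / scale, b = (high - loc) / scale
    a = (np.asarray(low) - loc) / scale
    b = (np.asarray(high) - loc) / scale
    return truncnorm.rvs(a, b, loc=loc, scale=scale, size=size)

# e.g. a Gaussian with mean 0.5 and std 2, truncated to [-1, 1]
truncated_gaussian(-1.0, 1.0, loc=0.5, scale=2.0, size=18)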
tfederico commented 2 years ago

Would this be a correct implementation of a half-Gaussian distribution? Honestly I just reverse-engineered the DiagGaussianDistribution class and changed proba_distribution and mode, so I am not sure whether it is conceptually correct :)

from typing import Tuple

import torch as th
from torch import nn
from torch.distributions import HalfNormal

from stable_baselines3.common.distributions import Distribution, sum_independent_dims


class HalfGaussianDistribution(Distribution):
    """
    Half-Gaussian distribution, for continuous actions.

    :param action_dim:  Dimension of the action space.
    """

    def __init__(self, action_dim: int):
        super().__init__()
        self.action_dim = action_dim
        self.mean_actions = None
        self.log_std = None

    def proba_distribution_net(self, latent_dim: int, log_std_init: float = 0.0) -> Tuple[nn.Module, nn.Parameter]:
        """
        Create the layers and parameter that represent the distribution:
        one output will be the mean of the Gaussian, the other parameter will be the
        standard deviation (log std in fact to allow negative values)

        :param latent_dim: Dimension of the last layer of the policy (before the action layer)
        :param log_std_init: Initial value for the log standard deviation
        :return:
        """
        mean_actions = nn.Linear(latent_dim, self.action_dim)
        # TODO: allow action dependent std
        log_std = nn.Parameter(th.ones(self.action_dim) * log_std_init, requires_grad=True)
        return mean_actions, log_std

    def proba_distribution(self, mean_actions: th.Tensor, log_std: th.Tensor) -> "HalfGaussianDistribution":
        """
        Create the distribution given its parameters (mean, std)

        :param mean_actions:
        :param log_std:
        :return:
        """
        action_std = th.ones_like(mean_actions) * log_std.exp()
        # NOTE: HalfNormal is parameterized by scale only; mean_actions is used
        # here just to get the right shape for the std tensor
        self.distribution = HalfNormal(action_std)
        return self

    def log_prob(self, actions: th.Tensor) -> th.Tensor:
        """
        Get the log probabilities of actions according to the distribution.
        Note that you must first call the ``proba_distribution()`` method.

        :param actions:
        :return:
        """
        log_prob = self.distribution.log_prob(actions)
        return sum_independent_dims(log_prob)

    def entropy(self) -> th.Tensor:
        return sum_independent_dims(self.distribution.entropy())

    def sample(self) -> th.Tensor:
        # Reparametrization trick to pass gradients
        return self.distribution.rsample()

    def mode(self) -> th.Tensor:
        return self.distribution.mode

    def actions_from_params(self, mean_actions: th.Tensor, log_std: th.Tensor, deterministic: bool = False) -> th.Tensor:
        # Update the proba distribution
        self.proba_distribution(mean_actions, log_std)
        return self.get_actions(deterministic=deterministic)

    def log_prob_from_params(self, mean_actions: th.Tensor, log_std: th.Tensor) -> Tuple[th.Tensor, th.Tensor]:
        """
        Compute the log probability of taking an action
        given the distribution parameters.

        :param mean_actions:
        :param log_std:
        :return:
        """
        actions = self.actions_from_params(mean_actions, log_std)
        log_prob = self.log_prob(actions)
        return actions, log_prob
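For reference, a quick standalone sanity check I ran (arbitrary dimensions, not plugged into PPO yet):

# Standalone check of the distribution (arbitrary dimensions, no policy involved)
dist = HalfGaussianDistribution(action_dim=3)
mean_net, log_std = dist.proba_distribution_net(latent_dim=64)

latent = th.zeros(2, 64)          # fake batch of latent features
mean_actions = mean_net(latent)   # shape (2, 3)

dist.proba_distribution(mean_actions, log_std)
actions = dist.sample()           # rsample() from the HalfNormal
log_prob = dist.log_prob(actions)
entropy = dist.entropy()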
araffin commented 2 years ago

a half-Gaussian distribution?

Looks OK; I'm just wondering in which context you would need a half-normal distribution?