DLR-RM / stable-baselines3

PyTorch version of Stable Baselines, reliable implementations of reinforcement learning algorithms.
https://stable-baselines3.readthedocs.io
MIT License

[Bug]: PPO using SDE device issue. #1957

Closed: llewynS closed this issue 1 month ago

llewynS commented 3 months ago

🐛 Bug

I get a device mismatch when attempting to use PPO with a MultiInput dict observation.

This was when calling:


with torch.no_grad():
    actions = myppo.policy._predict(inp_dict, deterministic=isTraining)
File "C:\Users\User\AppData\Roaming\Python\Python311\site-packages\stable_baselines3\common\distributions.py", line 597, in get_noise
    return th.mm(latent_sde, self.exploration_mat)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument mat2 in method wrapper_CUDA_mm)

Looking at the Stable Baselines3 source, the issue comes from this code:

def sample_weights(self, log_std: th.Tensor, batch_size: int = 1) -> None:
    """
    Sample weights for the noise exploration matrix,
    using a centered Gaussian distribution.

    :param log_std:
    :param batch_size:
    """
    std = self.get_std(log_std)
    self.weights_dist = Normal(th.zeros_like(std), std)
    # Reparametrization trick to pass gradients
    self.exploration_mat = self.weights_dist.rsample()
    # Pre-compute matrices in case of parallel exploration
    self.exploration_matrices = self.weights_dist.rsample((batch_size,))

Doing some more digging, the problem is actually in this class: it never specifies a device, and since only standard Python types are passed in when the network is created, the exploration matrices end up on the CPU. The class needs to be modified to take a device argument, or to use the device of the model it is attached to.
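
For illustration, a minimal sketch (not SB3 code) of the underlying PyTorch behaviour: tensors stored as plain attributes, rather than as parameters or registered buffers, are not moved by .to(device), so they stay wherever they were first created.

import torch as th
import torch.nn as nn

class NoisyLayer(nn.Module):
    """Hypothetical module holding a noise matrix as a plain tensor attribute."""
    def __init__(self):
        super().__init__()
        self.weight = nn.Parameter(th.randn(4, 4))  # moved by .to(device)
        self.exploration_mat = th.randn(4, 4)       # plain attribute, NOT moved

layer = NoisyLayer()
if th.cuda.is_available():
    layer.to("cuda")
    print(layer.weight.device)           # cuda:0
    print(layer.exploration_mat.device)  # cpu -> the same mismatch as in the traceback above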

To Reproduce

from stable_baselines3 import PPO
import gymnasium as gym
from stable_baselines3.common.vec_env import DummyVecEnv
from torch import tensor
# from rlos.rl.core.concreteclasses.wrappedlibs.SB3 import PPOSB3Agent
OPEN_DIVERSE_G = [[1, 1, 1, 1, 1, 1, 1],
                [1, "c", "c", 0, "c", "c", 1],
                [1, 1, 1, 1, 1, 1, 1]]
model_kwargs = {"verbose": 1, 
                "learning_rate": 3e-5,
                "policy_kwargs": dict(net_arch=[256,256]),
                "gamma": 0.95,
                "device": "cuda",
                "vf_coef": 0.5,
                "ent_coef": 0.0,
                "max_grad_norm": 0.5,
                "normalize_advantage": True,
                "n_steps": 512,
                "n_epochs": 60,
                "sde_sample_freq": 4,
                "use_sde": True,
                "gae_lambda": 0.9,
                "clip_range": 0.4}
env = gym.make('PointMaze_UMaze-v3', maze_map = OPEN_DIVERSE_G)
env = DummyVecEnv([lambda: env])
model = PPO("MultiInputPolicy", env, **model_kwargs)
observation = env.reset()
temp = {
    "observation": tensor(observation["observation"], device="cuda"),
    "achieved_goal": tensor(observation["achieved_goal"], device="cuda"),
    "desired_goal": tensor(observation["desired_goal"], device="cuda")
}
model.policy._predict(temp)

If you change the model_kwargs so that use_sde is False then it works as expected.
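
A quick way to check that with the snippet above (a sketch that reuses env, model_kwargs and temp from the reproduction code and only flips the flag):

# Hypothetical sanity check: the same setup with gSDE disabled runs on CUDA without error.
model_kwargs_no_sde = {**model_kwargs, "use_sde": False}
model_no_sde = PPO("MultiInputPolicy", env, **model_kwargs_no_sde)
model_no_sde.policy._predict(temp)  # no device mismatch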

Relevant log output / Error message

No response

System Info

No response


llewynS commented 3 months ago

After a lot of digging around, I've noted that the policy is created on the CPU and then moved to the selected device in the on-policy algorithm class at this line.

This works for the action distributions that don't use gSDE, but the gSDE one is a distribution rather than a module, so its exploration matrix is not moved along with the policy. A workaround to get it working is:

from stable_baselines3 import PPO
import gymnasium as gym
from stable_baselines3.common.vec_env import DummyVecEnv
from torch import tensor
# from rlos.rl.core.concreteclasses.wrappedlibs.SB3 import PPOSB3Agent
OPEN_DIVERSE_G = [[1, 1, 1, 1, 1, 1, 1],
                [1, "c", "c", 0, "c", "c", 1],
                [1, 1, 1, 1, 1, 1, 1]]
model_kwargs = {"verbose": 1, 
                "learning_rate": 3e-5,
                "policy_kwargs": dict(net_arch=[256,256]),
                "gamma": 0.95,
                "device": "cuda",
                "vf_coef": 0.5,
                "ent_coef": 0.0,
                "max_grad_norm": 0.5,
                "normalize_advantage": True,
                "n_steps": 512,
                "n_epochs": 60,
                "sde_sample_freq": 4,
                "use_sde": True,
                "gae_lambda": 0.9,
                "clip_range": 0.4}
env = gym.make('PointMaze_UMaze-v3', maze_map = OPEN_DIVERSE_G)
env = DummyVecEnv([lambda: env])
model = PPO("MultiInputPolicy", env, **model_kwargs)
observation = env.reset()
temp = {
    "observation": tensor(observation["observation"], device="cuda"),
    "achieved_goal": tensor(observation["achieved_goal"], device="cuda"),
    "desired_goal": tensor(observation["desired_goal"], device="cuda")
}
model.policy.action_dist.exploration_mat = model.policy.action_dist.exploration_mat.to("cuda")
model.policy._predict(temp)
araffin commented 3 months ago

Hello, why would you try to access the private method model.policy._predict(temp)? Does it crash if you call .learn() before?

This looks similar to https://github.com/DLR-RM/stable-baselines3/issues/44 but should have been fixed in https://github.com/DLR-RM/stable-baselines3/pull/45

Maybe calling reset_noise() before predict would solve that. Also, I would recommend using deterministic=True at test time; gSDE is meant to improve the smoothness of the action noise during training.
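
For reference, applying that suggestion to the reproduction script above would look roughly like this (a minimal sketch; model and temp are assumed from the earlier snippet):

import torch as th

# Re-sample the gSDE exploration matrices; the policy already lives on CUDA,
# so the new matrices are created on the correct device.
model.policy.reset_noise()
with th.no_grad():
    model.policy._predict(temp)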

llewynS commented 2 months ago

Hi,

As you suggested to me in this feature request, it is so I can use tensors directly without having to detach them or move them to the CPU.

Yes, where I actually use it in my code I call model.learn(0) first and still get the error.

model.policy.action_dist.exploration_mat = model.policy.action_dist.exploration_mat.to("cuda") resolves the issue, but it would be better if the library did not require this, in my opinion.

araffin commented 2 months ago

Hello, as I wrote, it seems that calling model.policy.reset_noise() before the first predict solves the issue (by the way, the provided code was not working):


import gymnasium as gym
import numpy as np
import torch as th
from stable_baselines3 import PPO

env = gym.make("Pendulum-v1")
model = PPO("MlpPolicy", env, use_sde=True, seed=1, verbose=1)
obs, _ = env.reset()

device = model.device

# Single observation
tensor = th.as_tensor(obs[np.newaxis, ...]).to(device)
# Multiple observations
multi_obs = th.cat([tensor] * 5, dim=0).to(device)

# Sample noise for gSDE on the correct device
model.policy.reset_noise()
with th.no_grad():
    model.policy._predict(tensor)
    model.policy._predict(multi_obs)