Closed llewynS closed 1 month ago
After a lot of digging around, I've found that the policy is created on the CPU and then moved to the selected device in the on-policy algorithm class at this line.
This works for the action distributions that don't use use_sde, but with use_sde the action distribution keeps its own exploration matrix, which is not moved along with the policy. A workaround to get it to work is to do this:
from stable_baselines3 import PPO
import gymnasium as gym
from stable_baselines3.common.vec_env import DummyVecEnv
from torch import tensor
# from rlos.rl.core.concreteclasses.wrappedlibs.SB3 import PPOSB3Agent
OPEN_DIVERSE_G = [[1, 1, 1, 1, 1, 1, 1],
                  [1, "c", "c", 0, "c", "c", 1],
                  [1, 1, 1, 1, 1, 1, 1]]
model_kwargs = {"verbose": 1,
                "learning_rate": 3e-5,
                "policy_kwargs": dict(net_arch=[256, 256]),
                "gamma": 0.95,
                "device": "cuda",
                "vf_coef": 0.5,
                "ent_coef": 0.0,
                "max_grad_norm": 0.5,
                "normalize_advantage": True,
                "n_steps": 512,
                "n_epochs": 60,
                "sde_sample_freq": 4,
                "use_sde": True,
                "gae_lambda": 0.9,
                "clip_range": 0.4}
env = gym.make('PointMaze_UMaze-v3', maze_map = OPEN_DIVERSE_G)
env = DummyVecEnv([lambda: env])
model = PPO("MultiInputPolicy", env, **model_kwargs)
observation = env.reset()
temp = {
    "observation": tensor(observation["observation"], device="cuda"),
    "achieved_goal": tensor(observation["achieved_goal"], device="cuda"),
    "desired_goal": tensor(observation["desired_goal"], device="cuda"),
}
model.policy.action_dist.exploration_mat = model.policy.action_dist.exploration_mat.to("cuda")
model.policy._predict(temp)
Hello,
why would you try to access the private method model.policy._predict(temp)?
Does it crash if you call .learn() before?
This looks similar to https://github.com/DLR-RM/stable-baselines3/issues/44 but should have been fixed in https://github.com/DLR-RM/stable-baselines3/pull/45
Maybe calling reset_noise() before predict would solve that.
Also, I would recommend using deterministic=True at test time; gSDE is meant to improve the smoothness of the action noise during training.
Hi,
As you suggested to me in this feature request, it is so I can use tensors directly without having to detach them or move them to the CPU.
Yes, where I actually use it in my code I call model.learn(0) first and still get the error.
model.policy.action_dist.exploration_mat = model.policy.action_dist.exploration_mat.to("cuda")
resolves the issue, but in my opinion it would be better if the library did not require this.
Hello,
as I wrote, it seems that calling model.policy.reset_noise() before the first predict solves the issue (btw, the provided code was not working):
import gymnasium as gym
import numpy as np
import torch as th
from stable_baselines3 import PPO
env = gym.make("Pendulum-v1")
model = PPO("MlpPolicy", env, use_sde=True, seed=1, verbose=1)
obs, _ = env.reset()
device = model.device
# Single observation
tensor = th.as_tensor(obs[np.newaxis, ...]).to(device)
# Multiple observations
multi_obs = th.cat([tensor] * 5, dim=0).to(device)
# Sample noise for gSDE on the correct device
model.policy.reset_noise()
with th.no_grad():
    model.policy._predict(tensor)
    model.policy._predict(multi_obs)
🐛 Bug
I get a device mismatch error when attempting to use PPO with a multi-input (dict) observation.
This was when calling:
Looking at stable baselines the issue comes from this code:
After some more digging, the cause is actually this class: it doesn't define a device. When the network is created, standard Python types are passed in, so it just gets put on the CPU. This class needs to be modified to accept a device and use the device of the model it is being used with.
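For readers following along, here is a minimal sketch of the general PyTorch behavior that seems to be at play. It is not stable-baselines3 code: ToyNoiseDist and ToyPolicy are hypothetical stand-ins, and the "meta" device is used as the target only so the snippet runs without a GPU (swap in "cuda" on a machine that has one). It assumes the distribution object is a plain Python attribute of the policy rather than an nn.Module.

```python
import torch as th
from torch import nn


class ToyNoiseDist:
    """Hypothetical stand-in for a distribution object; not an nn.Module."""

    def __init__(self, latent_dim: int, action_dim: int):
        # Plain tensor attribute, created on the CPU by default
        self.exploration_mat = th.zeros(latent_dim, action_dim)


class ToyPolicy(nn.Module):
    """Hypothetical policy that holds a distribution as a regular attribute."""

    def __init__(self):
        super().__init__()
        self.net = nn.Linear(4, 2)
        self.action_dist = ToyNoiseDist(4, 2)


# Assumption: "meta" stands in for "cuda" so this runs on CPU-only machines
target = th.device("meta")
policy = ToyPolicy().to(target)

# Parameters follow the module to the target device...
param_device = policy.net.weight.device
# ...but the tensor held by the plain attribute is left where it was created
stale_device = policy.action_dist.exploration_mat.device

# Explicit move, mirroring the workaround described in this issue
policy.action_dist.exploration_mat = policy.action_dist.exploration_mat.to(target)
moved_device = policy.action_dist.exploration_mat.device

print(param_device, stale_device, moved_device)
```

Module.to() only walks registered parameters, buffers and submodules, which is why a tensor stored on a non-module attribute has to be moved (or re-sampled on the right device) separately.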
To Reproduce
If you change the model_kwargs so that use_sde is False then it works as expected.