HumanCompatibleAI / imitation

Clean PyTorch implementations of imitation and reward learning algorithms
https://imitation.readthedocs.io/
MIT License

Got an unexpected keyword argument 'use_sde' when passing behavioural cloning policy to PPO from SB3 #781

Open JkAcktuator opened 11 months ago

JkAcktuator commented 11 months ago

Bug description

Hello,

I want to pass a policy learned with behavioural cloning in the imitation library to PPO. I thought this would work since both policies derive from the ActorCriticPolicy class, but it does not behave as I expected.

Steps to reproduce

from stable_baselines3 import PPO
from imitation.algorithms import bc

bc_trainer = bc.BC(
    observation_space=env.observation_space,
    action_space=env.action_space,
    device='cuda',
    policy=bc.reconstruct_policy(policy_path, device='cuda'),
    rng=rng,
)
model = PPO(policy=bc_trainer.policy, env=env, verbose=1, device='cuda')

The error is:

Traceback (most recent call last):
  File "agent/main.py", line 142, in <module>
    model = PPO(policy=bc_trainer.policy, env=env, verbose=1, device='cuda')
  File "/home/repos/stable-baselines3/stable_baselines3/ppo/ppo.py", line 164, in __init__
    self._setup_model()
  File "/home/repos/stable-baselines3/stable_baselines3/ppo/ppo.py", line 167, in _setup_model
    super()._setup_model()
  File "/home/repos/stable-baselines3/stable_baselines3/common/on_policy_algorithm.py", line 120, in _setup_model
    self.policy = self.policy_class(  # pytype:disable=not-instantiable
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
TypeError: forward() got an unexpected keyword argument 'use_sde'

Environment

yojul commented 11 months ago

Hi, I had the same error when trying to retrain a policy with PPO after behaviour cloning. The problem is here: model = PPO(policy=bc_trainer.policy, env=env, verbose=1, device='cuda'), because the policy argument expects a class (or, alternatively, a string). The error message is unclear because it is raised by PyTorch.

So when instantiating PPO with SB3, you should pass the policy class you want to use (which should inherit from ActorCriticPolicy). For example: model = PPO(policy=ActorCriticPolicy, env=env, verbose=1, device='cuda')

This should work for instantiating PPO. However, I am not sure how you should load the pre-trained policy; I could not find the right way to do it in stable-baselines3 (I tried model.policy = bc_trainer.policy, but I am not sure whether it works properly).
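One option that might work, sketched below and untested: instead of replacing the policy object, copy its weights into the policy that PPO builds. This assumes the BC policy and the new PPO policy have identical architectures (same observation/action spaces and net_arch); PyTorch's load_state_dict will raise an error otherwise.

# Untested sketch: copy the BC policy's weights into the policy PPO just built.
# Assumes both policies share the same architecture.
model = PPO(policy=ActorCriticPolicy, env=env, verbose=1, device='cuda')
model.policy.load_state_dict(bc_trainer.policy.state_dict())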

Hope it helps somehow. Let me know if you find the right way to load a pre-trained policy with PPO algorithm 👍

AlexGisi commented 7 months ago

For those who stumble across this issue, the load_from_vector method seems to work:

from stable_baselines3 import PPO
from stable_baselines3.common.policies import ActorCriticPolicy

# Load the saved BC policy, then copy its parameters into a fresh PPO policy.
pretrained_policy = ActorCriticPolicy.load("/path/")
model = PPO(ActorCriticPolicy, env)
model.policy.load_from_vector(pretrained_policy.parameters_to_vector())
model.learn(total_timesteps=100_000, reset_num_timesteps=False)
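Note that load_from_vector copies a flat vector of parameters, so the two policies need identical architectures for this to be meaningful. A hypothetical sanity check (not from the original comment) that the copy actually happened:

import torch as th

# Hypothetical check: every PPO policy parameter should now equal the
# corresponding parameter of the loaded BC policy.
for p_src, p_dst in zip(pretrained_policy.parameters(), model.policy.parameters()):
    assert th.allclose(p_src.cpu(), p_dst.cpu())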
saeed349 commented 5 months ago

I had an issue with saving and loading BC models, and the following worked for me:

from imitation.algorithms import bc
from stable_baselines3.common import policies

# Saving
bc_model = bc.BC(
    observation_space=venv.observation_space,
    action_space=venv.action_space,
    demonstrations=transitions_custom,
    rng=rng,
)
bc_model.policy.save('models/test/model.zip')

# Loading
pretrained_policy = policies.ActorCriticPolicy.load('models/test/model.zip')
bc_model_reloaded = bc.BC(
    observation_space=venv.observation_space,
    action_space=venv.action_space,
    demonstrations=transitions_custom,
    rng=rng,
    policy=pretrained_policy,
)
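A quick, hypothetical way to confirm the round trip worked (using venv, bc_model and bc_model_reloaded from above) is to compare the deterministic actions of the original and reloaded policies:

import numpy as np

# Hypothetical check: the reloaded policy should produce the same deterministic
# actions as the policy it was saved from.
obs = venv.reset()
act_saved, _ = bc_model.policy.predict(obs, deterministic=True)
act_loaded, _ = bc_model_reloaded.policy.predict(obs, deterministic=True)
assert np.allclose(act_saved, act_loaded)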
saeed349 commented 4 months ago

I followed the method mentioned above by @yojul to load a BC model in SB3. But when I retrain the model in SB3, save it, and then try to reload it using PPO.load as shown below, I get a shape-mismatch error when the weights are copied. I am guessing this is due to the difference between imitation.policies.base.FeedForward32Policy and the stable_baselines3 ActorCriticPolicy.

Can @AlexGisi, @yojul or @JkAcktuator share how you overcame this issue?

pretrained_policy = policies.ActorCriticPolicy.load("imitation_bc_model.zip")
model = PPO(policy=policies.ActorCriticPolicy, env=env)
model.policy = pretrained_policy

model.learn(total_timesteps=100_000)
model.save("sb3_model.zip")

del model

model = PPO.load("sb3_model.zip")  # this throws an error
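One possible explanation, offered as a guess rather than a confirmed diagnosis: model.save stores the policy_kwargs the PPO was constructed with (the default net_arch), while the swapped-in BC policy uses the smaller FeedForward32Policy architecture, so PPO.load rebuilds a policy whose shapes do not match the saved weights. A sketch of a workaround, assuming the BC policy really is the default FeedForward32Policy (two hidden layers of 32 units), is to build PPO with a matching architecture and copy the weights instead of replacing the policy object:

# Sketch, not a verified fix: make PPO build a policy with the same shape as the
# BC policy, then copy the weights rather than swapping out the policy object.
pretrained_policy = policies.ActorCriticPolicy.load("imitation_bc_model.zip")
model = PPO(
    policy=policies.ActorCriticPolicy,
    env=env,
    policy_kwargs=dict(net_arch=[32, 32]),  # assumed to match FeedForward32Policy
)
model.policy.load_state_dict(pretrained_policy.state_dict())

model.learn(total_timesteps=100_000)
model.save("sb3_model.zip")
model = PPO.load("sb3_model.zip")  # should now rebuild with matching shapes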
CAI23sbP commented 4 months ago

Hi @saeed349! Here is my code.


dense_rollouts = rollout.rollout(
    dense_expert,
    DummyVecEnv([lambda: RolloutInfoWrapper(dense_env)]),
    rollout.make_sample_until(min_timesteps=None, min_episodes=250),
    rng=dense_rng,
)
dense_transitions = rollout.flatten_trajectories(dense_rollouts)

dense_bc = CustomBC(
    observation_space=dense_env.observation_space,
    action_space=dense_env.action_space,
    policy=dense_expert.policy,
    demonstrations=dense_transitions,
    rng=dense_rng,
    device='cuda',
    tensorboard_log=f'/home/cai/Desktop/PILRnav/runs/dense_bc',
)
dense_bc.train(n_epochs=10)
dense_bc.policy.save("/home/cai/Desktop/PILRnav/weight/dense_bc")

dense_ppo = PPO(
    policy='MlpPolicy',
    env=dense_env,
    policy_kwargs=policy_kwargs,
    use_sde=False,
    batch_size=64,
    n_epochs=7,
    learning_rate=0.0004,
    tensorboard_log=f'/home/cai/Desktop/PILRnav/runs/dense_ppo',
    verbose=1,
)

dense_ppo.policy = dense_ppo.policy.load("/home/cai/Desktop/PILRnav/weight/dense_bc")
dense_ppo.learn(MAX_ITER)
dense_ppo.policy.save("/home/cai/Desktop/PILRnav/weight/dense_ppo")

I hope it will be helpful.
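One reading of why the load works in this example, offered with the caveat that it depends on details not shown here: the BC policy was initialized from dense_expert.policy, so if dense_expert was itself an SB3 model built with 'MlpPolicy' and the same policy_kwargs, the saved weights match the policy PPO constructs, and dense_ppo.policy.load simply returns a compatible replacement. If the architectures differ, copying weights into the existing policy (as sketched earlier in the thread) may be the safer route:

from stable_baselines3.common.policies import ActorCriticPolicy

# Hypothetical alternative: keep the policy PPO constructed and load the saved
# BC weights into it instead of replacing the policy object.
bc_policy = ActorCriticPolicy.load("/home/cai/Desktop/PILRnav/weight/dense_bc")
dense_ppo.policy.load_state_dict(bc_policy.state_dict())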