Closed — skervim closed this issue 4 years ago
First: I'm very happy to see the new PyTorch SB3 version! Great job!
My question is whether pretraining support is planned for SB3 (as it exists for SB: https://stable-baselines.readthedocs.io/en/master/guide/pretrain.html). I couldn't find it mentioned in the roadmap. In my opinion it is a very valuable feature!
As mentioned in the design choices (see https://github.com/hill-a/stable-baselines/issues/576), everything related to imitation learning (including GAIL and pretraining via behavior cloning) will be done outside SB3 (most likely in this repo: https://github.com/HumanCompatibleAI/imitation by @AdamGleave et al.).
Otherwise, you can check out this repo https://github.com/joonaspu/video-game-behavioural-cloning by @Miffyli et al., where pretraining is done using PyTorch.
We may add an example though (and maybe include it in the zoo), as it is simple to implement in some cases.
@skervim we would be happy if you could provide such an example ;) (maybe as a colab notebook)
With SB3, I think this should indeed be off-loaded to users. SB's pretrain function was promising but somewhat limiting. With SB3 we could provide interfaces to obtain a policy of the right shape given an environment; the user can then take this policy, do their own imitation learning (e.g. supervised learning on some dataset of demonstrations), and load the resulting parameters back into the policy.
> With SB3 we could provide interfaces to obtain a policy of the right shape given an environment

This is already the case, no?
> > With SB3 we could provide interfaces to obtain a policy of the right shape given an environment
>
> This is already the case, no?

Fair point, it is not hidden per se; one just needs to know what to access to obtain this policy. Some example code for this in the docs should do the trick :)
I'm not completely sure I'm following. In the case of behavioral cloning, are you two suggesting something like the following?
"""
Example code for behavioral cloning
"""
from stable_baselines3 import PPO
import gym
# Initialize environment and agent
env = gym.make("MountainCarContinuous-v0")
ppo = PPO("MlpPolicy", env)
# Extract initial policy
policy = ppo.policy
# Perform behavioral cloning with external code
pretrained_policy = external_supervised_learning(policy, external_dataset)
# Insert pretrained policy back into agent
ppo.policy = pretrained_policy
# Perform training
ppo.learn(total_timesteps=int(1e6))
> I'm not completely sure I'm following. In the case of behavioral cloning, are you two suggesting something like the following?

Yes. In practice, because `ppo.policy` is an object, it is modified by reference, so `policy = ppo.policy` and `ppo.policy = pretrained_policy` could be removed (even though it is cleaner written the way you did).
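To make the by-reference point concrete, here is a minimal sketch (`external_supervised_learning` and `external_dataset` are still hypothetical placeholders from the example above, not SB3 APIs):

```python
import gym

from stable_baselines3 import PPO

env = gym.make("MountainCarContinuous-v0")
ppo = PPO("MlpPolicy", env)

# ppo.policy is a torch.nn.Module; training it in place is enough,
# there is no need to extract it and assign it back afterwards.
external_supervised_learning(ppo.policy, external_dataset)  # hypothetical placeholder

# Continue with regular RL training
ppo.learn(total_timesteps=int(1e6))
```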
FYI, my use case is that I have a custom environment and would like to pretrain an SB3 PPO agent in a simple behavioral-cloning fashion, using an expert dataset that I have created for that environment. Then I would like to continue training the pretrained agent.
I would gladly provide an example, as suggested by @araffin, but I'm not completely sure what it should look like.
Is @AdamGleave's https://github.com/HumanCompatibleAI/imitation going to support SB3 soon? In that case, should the part:
# Perform behavioral cloning with external code
pretrained_policy = external_supervised_learning(policy, external_dataset)
be implemented there, and then an example created in the SB3 documentation?
Which parts would be needed for such an implementation? I can think of:

- Code to create an expert dataset by simulating an environment (with some agent/policy) and storing observations and actions
- Code to represent an expert dataset and to provide batches, shuffling, etc.
- PyTorch code to perform supervised learning

Am I missing anything? I would like to contribute back to the repository and try to work on this task, but I think I need a hint on how to start and could benefit from some guidance from those who have already worked on this problem.
> be implemented there, and then an example created in the SB3 documentation?

@AdamGleave is busy with the NeurIPS deadline... so it is better to just create a stand-alone example as a colab notebook here (SB3 branch).
> Code to create an expert dataset by simulating an environment (with some agent/policy) and storing observations and actions

Usually people have their own format, but yes, the dataset-creation code from SB2 can be reused (it does not depend on TF at all).
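For illustration, a rough sketch of such dataset generation (names, file format, and rollout length are arbitrary choices, not an SB3 convention):

```python
import gym
import numpy as np

from stable_baselines3 import PPO

# Train (or load) a teacher agent
env = gym.make("MountainCarContinuous-v0")
teacher = PPO("MlpPolicy", env).learn(total_timesteps=50_000)

# Roll out the teacher and record observations and the actions it takes
observations, actions = [], []
obs = env.reset()
for _ in range(10_000):
    action, _ = teacher.predict(obs, deterministic=True)
    observations.append(obs)
    actions.append(action)
    obs, _, done, _ = env.step(action)
    if done:
        obs = env.reset()

# Store the expert data in a simple NumPy archive
np.savez("expert_data.npz",
         observations=np.array(observations),
         actions=np.array(actions))
```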
> Code to represent an expert dataset and to provide batches, shuffling, etc.

Yes, but this will normally be contained in the training loop (the SB2 code can be simplified, as we don't support GAIL). I'm not sure we need a class for that in a stand-alone example.
> PyTorch code to perform supervised learning

Your 2nd and 3rd points can be merged into one, I think.
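As a sketch of what those merged parts (batching plus supervised learning) could look like, assuming the expert data saved as in the sketch above and a PPO-style `ActorCriticPolicy`, which exposes `evaluate_actions`; hyperparameters and file names are illustrative:

```python
import gym
import numpy as np
import torch as th
from torch.utils.data import DataLoader, TensorDataset

from stable_baselines3 import PPO

# Load the expert data (file name matches the sketch above)
data = np.load("expert_data.npz")
dataset = TensorDataset(
    th.as_tensor(data["observations"], dtype=th.float32),
    th.as_tensor(data["actions"], dtype=th.float32),
)
loader = DataLoader(dataset, batch_size=64, shuffle=True)

student = PPO("MlpPolicy", gym.make("MountainCarContinuous-v0"))
policy = student.policy
optimizer = th.optim.Adam(policy.parameters(), lr=3e-4)

# Behavior cloning: maximize the log-likelihood of the expert actions
for epoch in range(10):
    for obs, expert_actions in loader:
        obs = obs.to(policy.device)
        expert_actions = expert_actions.to(policy.device)
        _, log_prob, _ = policy.evaluate_actions(obs, expert_actions)
        loss = -log_prob.mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

# The policy was updated in place, so regular RL training can continue
student.learn(total_timesteps=100_000)
```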
One last thing: it is not documented yet, but policies can now be saved and loaded without a model ;).
EDIT: `model = PPO("MlpPolicy", "MountainCarContinuous-v0")` works too.
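Presumably that save/load round trip looks something like the following (a hedged sketch, since it was undocumented at the time; the file name is illustrative):

```python
from stable_baselines3 import PPO
from stable_baselines3.common.policies import ActorCriticPolicy

# The environment id string works in place of an env instance
model = PPO("MlpPolicy", "MountainCarContinuous-v0")

# Save only the policy, not the full model
model.policy.save("ppo_policy.pth")

# Later: restore the policy on its own, without re-creating a PPO model
policy = ActorCriticPolicy.load("ppo_policy.pth")
```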
Alright, thanks for the clarifications. I will try to implement a simple stand-alone example and PR it as a colab notebook to the SB3 branch once I have it working!
@skervim I updated the notebook and added support for discrete actions + SAC/TD3
You can try the notebook online here
We just need to update the documentation and we can close this issue.
@araffin: Glad that I could contribute, and happy to have learned something new from your improvements to the notebook :)
I want to ask something related to this. Instead of generating the "expert data" after the teacher has been trained, how do I directly save the teacher's trajectories during training as the "expert data", and then use that data to train my student?
> @skervim I updated the notebook and added support for discrete actions + SAC/TD3
> You can try the notebook online here
> We just need to update the documentation and we can close this issue.

I downloaded the notebook and ran it on an RTX 2070 GPU with CUDA 10.1 on Ubuntu 18.04. The whole notebook works fine except for the last cell, which evaluates the policy and raises the error below. Any hints?
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-...> in <module>
----> 1 mean_reward, std_reward = evaluate_policy(a2c_student, env, n_eval_episodes=10)
      2
      3 print(f"Mean reward = {mean_reward} +/- {std_reward}")

~/anaconda3/envs/sb3-torch1.6/lib/python3.6/site-packages/stable_baselines3/common/evaluation.py in evaluate_policy(model, env, n_eval_episodes, deterministic, render, callback, reward_threshold, return_episode_rewards)
     37         episode_length = 0
     38         while not done:
---> 39             action, state = model.predict(obs, state=state, deterministic=deterministic)
     40             obs, reward, done, _info = env.step(action)
     41             episode_reward += reward

~/anaconda3/envs/sb3-torch1.6/lib/python3.6/site-packages/stable_baselines3/common/base_class.py in predict(self, observation, state, mask, deterministic)
    287             (used in recurrent policies)
    288         """
--> 289         return self.policy.predict(observation, state, mask, deterministic)
    290
    291     @classmethod

~/anaconda3/envs/sb3-torch1.6/lib/python3.6/site-packages/stable_baselines3/common/policies.py in predict(self, observation, state, mask, deterministic)
    155         observation = observation.reshape((-1,) + self.observation_space.shape)
    156
--> 157         observation = th.as_tensor(observation).to(self.device)
    158         with th.no_grad():
    159             actions = self._predict(observation, deterministic=deterministic)

RuntimeError: CUDA error: an illegal memory access was encountered
> I want to ask something related to this. Instead of generating the "expert data" after the teacher has been trained, how do I directly save the teacher's trajectories during training as the "expert data", and then use that data to train my student?

The easiest way to do this would be to save states and actions in the environment, e.g. with some kind of wrapper that keeps track of states and actions and saves them to a file once a done is encountered.
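A rough sketch of such a wrapper, assuming the old four-value gym step API that SB3 used at the time (class name and file name are made up):

```python
import gym
import numpy as np


class ExpertRecorder(gym.Wrapper):
    """Record (observation, action) pairs of whichever agent interacts
    with the env and dump them to disk at the end of every episode."""

    def __init__(self, env, save_path="expert_data.npz"):
        super().__init__(env)
        self.save_path = save_path
        self.observations, self.actions = [], []
        self._last_obs = None

    def reset(self, **kwargs):
        self._last_obs = self.env.reset(**kwargs)
        return self._last_obs

    def step(self, action):
        # Store the observation the agent saw and the action it chose
        self.observations.append(np.array(self._last_obs))
        self.actions.append(np.array(action))
        obs, reward, done, info = self.env.step(action)
        self._last_obs = obs
        if done:
            np.savez(self.save_path,
                     observations=np.array(self.observations),
                     actions=np.array(self.actions))
        return obs, reward, done, info
```

Wrapping the training env with it (e.g. `env = ExpertRecorder(gym.make(...))`) before calling `learn()` would then collect the teacher's trajectories as they are generated.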
> I downloaded the notebook and ran it on an RTX 2070 GPU with CUDA 10.1 on Ubuntu 18.04. The whole notebook works fine except for the last cell, which evaluates the policy and raises the error below. Any hints?
I have no idea what could cause that, sorry :/
> The easiest way to do this would be to save states and actions in the environment, e.g. with some kind of wrapper that keeps track of states and actions and saves them to a file once a done is encountered.

Thanks!
> I have no idea what could cause that, sorry :/

Ah, np. It seems to come from PyTorch's side.