DLR-RM / stable-baselines3

PyTorch version of Stable Baselines, reliable implementations of reinforcement learning algorithms.
https://stable-baselines3.readthedocs.io
MIT License

Add support for pretraining [feature request] #27

Closed skervim closed 4 years ago

skervim commented 4 years ago

First: I'm very happy to see the new PyTorch SB3 version! Great job!

My question is whether pretraining support is planned for SB3 (as it was for SB: https://stable-baselines.readthedocs.io/en/master/guide/pretrain.html). I couldn't find it mentioned in the Roadmap.

In my opinion it is a very valuable feature!

araffin commented 4 years ago

My question is whether pretraining support is planned for SB3 (as it was for SB: https://stable-baselines.readthedocs.io/en/master/guide/pretrain.html). I couldn't find it mentioned in the Roadmap.

As mentioned in the design choices (see https://github.com/hill-a/stable-baselines/issues/576), everything related to imitation learning (this includes GAIL and pretraining using behavior cloning) will be handled outside SB3 (most likely in this repo: https://github.com/HumanCompatibleAI/imitation by @AdamGleave et al.).

Otherwise, you can check out this repo: https://github.com/joonaspu/video-game-behavioural-cloning by @Miffyli et al., where pretraining is done using PyTorch.

We may add an example though (and maybe include it in the zoo), as it is simple to implement in some cases.

araffin commented 4 years ago

@skervim we would be happy if you could provide such an example ;) (maybe as a colab notebook)

Miffyli commented 4 years ago

With SB3, I think this should indeed be off-loaded to users. SB's pretrain function was promising but somewhat limiting. With SB3 we could provide interfaces to obtain a policy of the right shape given an environment; the user can then take this policy, do their own imitation learning (e.g. supervised learning on some dataset of demonstrations), and load those parameters back into the policy.
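E.g. something along these lines (just a sketch; the state_dict file name is a made-up placeholder, and the actual supervised training would be the user's own code):

import gym
import torch as th
from stable_baselines3 import PPO

env = gym.make("MountainCarContinuous-v0")
model = PPO("MlpPolicy", env)
policy = model.policy  # a regular torch.nn.Module matching the env's spaces

# ... run your own imitation learning on `policy` here ...

# or, if the pretrained weights were produced elsewhere, load them back in
model.policy.load_state_dict(th.load("pretrained_state_dict.pth"))

# continue with regular RL training afterwards
model.learn(total_timesteps=int(1e6))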

araffin commented 4 years ago

With SB3 we could provide interfaces to obtain a policy of the right shape given an environment,

This is already the case, no?

Miffyli commented 4 years ago

With SB3 we could provide interfaces to obtain a policy of the right shape given an environment,

This is already the case, no?

Fair point, it is not hidden per se; one just needs to know what to access to obtain this policy. Example code for this in the docs should do the trick :)

skervim commented 4 years ago

I'm not completely sure if I am following. In case of behavioral cloning, you two suggest something like the following?

"""
Example code for behavioral cloning
"""
from stable_baselines3 import PPO
import gym

# Initialize environment and agent
env = gym.make("MountainCarContinuous-v0")
ppo = PPO("MlpPolicy", env)

# Extract initial policy
policy = ppo.policy

# Perform behavioral cloning with external code
pretrained_policy = external_supervised_learning(policy, external_dataset)

# Insert pretrained policy back into agent
ppo.policy = pretrained_policy

# Perform training
ppo.learn(total_timesteps=int(1e6))

araffin commented 4 years ago

I'm not completely sure if I am following. In case of behavioral cloning, you two suggest something like the following?

Yes. In practice, because ppo.policy is an object, it is modified by reference, so policy = ppo.policy and ppo.policy = pretrained_policy could be removed (even though it is cleaner written the way you did).
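i.e. something like this (keeping your placeholder function):

# pretrain the policy in place, no reassignment needed
external_supervised_learning(ppo.policy, external_dataset)

# the agent already uses the pretrained weights
ppo.learn(total_timesteps=int(1e6))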

skervim commented 4 years ago

FYI, my use case is that I have a custom environment and would like to pretrain an SB3 PPO agent on an expert dataset that I created for that environment, in a simple behavioral cloning fashion. Then I would like to continue training the pretrained agent.

I would gladly provide an example, as suggested by @araffin, but I'm not completely sure what it should look like.

Is @AdamGleave's https://github.com/HumanCompatibleAI/imitation going to support SB3 soon? In that case, should the part:

# Perform behavioral cloning with external code
pretrained_policy = external_supervised_learning(policy, external_dataset)

be implemented there and then an example should be created in the SB3 documentation?

Which parts are needed for such an implementation? As far as I can tell:

Code to create an expert data set by simulating an environment (with some agent/policy) and storing observations and actions
Code to represent an expert data set, and to provide batches, shuffling etc.
PyTorch code to perform supervised learning.

Am I missing anything? I would like to contribute back to the repository and try to work on this task; however, I think I need some hints on how to start and could benefit from guidance from those who have already worked on this problem.
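For the first point, I imagine something roughly like this (just a sketch; the file name is a placeholder, and PPO.load("expert_model") assumes a trained expert already exists):

import gym
import numpy as np
from stable_baselines3 import PPO

# Roll out a trained expert and store (observation, action) pairs
env = gym.make("MountainCarContinuous-v0")
expert = PPO.load("expert_model")

observations, actions = [], []
obs = env.reset()
for _ in range(10_000):
    action, _ = expert.predict(obs, deterministic=True)
    observations.append(obs)
    actions.append(action)
    obs, reward, done, info = env.step(action)
    if done:
        obs = env.reset()

np.savez(
    "expert_dataset.npz",
    observations=np.array(observations),
    actions=np.array(actions),
)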

araffin commented 4 years ago

be implemented there and then an example should be created in the SB3 documentation?

@AdamGleave is busy with the NeurIPS deadline... so better to just create a stand-alone example as a colab notebook here (SB3 branch).

Code to create an expert data set by simulating an environment (with some agent/policy) and storing observations and actions

Usually people have their own format, but yes, the dataset creation code from SB2 can be reused (it does not depend on TF at all).

Code to represent an expert data set, and to provide batches, shuffling etc.

Yes, but this will normally be contained in the training loop (the SB2 code can be simplified since we don't support GAIL). I'm not sure we need a class for that in stand-alone code.

PyTorch code to perform supervised learning.

Your 2nd and 3rd points can be merged into one, I think.
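Roughly, the merged 2nd/3rd point could look like this (just a sketch, not final code: the expert_dataset.npz file name is a placeholder, and it assumes an on-policy algorithm whose policy exposes evaluate_actions):

import gym
import numpy as np
import torch as th
from torch.utils.data import DataLoader, TensorDataset
from stable_baselines3 import PPO

ppo = PPO("MlpPolicy", gym.make("MountainCarContinuous-v0"))

# Load the expert data and wrap it for batching and shuffling
data = np.load("expert_dataset.npz")
dataset = TensorDataset(
    th.as_tensor(data["observations"], dtype=th.float32),
    th.as_tensor(data["actions"], dtype=th.float32),
)
loader = DataLoader(dataset, batch_size=64, shuffle=True)

# Behavior cloning: maximize the log-likelihood of the expert actions
optimizer = th.optim.Adam(ppo.policy.parameters(), lr=3e-4)
for epoch in range(10):
    for expert_obs, expert_actions in loader:
        expert_obs = expert_obs.to(ppo.device)
        expert_actions = expert_actions.to(ppo.device)
        _, log_prob, _ = ppo.policy.evaluate_actions(expert_obs, expert_actions)
        loss = -log_prob.mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

# Continue with regular RL training on the pretrained policy
ppo.learn(total_timesteps=int(1e6))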

araffin commented 4 years ago

Last thing: it is not documented yet, but policies can now be saved and loaded without a model ;).

EDIT: model = PPO("MlpPolicy", "MountainCarContinuous-v0") works too
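For example (quick sketch; the file name is a placeholder):

from stable_baselines3 import PPO
from stable_baselines3.common.policies import ActorCriticPolicy

model = PPO("MlpPolicy", "MountainCarContinuous-v0")
# save only the policy, without the rest of the algorithm state
model.policy.save("pretrained_policy.pth")

# later: load the policy on its own, no PPO object needed
policy = ActorCriticPolicy.load("pretrained_policy.pth")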

skervim commented 4 years ago

Alright, thanks for the clarifications. I will try to implement a simple standalone example, and PR it as a colab notebook to the SB3 branch when I have it working!

araffin commented 4 years ago

@skervim I updated the notebook and added support for discrete actions + SAC/TD3

You can try the notebook online here

We just need to update the documentation and we can close this issue.

skervim commented 4 years ago

@araffin: Glad that I could contribute, and happy to have learned something new from your improvements to the notebook :)

flint-xf-fan commented 4 years ago

I want to ask something related to this. Instead of generating "expert data" after the teacher has been trained, how do I directly save the trajectory of the teacher during training as the "expert data", and use that data to train my student?

flint-xf-fan commented 4 years ago

@skervim I updated the notebook and added support for discrete actions + SAC/TD3

You can try the notebook online here

We just need to update the documentation and we can close this issue.

I downloaded the notebook and ran it on an RTX 2070 GPU with CUDA 10.1 on Ubuntu 18.04. The whole notebook works fine, except that the last cell, which evaluates the policy, gives the following error. Any hints?

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-...> in <module>
----> 1 mean_reward, std_reward = evaluate_policy(a2c_student, env, n_eval_episodes=10)
      2 
      3 print(f"Mean reward = {mean_reward} +/- {std_reward}")

~/anaconda3/envs/sb3-torch1.6/lib/python3.6/site-packages/stable_baselines3/common/evaluation.py in evaluate_policy(model, env, n_eval_episodes, deterministic, render, callback, reward_threshold, return_episode_rewards)
     37         episode_length = 0
     38         while not done:
---> 39             action, state = model.predict(obs, state=state, deterministic=deterministic)
     40             obs, reward, done, _info = env.step(action)
     41             episode_reward += reward

~/anaconda3/envs/sb3-torch1.6/lib/python3.6/site-packages/stable_baselines3/common/base_class.py in predict(self, observation, state, mask, deterministic)
    287             (used in recurrent policies)
    288         """
--> 289         return self.policy.predict(observation, state, mask, deterministic)
    290 
    291     @classmethod

~/anaconda3/envs/sb3-torch1.6/lib/python3.6/site-packages/stable_baselines3/common/policies.py in predict(self, observation, state, mask, deterministic)
    155         observation = observation.reshape((-1,) + self.observation_space.shape)
    156 
--> 157         observation = th.as_tensor(observation).to(self.device)
    158         with th.no_grad():
    159             actions = self._predict(observation, deterministic=deterministic)

RuntimeError: CUDA error: an illegal memory access was encountered

Miffyli commented 4 years ago

I want to ask something related to this. Instead of generating "expert data" after the teacher has been trained, how do I directly save the trajectory of the teacher during training as the "expert data", and use that data to train my student?

The easiest way to do this would be to save states and actions in the environment, e.g. with some kind of wrapper that keeps track of states and actions and saves them to a file once done is encountered.
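A rough sketch of such a wrapper (the class name and the .npz format are just placeholders, not something built into SB3):

import gym
import numpy as np

class ExpertDataRecorder(gym.Wrapper):
    """Record (observation, action) pairs and dump them to disk at the end of each episode."""

    def __init__(self, env, save_path="teacher_trajectories.npz"):
        super().__init__(env)
        self.save_path = save_path
        self.observations, self.actions = [], []
        self._last_obs = None

    def reset(self, **kwargs):
        self._last_obs = self.env.reset(**kwargs)
        return self._last_obs

    def step(self, action):
        self.observations.append(self._last_obs)
        self.actions.append(action)
        obs, reward, done, info = self.env.step(action)
        self._last_obs = obs
        if done:
            np.savez(
                self.save_path,
                observations=np.array(self.observations),
                actions=np.array(self.actions),
            )
        return obs, reward, done, info

# wrap the training env so the teacher's trajectories are recorded during learning:
# env = ExpertDataRecorder(gym.make("MountainCarContinuous-v0"))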

I downloaded the notebook and ran it on an RTX 2070 GPU with CUDA 10.1 on Ubuntu 18.04. The whole notebook works fine, except that the last cell, which evaluates the policy, gives the following error. Any hints?

I have no idea what could cause that, sorry :/

flint-xf-fan commented 4 years ago

The easiest way to do this would be to save states and actions in the environment, e.g. with some kind of wrapper that keeps track of states and actions and saves them to a file once done is encountered.

thanks

I have no idea what could cause that, sorry :/

Ah, np. It seems to be an issue on PyTorch's side.