HumanCompatibleAI / imitation

Clean PyTorch implementations of imitation and reward learning algorithms
https://imitation.readthedocs.io/
MIT License

Problems using the custom environment that follows OpenAI interface #365

Closed · Abermal closed this issue 2 years ago

Abermal commented 2 years ago

Hello, first of all, thank you for the repo. I have created a simple environment for a 2D game I wrote, and I would like to apply AIRL to it to obtain the reward function from my demonstrations. I'm using the following environment:

class MyEnv:
    def __init__(self):
        self.observation_space = spaces.Box(low=-np.inf, high=np.inf, shape=(self._sett.n_sensors * 3,))
        self.action_space = spaces.multi_discrete.MultiDiscrete([3, 3])

    def reset(self):
        ...

    def step(self, action):
        ...

    def update(self, action):
        ...

I fixed several things in my env and in the way I store the demonstrations so that they match your repo, but I got stuck on the following error.

Traceback (most recent call last):
  File "D:/work/imitation/examples/quickstart.py", line 85, in <module>
    airl_trainer.train(total_timesteps=20480)
  File "D:\work\imitation\src\imitation\algorithms\adversarial\common.py", line 431, in train
    self.train_gen(self.gen_train_timesteps)
  File "D:\work\imitation\src\imitation\algorithms\adversarial\common.py", line 391, in train_gen
    self.gen_algo.learn(
  File "C:\Users\...\lib\site-packages\stable_baselines3\ppo\ppo.py", line 301, in learn
    return super(PPO, self).learn(
  File "C:\Users\...\lib\site-packages\stable_baselines3\common\on_policy_algorithm.py", line 237, in learn
    continue_training = self.collect_rollouts(self.env, callback, self.rollout_buffer, n_rollout_steps=self.n_steps)
  File "C:\Users\...\lib\site-packages\stable_baselines3\common\on_policy_algorithm.py", line 178, in collect_rollouts
    new_obs, rewards, dones, infos = env.step(clipped_actions)
  File "C:\Users\...\lib\site-packages\stable_baselines3\common\vec_env\base_vec_env.py", line 161, in step
    self.step_async(actions)
  File "C:\Users\I008658\Anaconda3\envs\iav\lib\site-packages\stable_baselines3\common\vec_env\base_vec_env.py", line 264, in step_async
    self.venv.step_async(actions)
  File "D:\work\imitation\src\imitation\rewards\reward_wrapper.py", line 84, in step_async
    return self.venv.step_async(actions)
  File "C:\Users\I008658\Anaconda3\envs\iav\lib\site-packages\stable_baselines3\common\vec_env\base_vec_env.py", line 264, in step_async
    self.venv.step_async(actions)
  File "D:\work\imitation\src\imitation\data\wrappers.py", line 52, in step_async
    self.venv.step_async(actions)
AttributeError: 'Toy_Environment' object has no attribute 'step_async'

I managed to trace it to the following place in the BaseAlgorithm class:

    @staticmethod
    def _wrap_env(env: GymEnv, verbose: int = 0, monitor_wrapper: bool = True) -> VecEnv:
        # ...
        if not isinstance(env, VecEnv):
            if not is_wrapped(env, Monitor) and monitor_wrapper:
                if verbose >= 1:
                    print("Wrapping the env with a `Monitor` wrapper")
                env = Monitor(env)
            # TODO this is where it breaks
            if verbose >= 1:
                print("Wrapping the env in a DummyVecEnv.")
            env = DummyVecEnv([lambda: env])

Correct me if I'm wrong, but the DummyVecEnv wrapper class is supposed to give my env asynchronous functionality, right? I don't have much experience in async programming. Is there any way for me to use your repository without writing the step_async method?

Thank you for your answer!

AdamGleave commented 2 years ago

All of our algorithms expect a VecEnv (see SB3 docs). You don't need to write any more code for your environment: just write something like venv = DummyVecEnv([lambda: MyEnv()]) and pass in venv instead of MyEnv().
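For concreteness, a minimal sketch of that wrapping (assuming MyEnv is the environment class from the issue and the rest of the quickstart setup stays the same):

from stable_baselines3 import PPO
from stable_baselines3.common.vec_env import DummyVecEnv

# Wrap the raw Gym env in a single-env DummyVecEnv before handing it to imitation/SB3.
venv = DummyVecEnv([lambda: MyEnv()])

# Pass `venv` (not MyEnv()) wherever an environment is expected, e.g. to the PPO generator.
gen_algo = PPO("MlpPolicy", venv)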

Stable Baselines does do some magic Env-to-VecEnv wrapping (such as in the extract you pasted), but imitation does not do this.

Hope this helps.

Abermal commented 2 years ago

It helped, but I thought this wrapping was supposed to be performed automatically by stable-baselines. There is another error though: in the finish_trajectory method my last observation has its batch dimension preserved for some reason, which leads to an error in np.stack(arr_list, axis=0).

My reset and step methods return the observation in the same format, [1, obs_dim]. The expert demonstrations follow this convention too. This error doesn't happen in the CartPole-v0 example.

[screenshot of the error]
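For illustration, a minimal sketch of the kind of shape mismatch being described (obs_dim is hypothetical; np.stack requires all inputs to have the same shape, so a single (1, obs_dim) entry among (obs_dim,) entries fails):

import numpy as np

obs_dim = 9  # hypothetical, e.g. n_sensors * 3
step_obs = [np.zeros(obs_dim) for _ in range(4)]  # per-step observations, shape (obs_dim,)
final_obs = np.zeros((1, obs_dim))                # final observation with a leftover batch dimension

np.stack(step_obs + [final_obs], axis=0)
# ValueError: all input arrays must have the same shape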

Any ideas on how to fix this? Thanks once again for such a quick answer!

AdamGleave commented 2 years ago

imitation is not stable-baselines. Just because they wrap the environment doesn't mean an unwrapped environment will work with our code.

I'd suspect an issue with what reset() is returning; are you sure it's not including an extra dimension? Otherwise I don't see why this environment would fail while others work. If you provide a minimal example to reproduce the error, I'm happy to try to debug it.

ejnnr commented 2 years ago

My reset and step methods return the observation in the same format [1, obs_dim].

I think that might be the issue. step() and reset() observations should have the shape declared in the environment's observation_space, so just (self._sett.n_sensors * 3,) without the leading singleton dimension in your case. I haven't thought through exactly why this would lead to the error you're seeing, but you should remove the leading 1 in any case.
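As a minimal sketch of what that looks like (the ShapeCheckedEnv name and _obs helper are hypothetical; the point is that reset() and step() return observations matching observation_space.shape, with no leading batch axis):

import gym
import numpy as np
from gym import spaces

class ShapeCheckedEnv(gym.Env):
    """Hypothetical minimal env returning observations that match observation_space."""

    def __init__(self, n_sensors=3):
        self.observation_space = spaces.Box(low=-np.inf, high=np.inf, shape=(n_sensors * 3,))
        self.action_space = spaces.MultiDiscrete([3, 3])

    def _obs(self):
        # 1-D observation of shape (n_sensors * 3,) -- no leading (1, ...) batch dimension.
        return np.zeros(self.observation_space.shape, dtype=np.float32)

    def reset(self):
        return self._obs()

    def step(self, action):
        obs, reward, done, info = self._obs(), 0.0, False, {}
        return obs, reward, done, info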

Abermal commented 2 years ago

Thanks @ejnnr, it worked!