Stable-Baselines-Team / stable-baselines3-contrib

Contrib package for Stable-Baselines3 - Experimental reinforcement learning (RL) code
https://sb3-contrib.readthedocs.io
MIT License

[Question] What would I get if I run training like this in SubprocVecEnv? #187

Closed — Pborz closed this issue 1 year ago

Pborz commented 1 year ago

❓ Question

There are two blocks of testing code below and I don't know which one is the correct way to test an LSTM policy. An LSTM network may not raise an obvious error even when something is wrong under the hood: for example, when testing a recurrent policy with a different number of environments than during training, the internal states might not match up correctly, resulting in unpredictable behavior and potentially suboptimal performance.

1. `n_envs` is supposed to be more than 1: `train_env = SubprocVecEnv([make_env(train_provider, i) for i in range(self.n_envs)])`
2. `model = RecurrentPPO("MlpLstmPolicy", train_env, verbose=1)` followed by `model.learn(5000)`
3. Test with a single env:

```python
test_env = make_env(train_provider)

zero_completed_obs = test_env.reset()
state = None  # initial LSTM state

action, state = model.predict(zero_completed_obs, state=state, deterministic=True)
obs, reward, done, info = test_env.step(action)
```

Is this code structure incorrect because of the mismatch in the number of envs between training and testing? Also, if I do it the way shown below, is that the right way to test?

1. Save the model trained with multiple SubprocVecEnv envs to `model_path`.
2. Load it with `init_envs` that contains the same `n_envs`:

```python
init_envs = DummyVecEnv([make_env(test_provider) for _ in range(self.n_envs)])

model_path = path.join('data', 'agents', f'{self.study_name}__{model_epoch}.pkl')
model = self.Model.load(model_path, env=init_envs)

test_env = make_env(test_provider)

zero_completed_obs = np.zeros((self.n_envs,) + init_envs.observation_space.shape)
zero_completed_obs[0, :] = test_env.reset()

state = None
rewards = []

action, state = model.predict(zero_completed_obs, state=state)
obs, reward, done, info = test_env.step([action[0]])

zero_completed_obs[0, :] = obs
```

3. But when stepping during testing, the steps go into `test_env`, which is just a single env.


araffin commented 1 year ago

Hello, I'm not sure I understand your question. You want to evaluate on a different number of environments? That's fine (you can check the documentation for an example).
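(For reference, the evaluation loop in the RecurrentPPO documentation looks roughly like the sketch below; the environment name, training budget, and loop length are placeholders rather than anything from this thread, and exact details may differ from the current docs.)

```python
import numpy as np

from sb3_contrib import RecurrentPPO

model = RecurrentPPO("MlpLstmPolicy", "CartPole-v1", verbose=1)
model.learn(5000)

vec_env = model.get_env()
obs = vec_env.reset()
# Cell and hidden state of the LSTM (None means "start from zeros")
lstm_states = None
num_envs = 1
# Episode start signals are used to reset the LSTM states
episode_starts = np.ones((num_envs,), dtype=bool)
for _ in range(1000):
    action, lstm_states = model.predict(
        obs, state=lstm_states, episode_start=episode_starts, deterministic=True
    )
    obs, rewards, dones, infos = vec_env.step(action)
    episode_starts = dones
```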

Pborz commented 1 year ago

> Hello, I'm not sure I understand your question. You want to evaluate on a different number of environments? That's fine (you can check the documentation for an example).

If the number of envs really doesn't matter for an LSTM policy, then I suggest removing this misleading sentence from https://stable-baselines.readthedocs.io/en/master/guide/examples.html#recurrent-policies:

> One current limitation of recurrent policies is that you must test them with the same number of environments they have been trained on.

Besides, I wonder: after training 2 envs with SubprocVecEnv, is there any difference between testing with a single-env `zero_completed_obs` and testing with a two-env obs where one of them is set to zeros?

Is that clear? I mean: a) not using any vec env, just passing a single env obs for testing, like `[2, 3, 4, 5, 6]`; b) using a vec env, making up two envs, but only passing one real obs while the other is zeros, like `[[2, 3, 4, 5, 6], [0, 0, 0, 0, 0]]`.
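(A tiny sketch, just to make the two observation shapes concrete; the feature values are the hypothetical `[2, 3, 4, 5, 6]` from the question, not real data.)

```python
import numpy as np

# Option a): a single-env observation with 5 features, no vec env batch dimension
single_obs = np.array([2, 3, 4, 5, 6])  # shape (5,)

# The same observation as a batch coming from a 1-env VecEnv
vec_obs = single_obs[None, :]            # shape (1, 5)

# Option b): a batch of 2 envs where the second row is zero-padded
padded_obs = np.zeros((2, 5))
padded_obs[0, :] = single_obs            # shape (2, 5)

print(single_obs.shape, vec_obs.shape, padded_obs.shape)  # (5,) (1, 5) (2, 5)
```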

araffin commented 1 year ago

> then I suggest removing this misleading sentence

This is SB2 documentation (using TensorFlow 1), not SB3 or SB3-Contrib documentation, which is here: https://sb3-contrib.readthedocs.io/en/master/modules/ppo_recurrent.html

> is there any difference between testing with a single-env zero_completed_obs and testing with a two-env obs where one of them is set to zeros?

Why would you test on two envs if you zero out the observation for one?

Pborz commented 1 year ago

OK then, but let me express it more precisely. I saw this comment in the issue above (https://github.com/hill-a/stable-baselines/issues/166#issuecomment-502350843):


> araffin commented on Jun 15, 2019 • Hello, you can find below a working example:

```python
# Train with 2 envs
n_training_envs = 2
envs = DummyVecEnv([make_env() for _ in range(n_training_envs)])
model = PPO2("MlpLstmPolicy", envs, nminibatches=2)

# Create one env for testing
test_env = DummyVecEnv([make_env() for _ in range(1)])
test_obs = test_env.reset()

# model.predict(test_obs) would throw an error
# because the number of test envs is different from the number of training envs,
# so we need to complete the observation with zeroes
zero_completed_obs = np.zeros((n_training_envs,) + envs.observation_space.shape)
zero_completed_obs[0, :] = test_obs
```

Actually, I once tried `model.predict(test_obs)` with a multi-env trained model and nothing went wrong. But the zero-padding way (`zero_completed_obs = np.zeros((n_training_envs,) + envs.observation_space.shape)` followed by `zero_completed_obs[0, :] = test_obs`) also worked without errors.

Besides, I don't think that SB2 sentence really has nothing to do with SB3 or SB3-Contrib, so a statement like that should be worded carefully lmao :->

OK, seriously: when we train a model on two or more envs with RecurrentPPO, compared to training on a single env, is the LSTM network really fine when a zero-padded row like `[0, ...]` is passed into it? I mean, the network could produce a messed-up answer for those zeros without showing any error on the surface.
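(One way to probe this concern empirically, sketched below under the assumption of a RecurrentPPO `model` and a single-env test observation `test_obs` with shape `(1,) + obs_shape`; both names are hypothetical here. The idea is to compare the action predicted for the real observation alone with the action for the same observation inside a zero-padded batch.)

```python
import numpy as np

# Hypothetical sanity check: does zero-padding the batch change the action for the real env?
obs_shape = model.observation_space.shape

# Predict for a batch of 1 (the real test observation only)
single_action, _ = model.predict(
    test_obs, state=None, episode_start=np.ones(1, dtype=bool), deterministic=True
)

# Predict for a zero-padded batch of 2 (real obs in row 0, zeros in row 1)
padded_obs = np.zeros((2,) + obs_shape)
padded_obs[0] = test_obs[0]
padded_action, _ = model.predict(
    padded_obs, state=None, episode_start=np.ones(2, dtype=bool), deterministic=True
)

# If the padded row does not leak into the real row, these should match
print(single_action[0], padded_action[0])
```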

Pborz commented 1 year ago

@araffin Helloooooooooooooooo