araffin / rl-baselines-zoo

A collection of 100+ pre-trained RL agents using Stable Baselines, training and hyperparameter optimization included.
https://stable-baselines.readthedocs.io/
MIT License

Irreproducible zoo trials #108

Closed: blurLake closed this issue 3 years ago

blurLake commented 3 years ago

Hi, I am using the zoo to optimise the hyperparameters for SAC with a customised env. The command I used was:

python3 train.py --algo sac --env FullFilterEnv-v0 --gym-packages gym_environment -n 50000 -optimize --eval-episodes 40 --n-trials 1000 --n-jobs 2 --sampler random --pruner median

I use --eval-episodes 40 to get agents with more stable performance.

Some details about the env: each episode is at most 5 steps long. The reward for a regular step is the negative of a Euclidean norm, e.g. -||x - x_target||, and a successful step gets reward +100. Once +100 is reached, the episode is over.
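To be concrete, the reward logic looks roughly like this (a simplified sketch with placeholder dynamics and names, not my actual env code):

import gym
import numpy as np
from gym import spaces


class FullFilterEnvSketch(gym.Env):
    """Simplified sketch of the reward structure described above (not the real env)."""

    def __init__(self, target=None, tol=1e-2, max_steps=5):
        super().__init__()
        self.action_space = spaces.Box(low=-1.0, high=1.0, shape=(3,), dtype=np.float32)
        self.observation_space = spaces.Box(low=-np.inf, high=np.inf, shape=(3,), dtype=np.float32)
        self.target = np.zeros(3, dtype=np.float32) if target is None else target
        self.tol = tol
        self.max_steps = max_steps

    def reset(self):
        self.n_steps = 0
        self.x = np.random.uniform(-1.0, 1.0, size=3).astype(np.float32)
        return self.x.copy()

    def step(self, action):
        self.n_steps += 1
        self.x = self.x + np.asarray(action, dtype=np.float32)  # placeholder dynamics
        dist = float(np.linalg.norm(self.x - self.target))
        if dist < self.tol:
            # Successful step: reward +100 and the episode is over
            reward, done = 100.0, True
        else:
            # Regular step: reward is -||x - x_target||
            reward = -dist
            done = self.n_steps >= self.max_steps  # at most 5 steps per episode
        return self.x.copy(), reward, done, {}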

In the zoo, I get some results like

[I 2020-09-29 07:35:38,656] Trial 697 finished with value: -100.0 and parameters: {'gamma': 0.5, 'lr': 0.009853989305797941, 'learning_starts': 50, 'batch_size': 64, 'buffer_size': 100000, 'train_freq': 256, 'tau': 0.01, 'ent_coef': 'auto', 'net_arch': 'deep', 'target_entropy': -100}. Best is trial 650 with value: -100.0.

That means that in the last 40 evaluation episodes after 50,000 timesteps, every episode finishes in a single step and directly gets reward +100, which is kind of too good to be true. So I took the recommended parameters and ran a real training on the same env, using 40 episodes to compute the mean ep_reward. But after 50,000 timesteps the mean ep_reward was only around -900, which is far from a success in each episode.

Notice that two trials give -100. A similar "irreproducibility" happens with other trials as well. Is this something known in the zoo, or did I do something wrong?

BTW, I use the same random seed as in the zoo, i.e.,

import numpy as np

SEED = 0
np.random.seed(SEED)
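(For what it's worth, I could also use Stable Baselines' set_global_seeds, which, if I understand correctly, seeds NumPy, Python's random module and TensorFlow in one call:)

from stable_baselines.common import set_global_seeds

SEED = 0
# Seeds NumPy, Python's random module and TensorFlow (if installed) at once.
set_global_seeds(SEED)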

Here is the code I use in the callback to calculate the mean ep_reward:

# Requires: import numpy as np and
# from stable_baselines.results_plotter import load_results, ts2xy
def _on_step(self) -> bool:
    if self.n_calls % self.check_freq == 0:

        # Retrieve training reward from the Monitor logs
        x, y = ts2xy(load_results(self.log_dir), 'timesteps')
        if len(x) > 0:
            # Mean training reward over the last 40 episodes
            mean_reward = np.mean(y[-40:])
            if self.verbose > 0:
                print("Num timesteps: {}".format(self.num_timesteps))
                print("Best mean reward: {:.2f} - Last mean reward per episode: {:.2f}".format(self.best_mean_reward, mean_reward))

            # New best model, you could save the agent here
            if mean_reward > self.best_mean_reward:
                self.best_mean_reward = mean_reward
                # Example for saving best model
                if self.verbose > 0:
                    print("Saving new best model to {}".format(self.save_path))
                self.model.save(self.save_path)

    return True

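For completeness, the callback above is wired up roughly as in the Stable Baselines docs example: the env is wrapped in a Monitor that writes the episode rewards read by load_results. The class name and constructor below are a sketch of my setup, not the exact code:

import os

import gym
import numpy as np
# import gym_environment  # registers FullFilterEnv-v0 (custom package)
from stable_baselines import SAC
from stable_baselines.bench import Monitor
from stable_baselines.common.callbacks import BaseCallback
from stable_baselines.results_plotter import load_results, ts2xy


class SaveOnBestRewardCallback(BaseCallback):
    """Sketch of the callback class around the _on_step method shown above."""

    def __init__(self, check_freq, log_dir, verbose=1):
        super().__init__(verbose)
        self.check_freq = check_freq
        self.log_dir = log_dir
        self.save_path = os.path.join(log_dir, "best_model")
        self.best_mean_reward = -np.inf

    def _on_step(self) -> bool:
        # Body as shown above: every check_freq calls, read the Monitor logs with
        # load_results/ts2xy, average the last 40 episode rewards and save the best model.
        return True


log_dir = "/tmp/sac_fullfilter/"
os.makedirs(log_dir, exist_ok=True)
env = Monitor(gym.make("FullFilterEnv-v0"), log_dir)  # Monitor writes the episode rewards
model = SAC("MlpPolicy", env, verbose=1)
model.learn(total_timesteps=50000,
            callback=SaveOnBestRewardCallback(check_freq=1000, log_dir=log_dir))
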
blurLake commented 3 years ago

Here is one example of training with trial 697's parameters. It did not go all the way to 50,000 timesteps; since this was the second time I tried it, I stopped it earlier.

--------------------------------------
| current_lr              | 0.000591 |
| ep_rewmean              | -1e+03   |
| episodes                | 8460     |
| eplenmean               | 5        |
| fps                     | 14       |
| mean 100 episode reward | -1e+03   |
| n_updates               | 330      |
| time_elapsed            | 2859     |
| total timesteps         | 42300    |
--------------------------------------
Num timesteps: 42320
Best mean reward: -871.11 - Last mean reward per episode: -1010.43
blurLake commented 3 years ago

A little follow-up: is it true that in the zoo's hyperparameter optimization for SAC, layer_norm = False is the default?

araffin commented 3 years ago

Notice that two trials give -100. A similar "irreproducibility" happens with other trials as well. Is this something known in the zoo, or did I do something wrong?

BTW, I use the same random seed as in the zoo, i.e.,

There are a few different things here. First, you need to make sure that your environment is deterministic. Second, the seed is only set at the beginning of training: during hyperparameter optimization, the seed is not re-set for every trial, which would explain why you cannot reproduce the results using the tuned hyperparameters. Finally, if you are using a GPU then, as mentioned in the doc, we cannot ensure full reproducibility of the run because of TF. Full reproducibility is however possible in the PyTorch Stable-Baselines3 version: https://github.com/DLR-RM/stable-baselines3
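A quick way to check that the env itself is deterministic is to roll it out twice with the same seed and compare the trajectories, roughly like this (sketch only, env id taken from your command above):

import gym
import numpy as np
# import gym_environment  # registers FullFilterEnv-v0


def rollout(env, seed, n_steps=20):
    # Fix the env and action-sampling seeds, then collect a short trajectory.
    env.seed(seed)
    env.action_space.seed(seed)
    obs = env.reset()
    trajectory = []
    for _ in range(n_steps):
        action = env.action_space.sample()
        obs, reward, done, _ = env.step(action)
        trajectory.append((np.asarray(obs).copy(), reward, done))
        if done:
            obs = env.reset()
    return trajectory


env = gym.make("FullFilterEnv-v0")
for (obs_a, rew_a, done_a), (obs_b, rew_b, done_b) in zip(rollout(env, seed=0), rollout(env, seed=0)):
    assert np.allclose(obs_a, obs_b) and rew_a == rew_b and done_a == done_b, "Env is not deterministic"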

A little follow-up: is it true that in the zoo's hyperparameter optimization for SAC, layer_norm = False is the default?

yes
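
(If you want to try layer normalization outside the zoo's search space: if I remember correctly, SAC in Stable Baselines ships a LnMlpPolicy, the layer-normalized variant of MlpPolicy, e.g.:)

from stable_baselines import SAC
# import gym_environment  # registers FullFilterEnv-v0

# LnMlpPolicy = MlpPolicy with layer normalization; by default the zoo keeps layer_norm=False.
model = SAC("LnMlpPolicy", "FullFilterEnv-v0", verbose=1)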

blurLake commented 3 years ago

Hi, thanks for the suggestions. I think I found the problem. I am using ent_coef='auto' in SAC. At a certain point the action becomes NaN, which makes the state of the env NaN as well. Since NaN is not handled in the condition checking of the step function, this leads to the done flag being True even with a NaN state.

I guess it is similar to this.

Question: the previous hyperparameter combination was recommended by the zoo. Can trials with NaNs be eliminated by the zoo, so they are not recommended as the best trial (or be pruned)?

I saw that we can apply VecCheckNan to the env, but it seems step_async and step_wait are needed in the env. Is there an example of what these functions look like?

araffin commented 3 years ago

The previous hyperparameter combination was recommended by the zoo. Can trials with NaNs be eliminated by the zoo, so they are not recommended as the best trial (or be pruned)?

You should raise an exception (assertion error) and the trial will be ignored. See https://github.com/araffin/rl-baselines-zoo/blob/master/utils/hyperparams_opt.py#L112
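For instance, a minimal sketch as a gym wrapper (FailOnNaNWrapper is just an illustrative name; you can also put the assertions directly in your step function):

import gym
import numpy as np


class FailOnNaNWrapper(gym.Wrapper):
    """Raise an AssertionError as soon as a NaN/inf shows up, so the Optuna trial
    fails and gets discarded instead of being reported as a spurious success."""

    def step(self, action):
        assert np.all(np.isfinite(action)), "NaN/inf in action: {}".format(action)
        obs, reward, done, info = self.env.step(action)
        assert np.all(np.isfinite(obs)), "NaN/inf in observation: {}".format(obs)
        assert np.isfinite(reward), "NaN/inf in reward: {}".format(reward)
        return obs, reward, done, info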

I saw that we can apply VecCheckNan to the env, but it seems step_async and step_wait are needed in the env. Is there an example of what these functions look like?

Please read the documentation for that.
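
For reference, VecCheckNan is a VecEnv wrapper, so you do not implement step_async/step_wait in your env yourself: you wrap a vectorized env (e.g. a DummyVecEnv) and those methods come from the wrapper, roughly:

import gym
from stable_baselines.common.vec_env import DummyVecEnv, VecCheckNan
# import gym_environment  # registers FullFilterEnv-v0

# DummyVecEnv provides step_async/step_wait; VecCheckNan wraps it and raises
# an error as soon as a NaN or inf is encountered.
env = DummyVecEnv([lambda: gym.make("FullFilterEnv-v0")])
env = VecCheckNan(env, raise_exception=True)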