Closed — blurLake closed this issue 3 years ago
Here is one example using trial 697's parameters. It did not run all the way to 50,000 timesteps; since this was the second time I tried it, I stopped it earlier.
--------------------------------------
| current_lr | 0.000591 |
| ep_rewmean | -1e+03 |
| episodes | 8460 |
| eplenmean | 5 |
| fps | 14 |
| mean 100 episode reward | -1e+03 |
| n_updates | 330 |
| time_elapsed | 2859 |
| total timesteps | 42300 |
--------------------------------------
Num timesteps: 42320
Best mean reward: -871.11 - Last mean reward per episode: -1010.43
A little follow-up: is it true that layer_norm = False is the default in the zoo's hyperparameter optimization for SAC?
Notice that two trials give -100. Similar irreproducibility happens with other trials as well. Is this something known in the zoo, or is there something I did wrong?
BTW, I use the same random seed as in the zoo, i.e.,
A few different things here. First, you need to make sure that your environment is deterministic. Second, the seed is set only at the beginning of training; when doing hyperparameter optimization, the seed is not re-set for every run, which would explain why you cannot reproduce the results using the tuned hyperparameters. Finally, if you are using a GPU then, as mentioned in the documentation, we cannot ensure full reproducibility of a run because of TF. Reproducibility is possible, however, in the PyTorch Stable-Baselines3 version: https://github.com/DLR-RM/stable-baselines3
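The seeding point above can be illustrated with a small sketch (plain Python/NumPy, not zoo code): if the seed is set only once at the start, later runs pick up wherever the RNG stream happens to be, while explicitly re-seeding makes a run reproducible.

```python
import random

import numpy as np

def set_global_seed(seed):
    # Seed every RNG the training loop touches; runs launched
    # without this start from whatever state the RNG is in,
    # which is why tuned trials are hard to reproduce exactly.
    random.seed(seed)
    np.random.seed(seed)

set_global_seed(42)
first = np.random.rand(3)
set_global_seed(42)
second = np.random.rand(3)
assert np.allclose(first, second)  # same seed, same draw
```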
A little follow-up: is it true that layer_norm = False is the default in the zoo's hyperparameter optimization for SAC?
yes
Hi, thanks for the suggestions. I think I found the problem. I am using ent_coef='auto' in SAC. At a certain point the action becomes NaN, which makes the state of the env NaN as well. Since NaN is not handled in the condition checking in the step function, this leads to done = True even with a NaN state.
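A minimal sketch of that failure mode (a toy env, not the actual environment from this issue): every comparison involving NaN evaluates to False, so a done/success check silently misbehaves once NaN reaches the state; checking the action with np.isfinite fails fast instead.

```python
import numpy as np

class ToyEnv:
    """Toy 2-D point env. The guard in step() raises as soon as a
    non-finite action arrives, instead of letting NaN propagate
    into the state, where done-condition checks go wrong (every
    comparison involving NaN is False)."""

    def __init__(self):
        self.state = np.zeros(2)

    def step(self, action):
        action = np.asarray(action, dtype=np.float64)
        if not np.all(np.isfinite(action)):
            raise ValueError("non-finite action: %r" % action)
        self.state = self.state + action
        reward = -float(np.linalg.norm(self.state))
        done = reward > -1e-2  # toy success: close to the origin
        return self.state, reward, done, {}
```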
I guess it is similar to this.
Questions: The previous hyperparameter combination was recommended by the zoo. Can trials with NaNs be eliminated by the zoo so they are never recommended as the best trial (or be pruned)?
I saw that we can apply VecCheckNan to the env, but it seems step_async and step_wait are needed in the env. Is there an example of what these functions should look like?
The previous hyperparameter combination was recommended by the zoo. Can trials with NaNs be eliminated by the zoo so they are never recommended as the best trial (or be pruned)?
You should raise an exception (assertion error) and the trial will be ignored. See https://github.com/araffin/rl-baselines-zoo/blob/master/utils/hyperparams_opt.py#L112
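In sketch form (plain Python; the real check lives in the linked hyperparams_opt.py), the objective simply asserts that the evaluated reward is finite, and the raised AssertionError makes the trial fail rather than rank among the best:

```python
import math

def guarded_objective(evaluate_policy):
    """Evaluate a trial, but refuse to report a non-finite score.
    A raised exception marks the trial as failed, so it can never
    be returned as the 'best' trial."""
    mean_reward = evaluate_policy()
    assert math.isfinite(mean_reward), (
        "trial produced non-finite reward: %r" % mean_reward)
    return -mean_reward  # the zoo minimizes the negative reward
```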
I saw that we can apply VecCheckNan to the env, but it seems step_async and step_wait are needed in the env. Is there an example of what these functions should look like?
Please read the documentation for that.
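For illustration only (a stripped-down sketch, not the Stable-Baselines API itself): step_async just records the batch of actions and step_wait executes them and returns batched results. In practice you rarely write these yourself, since wrapping the env as VecCheckNan(DummyVecEnv([make_env])) provides both methods.

```python
import numpy as np

class MiniVecEnv:
    """Sketch of the VecEnv step split for a single wrapped env:
    step_async(actions) stores the actions, step_wait() actually
    steps the env and returns batched (obs, rewards, dones, infos)."""

    def __init__(self, env):
        self.env = env
        self._actions = None

    def step_async(self, actions):
        self._actions = actions  # nothing runs yet

    def step_wait(self):
        obs, reward, done, info = self.env.step(self._actions[0])
        return (np.array([obs]), np.array([reward]),
                np.array([done]), [info])

    def step(self, actions):
        # The usual convenience wrapper: async then wait.
        self.step_async(actions)
        return self.step_wait()
```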
Hi, I am using zoo to optimise the parameters for SAC with a customised env. The code I used was
I use --eval-episodes 40 to get agents with more stable performance.
Some details about the env: each episode is at most 5 steps long. The reward for an ordinary step is the negative of a Euclidean norm, say -||x - x_target||, and a successful step gets reward +100. Once +100 is reached, the episode ends.
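The reward scheme described above can be sketched as follows (x_target and the success tolerance are placeholders, not the actual values from this issue):

```python
import numpy as np

def step_reward(x, x_target, tol=1e-3):
    """Per-step reward from the description: -||x - x_target|| for
    an ordinary step, +100 (and episode termination) on success."""
    dist = float(np.linalg.norm(x - x_target))
    if dist < tol:  # placeholder success condition
        return 100.0, True   # reward, done
    return -dist, False
```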
In the zoo, I get some results like
That means that over the 40 evaluation episodes after 50,000 timesteps, every episode finished in a single step and directly received reward +100, which seems too good to be true. So I took the recommended parameters and ran real training on the same env, using 40 episodes to calculate the mean episode reward. But after 50,000 timesteps the mean episode reward was only around -900, which is far from success in each episode.
Notice that two trials give -100. Similar irreproducibility happens with other trials as well. Is this something known in the zoo, or is there something I did wrong?
BTW, I use the same random seed as in the zoo, i.e.,
The code I used in the callback to calculate mean ep_reward.
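The callback code itself is not shown; a minimal sketch of the bookkeeping such a callback needs (names hypothetical) is a rolling mean over the last 40 completed episodes:

```python
from collections import deque

class EpisodeRewardTracker:
    """Rolling mean of the last n completed episode rewards: the
    quantity a training callback would log as the mean ep_reward."""

    def __init__(self, n=40):
        self._episodes = deque(maxlen=n)
        self._running = 0.0

    def record_step(self, reward, done):
        self._running += reward
        if done:
            self._episodes.append(self._running)
            self._running = 0.0

    def mean_episode_reward(self):
        if not self._episodes:
            return float("nan")
        return sum(self._episodes) / len(self._episodes)
```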