gSDE noise sampling with TQC can raise ValueError due to nan in `log_std`

Description

In rare cases (encountered once so far), noise sampling in gSDE can fail. It happened with TQC on HalfCheetahBulletEnv-v0 after 800k timesteps. For an unknown reason the entropy loss diverged, and the resulting standard deviation contained nan values. This might be related to https://github.com/DLR-RM/rl-baselines3-zoo/issues/322

Run detailed here: https://wandb.ai/openrlbenchmark/sb3/runs/27cez5ua

To reproduce:

```
python -m rl_zoo3.train --algo tqc --env Ant-v3 --eval-episodes 20 --n-eval-envs 5 --seed 2609763199
```

```
Traceback (most recent call last):
  File "/gpfslocalsup/pub/anaconda-py3/2021.05/envs/python-3.9.12/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/gpfslocalsup/pub/anaconda-py3/2021.05/envs/python-3.9.12/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/gpfsdswork/projects/rech/uli/upf82sp/rl-baselines3-zoo/rl_zoo3/train.py", line 283, in <module>
    train()
  File "/gpfsdswork/projects/rech/uli/upf82sp/rl-baselines3-zoo/rl_zoo3/train.py", line 276, in train
    exp_manager.learn(model)
  File "/gpfsdswork/projects/rech/uli/upf82sp/rl-baselines3-zoo/rl_zoo3/exp_manager.py", line 235, in learn
    model.learn(self.n_timesteps, **kwargs)
  File "/gpfsdswork/projects/rech/uli/upf82sp/stable-baselines3-contrib/sb3_contrib/tqc/tqc.py", line 296, in learn
    return super().learn(
  File "/gpfsdswork/projects/rech/uli/upf82sp/stable-baselines3/stable_baselines3/common/off_policy_algorithm.py", line 353, in learn
    self.train(batch_size=self.batch_size, gradient_steps=gradient_steps)
  File "/gpfsdswork/projects/rech/uli/upf82sp/stable-baselines3-contrib/sb3_contrib/tqc/tqc.py", line 205, in train
    self.actor.reset_noise()
  File "/gpfsdswork/projects/rech/uli/upf82sp/stable-baselines3-contrib/sb3_contrib/tqc/policies.py", line 142, in reset_noise
    self.action_dist.sample_weights(self.log_std, batch_size=batch_size)
  File "/gpfsdswork/projects/rech/uli/upf82sp/stable-baselines3/stable_baselines3/common/distributions.py", line 504, in sample_weights
    self.weights_dist = Normal(th.zeros_like(std), std)
  File "/gpfsdswork/projects/rech/uli/upf82sp/env_benchmark/lib/python3.9/site-packages/torch/distributions/normal.py", line 56, in __init__
    super(Normal, self).__init__(batch_shape, validate_args=validate_args)
  File "/gpfsdswork/projects/rech/uli/upf82sp/env_benchmark/lib/python3.9/site-packages/torch/distributions/distribution.py", line 56, in __init__
    raise ValueError(
ValueError: Expected parameter scale (Tensor of shape (300, 6)) of distribution Normal(loc: torch.Size([300, 6]), scale: torch.Size([300, 6])) to satisfy the constraint GreaterThan(lower_bound=0.0), but found invalid values:
tensor([[   nan, 0.0074, 0.0030, 0.0102, 0.0134, 0.0056],
        [   nan, 0.0036, 0.0056, 0.0066, 0.0092, 0.0084],
        [   nan, 0.0025, 0.0013, 0.0026, 0.0016, 0.0014],
        ...,
        [   nan, 0.0027, 0.0031, 0.0028, 0.0023, 0.0030],
        [   nan, 0.0073, 0.0029, 0.0083, 0.0040, 0.0053],
        [   nan, 0.0036, 0.0014, 0.0052, 0.0019, 0.0019]],
       grad_fn=<ExpBackward0>)
```
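
For anyone reading the traceback above: the `grad_fn=<ExpBackward0>` indicates the scale comes from an `exp`, and a nan input stays nan through `exp`. Since `nan > 0.0` evaluates to false, it fails the `GreaterThan(lower_bound=0.0)` constraint. The following is only a plain-Python sketch of that mechanism, not the SB3 or PyTorch code; the helper names `std_from_log_std` and `invalid_columns` are made up for illustration, and the sample rows are the first three printed in the error message.

```python
import math


def std_from_log_std(log_std_rows):
    """Element-wise exp, mimicking std = exp(log_std) (hypothetical helper)."""
    return [[math.exp(v) for v in row] for row in log_std_rows]


def invalid_columns(std_rows):
    """Return sorted column indices holding a value that is not finite and > 0."""
    bad = set()
    for row in std_rows:
        for j, value in enumerate(row):
            # Any comparison with nan is False, so nan fails a "> 0" check.
            if not (math.isfinite(value) and value > 0.0):
                bad.add(j)
    return sorted(bad)


# A nan in log_std propagates through exp unchanged.
nan_std = std_from_log_std([[math.nan, -5.0]])

# First three rows of the tensor printed in the error message.
std_rows = [
    [math.nan, 0.0074, 0.0030, 0.0102, 0.0134, 0.0056],
    [math.nan, 0.0036, 0.0056, 0.0066, 0.0092, 0.0084],
    [math.nan, 0.0025, 0.0013, 0.0026, 0.0016, 0.0014],
]

print(math.isnan(nan_std[0][0]))   # True
print(invalid_columns(std_rows))   # [0]
```

In the printed rows only the first column is invalid, which suggests the nan is confined to a single action dimension rather than spread across the whole parameter.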

System Info

Stable-Baselines3: 1.8.0a3
sb3-contrib: 1.7.0
GPU models and configuration: no GPU
Python version: 3.9.12
PyTorch version: 1.13
Gym version: 0.21.0

Stable-Baselines-Team / stable-baselines3-contrib

gSDE noise sampling with TQC can raise ValueError due to nan in `log_std` #146