DLR-RM / rl-baselines3-zoo

A training framework for Stable Baselines3 reinforcement learning agents, with hyperparameter optimization and pre-trained agents included.
https://rl-baselines3-zoo.readthedocs.io
MIT License

[Bug]: Nan Problems for SAC, TQC, for AntBulletEnv-v0, HalfCheetahBulletEnv-v0 #427

Open ZJEast opened 7 months ago

ZJEast commented 7 months ago

šŸ› Bug

Hello. I am trying to reproduce some algorithms and experiments in order to record some data, but something unexpected happens: NaN values are generated for unknown reasons. Any advice on how to solve this?

To Reproduce

python -u ../../rl-baselines3-zoo-master/train.py --algo sac --env AntBulletEnv-v0 --n-timesteps 20000000 --tensorboard-log tf-logs
python -u ../../rl-baselines3-zoo-master/train.py --algo sac --env HalfCheetahBulletEnv-v0 --n-timesteps 20000000 --tensorboard-log tf-logs
python -u ../../rl-baselines3-zoo-master/train.py --algo tqc --env AntBulletEnv-v0 --n-timesteps 20000000 --tensorboard-log tf-logs
python -u ../../rl-baselines3-zoo-master/train.py --algo tqc --env HalfCheetahBulletEnv-v0 --n-timesteps 20000000 --tensorboard-log tf-logs

Relevant log output / Error message

python -u ../../rl-baselines3-zoo-master/train.py --algo sac --env AntBulletEnv-v0 --n-timesteps 20000000 --tensorboard-log tf-logs
Traceback (most recent call last):
  File "/share/home/zhangjundong/exp/sac-AntBulletEnv-v0/../../rl-baselines3-zoo-master/train.py", line 4, in <module>
    train()
  File "/share/home/zhangjundong/rl-baselines3-zoo-master/rl_zoo3/train.py", line 272, in train
    exp_manager.learn(model)
  File "/share/home/zhangjundong/rl-baselines3-zoo-master/rl_zoo3/exp_manager.py", line 240, in learn
    model.learn(self.n_timesteps, **kwargs)
  File "/share/home/zhangjundong/stable-baselines3-master/stable_baselines3/sac/sac.py", line 307, in learn
    return super().learn(
  File "/share/home/zhangjundong/stable-baselines3-master/stable_baselines3/common/off_policy_algorithm.py", line 347, in learn
    self.train(batch_size=self.batch_size, gradient_steps=gradient_steps)
  File "/share/home/zhangjundong/stable-baselines3-master/stable_baselines3/sac/sac.py", line 219, in train
    self.actor.reset_noise()
  File "/share/home/zhangjundong/stable-baselines3-master/stable_baselines3/sac/policies.py", line 145, in reset_noise
    self.action_dist.sample_weights(self.log_std, batch_size=batch_size)
  File "/share/home/zhangjundong/stable-baselines3-master/stable_baselines3/common/distributions.py", line 508, in sample_weights
    self.weights_dist = Normal(th.zeros_like(std), std)
  File "/share/home/zhangjundong/.local/lib/python3.9/site-packages/torch/distributions/normal.py", line 56, in __init__
    super().__init__(batch_shape, validate_args=validate_args)
  File "/share/home/zhangjundong/.local/lib/python3.9/site-packages/torch/distributions/distribution.py", line 68, in __init__
    raise ValueError(
ValueError: Expected parameter scale (Tensor of shape (300, 8)) of distribution Normal(loc: torch.Size([300, 8]), scale: torch.Size([300, 8])) to satisfy the constraint GreaterThan(lower_bound=0.0), but found invalid values:
tensor([[nan, nan, nan,  ..., nan, nan, nan],
        [nan, nan, nan,  ..., nan, nan, nan],
        [nan, nan, nan,  ..., nan, nan, nan],
        ...,
        [nan, nan, nan,  ..., nan, nan, nan],
        [nan, nan, nan,  ..., nan, nan, nan],
        [nan, nan, nan,  ..., nan, nan, nan]], device='cuda:0',
       grad_fn=<ExpBackward0>)
python -u ../../rl-baselines3-zoo-master/train.py --algo sac --env HalfCheetahBulletEnv-v0 --n-timesteps 20000000 --tensorboard-log tf-logs
Traceback (most recent call last):
  File "/share/home/zhangjundong/exp/sac-HalfCheetahBulletEnv-v0/../../rl-baselines3-zoo-master/train.py", line 4, in <module>
    train()
  File "/share/home/zhangjundong/rl-baselines3-zoo-master/rl_zoo3/train.py", line 272, in train
    exp_manager.learn(model)
  File "/share/home/zhangjundong/rl-baselines3-zoo-master/rl_zoo3/exp_manager.py", line 240, in learn
    model.learn(self.n_timesteps, **kwargs)
  File "/share/home/zhangjundong/stable-baselines3-master/stable_baselines3/sac/sac.py", line 307, in learn
    return super().learn(
  File "/share/home/zhangjundong/stable-baselines3-master/stable_baselines3/common/off_policy_algorithm.py", line 347, in learn
    self.train(batch_size=self.batch_size, gradient_steps=gradient_steps)
  File "/share/home/zhangjundong/stable-baselines3-master/stable_baselines3/sac/sac.py", line 219, in train
    self.actor.reset_noise()
  File "/share/home/zhangjundong/stable-baselines3-master/stable_baselines3/sac/policies.py", line 145, in reset_noise
    self.action_dist.sample_weights(self.log_std, batch_size=batch_size)
  File "/share/home/zhangjundong/stable-baselines3-master/stable_baselines3/common/distributions.py", line 508, in sample_weights
    self.weights_dist = Normal(th.zeros_like(std), std)
  File "/share/home/zhangjundong/.local/lib/python3.9/site-packages/torch/distributions/normal.py", line 56, in __init__
    super().__init__(batch_shape, validate_args=validate_args)
  File "/share/home/zhangjundong/.local/lib/python3.9/site-packages/torch/distributions/distribution.py", line 68, in __init__
    raise ValueError(
ValueError: Expected parameter scale (Tensor of shape (300, 6)) of distribution Normal(loc: torch.Size([300, 6]), scale: torch.Size([300, 6])) to satisfy the constraint GreaterThan(lower_bound=0.0), but found invalid values:
tensor([[nan, nan, nan, nan, nan, nan],
        [nan, nan, nan, nan, nan, nan],
        [nan, nan, nan, nan, nan, nan],
        ...,
        [nan, nan, nan, nan, nan, nan],
        [nan, nan, nan, nan, nan, nan],
        [nan, nan, nan, nan, nan, nan]], device='cuda:0',
       grad_fn=<ExpBackward0>)
python -u ../../rl-baselines3-zoo-master/train.py --algo tqc --env AntBulletEnv-v0 --n-timesteps 20000000 --tensorboard-log tf-logs
Traceback (most recent call last):
  File "/share/home/zhangjundong/exp/tqc-AntBulletEnv-v0/../../rl-baselines3-zoo-master/train.py", line 4, in <module>
    train()
  File "/share/home/zhangjundong/rl-baselines3-zoo-master/rl_zoo3/train.py", line 272, in train
    exp_manager.learn(model)
  File "/share/home/zhangjundong/rl-baselines3-zoo-master/rl_zoo3/exp_manager.py", line 240, in learn
    model.learn(self.n_timesteps, **kwargs)
  File "/share/home/zhangjundong/stable-baselines3-contrib-master/sb3_contrib/tqc/tqc.py", line 302, in learn
    return super().learn(
  File "/share/home/zhangjundong/stable-baselines3-master/stable_baselines3/common/off_policy_algorithm.py", line 347, in learn
    self.train(batch_size=self.batch_size, gradient_steps=gradient_steps)
  File "/share/home/zhangjundong/stable-baselines3-contrib-master/sb3_contrib/tqc/tqc.py", line 213, in train
    self.actor.reset_noise()
  File "/share/home/zhangjundong/stable-baselines3-contrib-master/sb3_contrib/tqc/policies.py", line 144, in reset_noise
    self.action_dist.sample_weights(self.log_std, batch_size=batch_size)
  File "/share/home/zhangjundong/stable-baselines3-master/stable_baselines3/common/distributions.py", line 508, in sample_weights
    self.weights_dist = Normal(th.zeros_like(std), std)
  File "/share/home/zhangjundong/.local/lib/python3.9/site-packages/torch/distributions/normal.py", line 56, in __init__
    super().__init__(batch_shape, validate_args=validate_args)
  File "/share/home/zhangjundong/.local/lib/python3.9/site-packages/torch/distributions/distribution.py", line 68, in __init__
    raise ValueError(
ValueError: Expected parameter scale (Tensor of shape (300, 8)) of distribution Normal(loc: torch.Size([300, 8]), scale: torch.Size([300, 8])) to satisfy the constraint GreaterThan(lower_bound=0.0), but found invalid values:
tensor([[nan, nan, nan,  ..., nan, nan, nan],
        [nan, nan, nan,  ..., nan, nan, nan],
        [nan, nan, nan,  ..., nan, nan, nan],
        ...,
        [nan, nan, nan,  ..., nan, nan, nan],
        [nan, nan, nan,  ..., nan, nan, nan],
        [nan, nan, nan,  ..., nan, nan, nan]], device='cuda:0',
       grad_fn=<ExpBackward0>)
python -u ../../rl-baselines3-zoo-master/train.py --algo tqc --env HalfCheetahBulletEnv-v0 --n-timesteps 20000000 --tensorboard-log tf-logs
Traceback (most recent call last):
  File "/share/home/zhangjundong/exp/tqc-HalfCheetahBulletEnv-v0/../../rl-baselines3-zoo-master/train.py", line 4, in <module>
    train()
  File "/share/home/zhangjundong/rl-baselines3-zoo-master/rl_zoo3/train.py", line 272, in train
    exp_manager.learn(model)
  File "/share/home/zhangjundong/rl-baselines3-zoo-master/rl_zoo3/exp_manager.py", line 240, in learn
    model.learn(self.n_timesteps, **kwargs)
  File "/share/home/zhangjundong/stable-baselines3-contrib-master/sb3_contrib/tqc/tqc.py", line 302, in learn
    return super().learn(
  File "/share/home/zhangjundong/stable-baselines3-master/stable_baselines3/common/off_policy_algorithm.py", line 347, in learn
    self.train(batch_size=self.batch_size, gradient_steps=gradient_steps)
  File "/share/home/zhangjundong/stable-baselines3-contrib-master/sb3_contrib/tqc/tqc.py", line 213, in train
    self.actor.reset_noise()
  File "/share/home/zhangjundong/stable-baselines3-contrib-master/sb3_contrib/tqc/policies.py", line 144, in reset_noise
    self.action_dist.sample_weights(self.log_std, batch_size=batch_size)
  File "/share/home/zhangjundong/stable-baselines3-master/stable_baselines3/common/distributions.py", line 508, in sample_weights
    self.weights_dist = Normal(th.zeros_like(std), std)
  File "/share/home/zhangjundong/.local/lib/python3.9/site-packages/torch/distributions/normal.py", line 56, in __init__
    super().__init__(batch_shape, validate_args=validate_args)
  File "/share/home/zhangjundong/.local/lib/python3.9/site-packages/torch/distributions/distribution.py", line 68, in __init__
    raise ValueError(
ValueError: Expected parameter scale (Tensor of shape (300, 6)) of distribution Normal(loc: torch.Size([300, 6]), scale: torch.Size([300, 6])) to satisfy the constraint GreaterThan(lower_bound=0.0), but found invalid values:
tensor([[0.0026, 0.0041,    nan, 0.0036, 0.0046, 0.0034],
        [0.0054, 0.0040,    nan, 0.0035, 0.0053, 0.0054],
        [0.0192, 0.0061,    nan, 0.0105, 0.0105, 0.0105],
        ...,
        [0.0257, 0.0262,    nan, 0.0058, 0.0023, 0.0098],
        [0.1410, 0.0130,    nan, 0.1707, 0.1281, 0.0216],
        [0.0494, 0.0480,    nan, 0.0506, 0.0509, 0.0487]], device='cuda:0',
       grad_fn=<ExpBackward0>)


qgallouedec commented 7 months ago

This may be due to a learning rate that is too high, see https://github.com/DLR-RM/rl-baselines3-zoo/issues/156#issuecomment-910097343; do you use the default hyperparams?

Also related (and probably duplicate): https://github.com/DLR-RM/stable-baselines3/issues/1401 and https://github.com/DLR-RM/stable-baselines3/issues/1418
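
For instance, something along these lines would be a quick sanity check (an untested sketch: Pendulum-v1 stands in for the PyBullet envs, and 1e-4 is just an arbitrary lower value, not a tuned one):

from stable_baselines3 import SAC

# Sketch: retrain with a lower learning rate to see whether the NaNs disappear.
model = SAC(
    "MlpPolicy",
    "Pendulum-v1",          # stand-in; use AntBulletEnv-v0 if the PyBullet envs are installed
    learning_rate=1e-4,     # illustrative lower value (the tuned zoo configs use a higher rate)
    use_sde=True,           # gSDE, as in the zoo's SAC configs for these envs
    policy_kwargs=dict(log_std_init=-3, net_arch=[400, 300]),
    verbose=1,
)
model.learn(total_timesteps=50_000)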

ZJEast commented 7 months ago

Yes, I use the default hyperparams. I will try a different learning rate later.

araffin commented 7 months ago

Hello, thanks for sharing the bug report. Does the NaN happen only for some runs or for all runs? Could you log and share a failed run using W&B? (that would allow us to take a look at all the logged data)

I also assume you are using the pybullet gymnasium repo?

I'll try to reproduce the issue in the meantime.

Also related: https://github.com/DLR-RM/stable-baselines3/issues/1372 changing to AdamW might solve the problem too.
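
As a rough illustration (not the only way to do it), the optimizer can be swapped through policy_kwargs; optimizer_class is a standard SB3 policy argument, and the weight_decay value below is only a placeholder:

import torch
from stable_baselines3 import SAC

# Sketch: switch the actor/critic optimizer from Adam (SB3's default) to AdamW.
model = SAC(
    "MlpPolicy",
    "Pendulum-v1",  # stand-in for the PyBullet envs from the report
    use_sde=True,
    policy_kwargs=dict(
        log_std_init=-3,
        net_arch=[400, 300],
        optimizer_class=torch.optim.AdamW,         # default is torch.optim.Adam
        optimizer_kwargs=dict(weight_decay=1e-2),  # placeholder value
    ),
)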

ZJEast commented 7 months ago

I have tried TD3, SAC, and TQC on some pybullet envs, and it only happens for the tasks I mentioned; the others are fine. I installed the pybullet envs with 'pip install -r ./requirements.txt'.

I can upload some log files.

sac-AntBulletEnv-v0.zip sac-HalfCheetahBulletEnv-v0.zip tqc-AntBulletEnv-v0.zip tqc-HalfCheetahBulletEnv-v0.zip

araffin commented 7 months ago

Thanks =)

Looking at the logs, it seems to be due to an explosion of the std (and you are using a much larger budget than the one we were using by default). So, setting use_expln=True (and maybe using AdamW) should solve your issue.

I would appreciate a PR that adds this parameter =)
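
For context, use_expln replaces the plain exponential that turns log_std into a positive std with a transform that only grows logarithmically for positive values of log_std, so the std cannot explode. Roughly (a paraphrase of the idea behind SB3's StateDependentNoiseDistribution, not the exact code):

import torch as th

def expln_std(log_std: th.Tensor) -> th.Tensor:
    # exp() only for non-positive values, where it cannot blow up
    below = th.exp(log_std) * (log_std <= 0).float()
    # log1p(x) + 1 for positive values, which grows far more slowly than exp(x)
    above = (th.log1p(log_std.clamp(min=0.0)) + 1.0) * (log_std > 0).float()
    return below + above  # always positive, continuous at 0

print(expln_std(th.tensor([-3.0, 0.0, 5.0])))
# -> roughly [0.0498, 1.0, 2.79]; a plain exp() would give ~148 for the last entry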

Hmm, for TD3 it is weird if it happens, as it doesn't rely on any distribution.

EDIT: I guess the issue is similar to https://github.com/Stable-Baselines-Team/stable-baselines3-contrib/issues/146 by @qgallouedec

qgallouedec commented 7 months ago

Bug already encountered in openrlbenchmark, ~I might have forgotten to report it~: https://wandb.ai/openrlbenchmark/sb3/runs/27cez5ua EDIT: I did report it, you're right @araffin ;)

qgallouedec commented 7 months ago

For TD3, I only found two runs where you have an explosion of the losses, but this didn't lead to the bug:
https://wandb.ai/openrlbenchmark/sb3/runs/2qdjqemd (Walker2DBulletEnv-v0)
https://wandb.ai/openrlbenchmark/sb3/runs/ffc7kx3m (BipedalWalkerHardcore-v0)

What a wonderful tool openrlbenchmark is, ping @vwxyzjn ;)

ZJEast commented 7 months ago

After I changed the hyperparams from

policy_kwargs: "dict(log_std_init=-3, net_arch=[400, 300])"

to

policy_kwargs: "dict(log_std_init=-3, net_arch=[400, 300], use_expln=True)"

the problem never happened again, so let's close this issue.

araffin commented 7 months ago

Thanks for trying it out =) I'm reopening as we need to change the defaults (we would welcome a PR).
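
Concretely, the change boils down to adding use_expln=True to the policy_kwargs of the affected SAC/TQC configs. In plain code, the TQC equivalent would look roughly like this (a sketch assuming sb3-contrib is installed, with Pendulum-v1 as a stand-in for the PyBullet envs):

from sb3_contrib import TQC

# Sketch: same fix as above, applied to TQC instead of SAC.
model = TQC(
    "MlpPolicy",
    "Pendulum-v1",  # substitute AntBulletEnv-v0 / HalfCheetahBulletEnv-v0 if installed
    use_sde=True,   # gSDE, where the std explosion occurred
    policy_kwargs=dict(log_std_init=-3, net_arch=[400, 300], use_expln=True),
    verbose=1,
)
model.learn(total_timesteps=10_000)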