DLR-RM / stable-baselines3

PyTorch version of Stable Baselines, reliable implementations of reinforcement learning algorithms.
https://stable-baselines3.readthedocs.io
MIT License

[Bug] Unbounded action spaces #897

Closed: vadim0x60 closed this issue 2 years ago

vadim0x60 commented 2 years ago

Bug, feature, reasonable design choice - underline your opinion

The RL algorithms implemented in stable-baselines3 do not support unbounded action spaces, i.e.

gym.spaces.Box(shape=some_shape, low=-np.inf, high=np.inf)

🐛 Definitely a Bug

When someone attempts to use stable-baselines3 on an environment with an unbounded action space, the failure happens silently. No error of any kind is raised: the "learning" goes on for however many timesteps you have specified, and the only symptom is that all tensors in the resulting neural networks become float('nan'). I also have not found this limitation mentioned explicitly anywhere in the documentation, so users have no way to know they are using an unsupported environment until many steps into training, and even then it takes some debugging to figure out the reason.

To Reproduce

import numpy as np
from stable_baselines3 import TD3
from stable_baselines3.common.monitor import Monitor
import gym

# Any shape reproduces the problem; (4,) is an arbitrary choice
some_shape = (4,)

class MirrorEnv(gym.Env):
    """Toy env: the next observation is simply the last action."""

    def __init__(self):
        self.observation_space = gym.spaces.Box(shape=some_shape,
                                                low=-np.inf, high=np.inf,
                                                dtype=np.float32)
        self.action_space = gym.spaces.Box(shape=some_shape,
                                           low=-np.inf, high=np.inf,
                                           dtype=np.float32)

    def reset(self):
        return self.observation_space.sample()

    def step(self, a):
        obs = a
        r = np.mean(a)
        # the gym API expects the info value to be a dict
        return obs, r, False, {}

env = MirrorEnv()
env = Monitor(env)

model = TD3('MlpPolicy', env, verbose=1)
model.learn(total_timesteps=25000)
---------------------------------
| rollout/           |          |
|    ep_len_mean     | 1e+03    |
|    ep_rew_mean     | nan      |
| time/              |          |
|    episodes        | 4        |
|    fps             | 0        |
|    time_elapsed    | 4411     |
|    total_timesteps | 4000     |
| train/             |          |
|    actor_loss      | nan      |
|    critic_loss     | nan      |
|    learning_rate   | 0.001    |
|    n_updates       | 3000     |
---------------------------------
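
Side note: the failure can at least be made loud rather than silent by wrapping the environment in SB3's VecCheckNan. A minimal sketch, reusing the MirrorEnv above (the wrapper raises as soon as a NaN or inf appears in the observations, rewards or actions):

from stable_baselines3.common.vec_env import DummyVecEnv, VecCheckNan

# Wrap the env in a VecEnv and let VecCheckNan raise instead of silently training on NaNs
vec_env = VecCheckNan(DummyVecEnv([MirrorEnv]), raise_exception=True)
model = TD3('MlpPolicy', vec_env, verbose=1)
model.learn(total_timesteps=25000)  # now aborts with a ValueError once NaNs/infs show up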

Expected behavior

Either:

---------------------------------
| rollout/           |          |
|    ep_len_mean     | 1e+03    |
|    ep_rew_mean     | 42       |
| time/              |          |
|    episodes        | 4        |
|    fps             | 0        |
|    time_elapsed    | 4411     |
|    total_timesteps | 4000     |
| train/             |          |
|    actor_loss      | 42       |
|    critic_loss     | 42       |
|    learning_rate   | 0.001    |
|    n_updates       | 3000     |
---------------------------------

(42 is used to represent 'some number' without loss of generality)

Or:

ValueError("Unsupported action space: make sure your Box space has a finite lower and upper bound")

System Info

OS: Linux-4.15.0-176-generic-x86_64-with-glibc2.27 #185-Ubuntu SMP Tue Mar 29 17:40:04 UTC 2022
Python: 3.9.12
Stable-Baselines3: 1.5.0
PyTorch: 1.11.0+cu102
GPU Enabled: True
Numpy: 1.22.3
Gym: 0.21.0

Additional context

This hints, of course, at a much bigger issue in the field of Reinforcement Learning: implicit knowledge like "unbounded action spaces are technically allowed by gym, but they aren't a thing in modern RL methods" exists in everyone's heads, but not in the documentation of our tools.

araffin commented 2 years ago

Hello, I would say that infinity does not exist in the real world, so in practice you will always have a lower and an upper bound (be it a large one) that you should be able to specify. Anyway, as mentioned in our documentation (see the tips and tricks section), we do not recommend un-normalized action spaces (the env checker will warn you about that).

ValueError("Unsupported action space: make sure your Box space has a finite lower and upper bound")

I do agree; we would appreciate a PR that solves this issue and updates the env checker, which currently only outputs a warning.
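
For anyone picking this up, a rough sketch of what the stricter check could look like (the helper name is a placeholder, not existing SB3 code); the env checker could emit this as a warning while the algorithms raise:

import numpy as np
from gym import spaces

def assert_finite_box_bounds(action_space: spaces.Space) -> None:
    # Hypothetical helper: reject Box action spaces whose bounds are not finite
    if isinstance(action_space, spaces.Box) and not (
        np.all(np.isfinite(action_space.low)) and np.all(np.isfinite(action_space.high))
    ):
        raise ValueError(
            "Unsupported action space: make sure your Box space has a finite lower and upper bound"
        )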

Miffyli commented 2 years ago

I wonder if the SB3 core code should raise an exception for this situation, as the env checker is not explicitly required to run agents. If setting inf bounds in a Box space indeed results in NaNs/invalid values, it should be safe-guarded with an informative exception, in my opinion.
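
As a rough illustration (not actual SB3 internals), such a guard could run at model construction, so the user gets an informative error before any training step even if they never call check_env. Reusing the assert_finite_box_bounds sketch above and the MirrorEnv from the report:

# Fail fast at construction time instead of producing NaNs thousands of steps later
env = MirrorEnv()
assert_finite_box_bounds(env.action_space)  # would raise ValueError here
model = TD3('MlpPolicy', env, verbose=1)    # never reached with an unbounded Box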