DLR-RM / stable-baselines3

PyTorch version of Stable Baselines, reliable implementations of reinforcement learning algorithms.
https://stable-baselines3.readthedocs.io
MIT License

When training SAC/PPO models with a custom environment, I consistently observe the model predicting the same action. #1801

Closed: oldwhitenotblack closed this issue 9 months ago

oldwhitenotblack commented 10 months ago

🐛 Bug

When training SAC/PPO models with a custom environment, I consistently see the trained model predict the same action. Even when I add a large penalty in the environment to discourage repeatedly selecting the same action, the issue persists.

[Screenshot 2024-01-09 153015]

Code example

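from stable_baselines3 import SAC
from stable_baselines3.common.logger import configure
from stable_baselines3.common.monitor import Monitor

log_dir = "new_sac_log"
# note: this logger is created but never attached (that would require model.set_logger(new_log))
new_log = configure(log_dir, ["stdout", "csv", "log"])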

# env is the custom environment instance (its definition was not shared in this issue)
env = Monitor(env, log_dir)
model = SAC("MultiInputPolicy", env, tensorboard_log=log_dir, verbose=1)
model.learn(total_timesteps=10000, log_interval=2, tb_log_name="sac_10000_positiveIMP", progress_bar=True)
model.save("sac_pendulum")

del model # remove to demonstrate saving and loading

model = SAC.load("sac_pendulum")

obs, info = env.reset()

# Roll out one episode with the deterministic policy and append every action to a file
with open("action_list.txt", "a") as file_a:
    while True:
        action, _states = model.predict(obs, deterministic=True)
        file_a.write(str(action) + "\n")
        obs, reward, terminated, truncated, info = env.step(action)
        if terminated or truncated:
            break

Relevant log output / Error message

Below are the output of check_env(), my observation space, and my action space.
check_env() result:
C:\Users\notblack\.conda\envs\contaminatedPD_SB3\lib\site-packages\stable_baselines3\common\env_checker.py:251: UserWarning: Your observation ContaminateStateMatrix has an unconventional shape (neither an image, nor a 1D vector). We recommend you to flatten the observation to have only a 1D vector or use a custom policy to properly process the data.
  warnings.warn(
C:\Users\notblack\.conda\envs\contaminatedPD_SB3\lib\site-packages\stable_baselines3\common\env_checker.py:251: UserWarning: Your observation PhysicalDesignSubmission has an unconventional shape (neither an image, nor a 1D vector). We recommend you to flatten the observation to have only a 1D vector or use a custom policy to properly process the data.
  warnings.warn(
C:\Users\notblack\.conda\envs\contaminatedPD_SB3\lib\site-packages\stable_baselines3\common\env_checker.py:251: UserWarning: Your observation UnitUsedMatrix has an unconventional shape (neither an image, nor a 1D vector). We recommend you to flatten the observation to have only a 1D vector or use a custom policy to properly process the data.
  warnings.warn(
C:\Users\notblack\.conda\envs\contaminatedPD_SB3\lib\site-packages\stable_baselines3\common\env_checker.py:441: UserWarning: We recommend you to use a symmetric and normalized Box action space (range=[-1, 1]) cf. https://stable-baselines3.readthedocs.io/en/master/guide/rl_tips.html
  warnings.warn(
C:\Users\notblack\.conda\envs\contaminatedPD_SB3\lib\site-packages\stable_baselines3\common\env_checker.py:452: UserWarning: Your action space has dtype int32, we recommend using np.float32 to avoid cast errors.
  warnings.warn(

self.observation_space:
Dict('ContaminateStateMatrix': Box(0, 999, (10, 10), int32), 'PhysicalDesignSubmission': Box(0, 10, (10, 10), int32), 'UnitUsedMatrix': Box(0, 10, (10, 10), int32))
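The unconventional-shape warnings above concern these three 10x10 matrices. A minimal sketch of one way to follow the flattening recommendation, assuming the Box bounds shown here are the true value ranges: concatenate the matrices into a single 300-element float32 vector and divide each by its upper bound so every entry lies in [0, 1]. `OBS_HIGH` and `flatten_and_normalize` are hypothetical names; in an actual setup this logic could live in the observation method of a `gymnasium.ObservationWrapper`, with the observation space redeclared as `Box(0.0, 1.0, (300,), float32)`.

```python
import numpy as np

# Hypothetical table of upper bounds, copied from the observation space above
OBS_HIGH = {
    "ContaminateStateMatrix": 999,
    "PhysicalDesignSubmission": 10,
    "UnitUsedMatrix": 10,
}


def flatten_and_normalize(obs: dict) -> np.ndarray:
    """Concatenate the three 10x10 matrices into one float32 vector scaled to [0, 1]."""
    parts = [
        np.asarray(obs[key], dtype=np.float32).ravel() / high
        for key, high in OBS_HIGH.items()
    ]
    return np.concatenate(parts)


# Made-up example observation, only to show the resulting shape and range
example = {
    "ContaminateStateMatrix": np.full((10, 10), 999, dtype=np.int32),
    "PhysicalDesignSubmission": np.zeros((10, 10), dtype=np.int32),
    "UnitUsedMatrix": np.full((10, 10), 5, dtype=np.int32),
}
flat = flatten_and_normalize(example)
print(flat.shape, flat.dtype, flat.min(), flat.max())
```

Dividing by the bound matters mostly for ContaminateStateMatrix, whose raw values run up to 999 while the other two matrices stay within 0 to 10.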

self.action_space:
Box(0, 9, (2,), int32)
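check_env also recommends a symmetric, normalized float32 Box action space instead of `Box(0, 9, (2,), int32)`. A minimal sketch, assuming the environment ultimately needs integer indices 0 to 9 in each of the two dimensions: let the agent act in `Box(-1, 1, (2,), float32)` and map that onto the integer grid before the environment uses it. `to_env_action`, `LOW` and `HIGH` are hypothetical names; the same mapping could sit in the action method of a `gymnasium.ActionWrapper`.

```python
import numpy as np

# Hypothetical constants: bounds of the original Box(0, 9, (2,), int32) action space
LOW, HIGH = 0, 9


def to_env_action(policy_action: np.ndarray) -> np.ndarray:
    """Map a float action in [-1, 1] onto the integer grid {LOW, ..., HIGH}."""
    # Clip first so out-of-range policy outputs cannot leave the grid
    clipped = np.clip(np.asarray(policy_action, dtype=np.float32), -1.0, 1.0)
    # Linear rescale: -1 -> LOW, +1 -> HIGH
    scaled = LOW + (clipped + 1.0) * 0.5 * (HIGH - LOW)
    # Round to the nearest integer index expected by the environment
    return np.rint(scaled).astype(np.int32)


print(to_env_action(np.array([-1.0, 1.0])))
print(to_env_action(np.array([0.0, 5.0])))
```

`np.rint` rounds halves to the nearest even integer, so an action of exactly 0.0 (scaled to 4.5) maps to 4, and the out-of-range 5.0 is clipped to 1.0 before scaling, giving 9.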

System Info

Checklist

fracapuano commented 9 months ago

Hey, just flagging that I often ran into similar issues with custom envs when I was starting out. Now I almost always avoid them by making sure my custom envs pass check_env (from stable_baselines3.common.env_checker import check_env).

Of course, happy to help further if this does not solve your issue :) It would also help if you gave a bit more detail on your bug (a link to the repo where you defined your custom env would be great; otherwise, please share some more info on the environment right here).

araffin commented 9 months ago

If code there is, it is minimal and working

Closing because the minimum requirements for seeking help are not met.