DLR-RM / stable-baselines3

PyTorch version of Stable Baselines, reliable implementations of reinforcement learning algorithms.
https://stable-baselines3.readthedocs.io
MIT License

[Bug]: Inconsistent results with trained agent in OpenAI gym environment #1502

Closed chrishsr closed 1 year ago

chrishsr commented 1 year ago

🐛 Bug

I'm using the Proximal Policy Optimization (PPO) algorithm to train an agent in an OpenAI Gym environment for trading. After training the agent and saving it, I reload it and run simulations, but the results are inconsistent: simulations run right after training produce the desired results, whereas simulations run after reloading the saved agent produce vastly different results, far from the desired outcome. This happens even if I run the same simulation multiple times.

Please note that I cannot upload the environment since it requires several gigabytes of data used by the environment.

I suspect that the issue may be related to model state that is still in GPU memory after training completes, and that is only cleared once the program terminates and the VS Code instance is closed. When I reload the saved agent and run simulations, this leftover state may still be present in GPU memory and interfere with the agent's behavior.
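
A quick way to test this hypothesis might be to clear the CUDA cache and reload the saved agent on the CPU, so that no leftover GPU state can play a role; if the CPU evaluation then matches the results obtained right after training, stale GPU memory would indeed be a plausible culprit (a minimal sketch; the model path below is just an example):

import torch
from stable_baselines3 import PPO

# Free any cached GPU memory left over from training.
torch.cuda.empty_cache()

# Reload the saved agent on the CPU so leftover GPU state cannot interfere.
model_cpu = PPO.load('./models/PPO_1m', device='cpu')  # example path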

I'm looking for help understanding why this inconsistency is occurring and how to fix it.

Edit: please change the label if I picked the wrong one.

To Reproduce

# Suppress noisy warnings and log output
import os
print(os.environ["CUDA_PATH"])
#os.environ["CUDA_PATH"] = 'C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v11.2'
#print(os.environ["CUDA_PATH"])

# https://stackoverflow.com/questions/40426502/is-there-a-way-to-suppress-the-messages-tensorflow-prints/40426709
#os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'  # or any {'0', '1', '2'}
os.environ["CUDA_VISIBLE_DEVICES"] = "1"  # specify which GPU(s) to be used #0 is RTX 3090 #0 is RTX 3080
import warnings
# https://stackoverflow.com/questions/15777951/how-to-suppress-pandas-future-warning
warnings.simplefilter(action='ignore', category=FutureWarning)
warnings.filterwarnings(action='ignore', category=DeprecationWarning)
warnings.simplefilter(action='ignore', category=Warning)

import torch
import torch.nn as nn

print(f'PyTorch version : {torch.__version__}')
print(f'Cuda available  : {torch.cuda.is_available()}')
print(f'Device count    : {torch.cuda.device_count()}')
print(f'Using device Nr.: {torch.cuda.current_device()}')
print(f'Device name     : {torch.cuda.get_device_name(0)}')

import numpy  as np

import gym
from gym.envs.registration import register

from stable_baselines3 import A2C, DDPG, DQN, HER, PPO, SAC, TD3

np.random.seed(42)

log_path         = f'./log'
image_path       = f'./images'
model_dir        = f'{log_path}/Agents/model'
model_name       = f'model.zip'
model_path       = f'{model_dir}/{model_name}'
env_path         = f'{log_path}/Environments/ENV/'
tensorboard_path = f'{log_path}/Tensorboard'

# Create the output directories (exist_ok avoids errors if they already exist).
os.makedirs(log_path, exist_ok=True)
os.makedirs(image_path, exist_ok=True)
os.makedirs(model_dir, exist_ok=True)
os.makedirs(env_path, exist_ok=True)
os.makedirs(tensorboard_path, exist_ok=True)

training_steps  = 100
model_path      = './models' 
period_length   = int(1440/2)

register(
    id='trading-v0',
    entry_point='good_env:TradingEnvironment',
    max_episode_steps=period_length
)

policy_kwargs = dict(activation_fn=torch.nn.ReLU,
                     net_arch=dict(pi=[256, 256], vf=[256, 256]))

training_steps = 1000000
model_path     = './models' 

# PPO
def train_eval_PPO(timeframe     : str = '1m',
                   period_length : int = 720,
                   training_steps: int = 1000000):

    trading_environment = gym.make('trading-v0',
                                   period_length=period_length,
                                   timeframe=timeframe)
    trading_environment.seed(42)

    print(f'Current model: PPO')
    model = PPO(policy='MlpPolicy', policy_kwargs=policy_kwargs, env=trading_environment, verbose=0, tensorboard_log=tensorboard_path)

    model.learn(training_steps, progress_bar=True)

    model.save(f'{model_path}/PPO_{timeframe}')

    del model

    print(f'-> Evaluating.')

    model   = PPO.load(f'{model_path}/PPO_{timeframe}')

    obs     = trading_environment.reset(randomize=False)
    done    = False

    while not done:
        action, _ = model.predict(obs, deterministic=True)
        obs, reward, done, info = trading_environment.step(action)
    trading_environment.render(mode='save', name=f'PPO_{timeframe}')
    print(f'-> Done.')

#'1h' <- problems with nan in observation
training_frames = ['1m', '4h', '1d']

for timeframe in training_frames: 
    train_eval_PPO(timeframe=timeframe, training_steps=100)
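
To narrow down whether the save/load round trip itself changes the policy, a weight comparison could be added inside train_eval_PPO, just before del model (a minimal sketch that reuses the local names from the function above):

# Sketch: reload the freshly saved agent and compare its policy weights
# with the model that is still in memory (place this before `del model`).
reloaded = PPO.load(f'{model_path}/PPO_{timeframe}')
for (name, p_before), (_, p_after) in zip(model.policy.named_parameters(),
                                          reloaded.policy.named_parameters()):
    assert torch.allclose(p_before.cpu(), p_after.cpu()), f'Mismatch in {name}'
print('Saved and reloaded policies have identical parameters.')

If the parameters match but the simulation results still differ, the environment itself (e.g. its reset state or the data it loads) would be the more likely source of the inconsistency.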

Relevant log output / Error message

No response

System Info

CUDA version: 11.7 (C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.7)
PyTorch version: 1.13.0+cu117

araffin commented 1 year ago

Hello, it's hard to say anything, as the provided code doesn't allow us to reproduce the issue.
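
For instance, a standalone save/load consistency check on a built-in Gym environment, along these lines (Pendulum-v1 is only used as a placeholder), would make it much easier to investigate:

import gym
import numpy as np
from stable_baselines3 import PPO

env = gym.make('Pendulum-v1')
model = PPO('MlpPolicy', env, verbose=0)
model.learn(1_000)
model.save('ppo_pendulum')

obs = env.reset()
action_before, _ = model.predict(obs, deterministic=True)

reloaded = PPO.load('ppo_pendulum')
action_after, _ = reloaded.predict(obs, deterministic=True)

# With deterministic=True, both actions should be identical.
print(np.allclose(action_before, action_after))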

Please note that we do not offer tech support for getting RL to work on task X. Your best bet is to read the tips in the documentation, in addition to other sources of RL advice.


The following is an automated answer:

As you seem to be trying to apply RL to stock trading, I must also warn you about it. Here is a recommendation from a former professional trader:

Retail trading, retail trading with ML, and retail trading with RL are bad ideas for almost everyone to get involved with.

chrishsr commented 1 year ago

It seems like it was an issue with hot-swapping my cards. My drivers apparently did not like that I installed a 3080 to train on and a 3090 to game with simultaneously. After completely reinstalling all graphics drivers, and not gaming while training, the bug seems to have disappeared. Sorry for the inconvenience.