DLR-RM / stable-baselines3

PyTorch version of Stable Baselines, reliable implementations of reinforcement learning algorithms.
https://stable-baselines3.readthedocs.io
MIT License

[Bug]: Inconsistent results with trained agent in OpenAI gym environment #1502

Closed chrishsr closed 1 year ago

chrishsr commented 1 year ago

🐛 Bug

I'm using the Proximal Policy Optimization (PPO) algorithm to train an agent in an OpenAI gym environment for trading. After training the agent and saving it, I reload it and run simulations, but the results are inconsistent. Specifically, simulations run right after training produce the desired results, while simulations run after reloading the saved agent produce vastly different results that are far from the desired outcome. This happens even if I run the same simulation multiple times.

Please note that I cannot upload the environment since it requires several gigabytes of data used by the environment.
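
Since I can't share the environment, here is a sketch of the kind of consistency check I mean, with CartPole-v1 standing in for my trading environment (names and paths here are illustrative, not from my actual project):

import numpy as np
from stable_baselines3 import PPO

# Train briefly, save, reload, then compare the actions of the two copies on identical observations
model = PPO('MlpPolicy', 'CartPole-v1', verbose=0).learn(1000)
model.save('ppo_check')
reloaded = PPO.load('ppo_check')

obs = model.env.reset()
for _ in range(10):
    a1, _ = model.predict(obs, deterministic=True)
    a2, _ = reloaded.predict(obs, deterministic=True)
    assert np.array_equal(a1, a2), 'save/load changed the policy output'
    obs, _, _, _ = model.env.step(a1)

If a check like this passed but the trading environment still diverged after reloading, the difference would point at the environment (or the evaluation setup) rather than at save/load itself.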

I suspect the issue may be related to model state that is still held in GPU memory after training completes. This state may only be cleared once the Python process terminates and the VS Code instance is closed. Therefore, when I reload the saved agent and run simulations, that leftover state may still be present in GPU memory and interfere with the agent's behavior.
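
If that were the case, loading the saved agent with device='cpu' (a standard option of PPO.load) should take stale GPU state out of the picture entirely. A minimal sketch of what I could try (path as in the script below):

import torch
from stable_baselines3 import PPO

# Load the saved agent on the CPU only, so nothing left over in GPU memory can influence inference
cpu_model = PPO.load('./models/PPO_1m', device='cpu')

# Optionally release cached GPU memory that training may still be holding
torch.cuda.empty_cache()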

I'm looking for help understanding why this inconsistency is occurring and how to fix it.

Edit: please change the label if I picked the wrong one.

To Reproduce

# Suppress noisy warnings and framework log output
import os
print(os.environ["CUDA_PATH"])
#os.environ["CUDA_PATH"] = 'C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v11.2'
#print(os.environ["CUDA_PATH"])

# https://stackoverflow.com/questions/40426502/is-there-a-way-to-suppress-the-messages-tensorflow-prints/40426709
#os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'  # or any {'0', '1', '2'}
os.environ["CUDA_VISIBLE_DEVICES"] = "1"  # specify which GPU(s) to be used #0 is RTX 3090 #0 is RTX 3080
import warnings
# https://stackoverflow.com/questions/15777951/how-to-suppress-pandas-future-warning
warnings.simplefilter(action='ignore', category=FutureWarning)
warnings.filterwarnings(action='ignore', category=DeprecationWarning)
warnings.simplefilter(action='ignore', category=Warning)

import torch
import torch.nn as nn

print(f'PyTorch version : {torch.__version__}')
print(f'Cuda available  : {torch.cuda.is_available()}')
print(f'Device count    : {torch.cuda.device_count()}')
print(f'Using device Nr.: {torch.cuda.current_device()}')
print(f'Device name     : {torch.cuda.get_device_name(0)}')

import numpy as np

import gym
from gym.envs.registration import register

from stable_baselines3 import PPO  # only PPO is used below

np.random.seed(42)

log_path         = f'./log'
image_path       = f'./images'
model_dir        = f'{log_path}/Agents/model'
model_name       = f'model.zip'
model_path       = f'{model_dir}/{model_name}'
env_path         = f'{log_path}/Environments/ENV/'
tensorboard_path = f'{log_path}/Tensorboard'

# create the output directories (exist_ok=True already handles existing directories)
for path in (log_path, image_path, model_dir, env_path, tensorboard_path):
    os.makedirs(path, exist_ok=True)

training_steps  = 100            # re-assigned to 1000000 below; the loop at the end passes 100 explicitly
model_path      = './models'     # overrides the model_path defined above
period_length   = int(1440/2)    # 720 steps per episode

register(
    id='trading-v0',
    entry_point='good_env:TradingEnvironment',
    max_episode_steps=period_length
)

policy_kwargs = dict(activation_fn=torch.nn.ReLU,
                     net_arch=dict(pi=[256, 256], vf=[256, 256]))

training_steps = 1000000
model_path     = './models' 

# PPO
def train_eval_PPO(timeframe     : str = '1m',
                   period_length : int = 720,
                   training_steps: int = 1000000):

    trading_environment = gym.make('trading-v0',
                                   period_length=period_length,
                                   timeframe=timeframe)
    trading_environment.seed(42)

    print(f'Current model: PPO')
    model = PPO(policy='MlpPolicy', policy_kwargs=policy_kwargs, env=trading_environment, verbose=0, tensorboard_log=tensorboard_path)

    model.learn(training_steps, progress_bar=True)

    model.save(f'{model_path}/PPO_{timeframe}')

    del model

    print(f'-> Evaluating.')

    model   = PPO.load(f'{model_path}/PPO_{timeframe}')

    obs     = trading_environment.reset(randomize=False)
    done    = False

    while not done:
        action, _ = model.predict(obs, deterministic=True)
        obs, reward, done, info = trading_environment.step(action)
    trading_environment.render(mode='save', name=f'PPO_{timeframe}')
    print(f'-> Done.')

# '1h' is excluded: it produces NaN values in the observation
training_frames = ['1m', '4h', '1d']

for timeframe in training_frames: 
    train_eval_PPO(timeframe=timeframe, training_steps=100)
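
Note: the script above seeds only NumPy and the environment; PyTorch itself, which PPO uses internally, is never seeded. A sketch of what fuller seeding could look like, reusing the names from the script (the PPO call would replace the one inside train_eval_PPO):

from stable_baselines3.common.utils import set_random_seed

set_random_seed(42, using_cuda=True)  # seeds Python's random, NumPy and PyTorch (incl. CUDA) in one call

# the seed can also be passed directly to the algorithm
model = PPO(policy='MlpPolicy', policy_kwargs=policy_kwargs, env=trading_environment,
            seed=42, verbose=0, tensorboard_log=tensorboard_path)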

Relevant log output / Error message

No response

System Info

CUDA version    : 11.7 (C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.7)
PyTorch version : 1.13.0+cu117


araffin commented 1 year ago

Hello, it's hard to say anything, as the provided code doesn't allow us to reproduce the issue.

Please note that we do not offer tech support for getting RL to work on task X. Your best bet is to read the RL tips in the documentation, in addition to other sources of RL advice.


The following is an automated answer:

As you seem to be trying to apply RL to stock trading, I must also warn you about it. Here is a recommendation from a former professional trader:

Retail trading, retail trading with ML, and retail trading with RL are bad ideas for almost everyone to get involved with.

chrishsr commented 1 year ago

It seems like it was an issue with hot-swapping my cards. My drivers did not like that I installed a 3080 to train on and a 3090 to game on at the same time. After completely reinstalling all graphics drivers, and not gaming while training, the bug seems to have disappeared. Sorry for the inconvenience.