AminHP / gym-anytrading

The most simple, flexible, and comprehensive OpenAI Gym trading environment (Approved by OpenAI Gym)

Agent can't learn on future data #63

Closed · gendrelom closed 2 years ago

gendrelom commented 2 years ago

I used an example from the examples folder and tried two different types of runs: with future data (you can find this code by searching for "#Future data") and without. I expected to see a big difference in rewards between these two runs (or at least some difference).

import numpy as np
import pandas as pd

import gym
import gym_anytrading
import quantstats as qs

from stable_baselines import A2C
from stable_baselines.common.vec_env import DummyVecEnv

import matplotlib.pyplot as plt

df = gym_anytrading.datasets.STOCKS_GOOGL.copy()

#Future data
#enable or disable these lines
for i in range(10):
  df[f'future_{i}'] = df['Close'].shift(-i - 1)

window_size = 10
start_index = window_size
end_index = len(df)

env_maker = lambda: gym.make(
    'stocks-v0',
    df = df,
    window_size = window_size,
    frame_bound = (start_index, end_index)
)

env = DummyVecEnv([env_maker])

rewards = []

def simple_callback(_locals, _globals):
    rewards.append(_locals['rewards'])

policy_kwargs = dict(net_arch=[64, 'lstm', dict(vf=[128, 128, 128], pi=[64, 64])])
model = A2C('MlpLstmPolicy', env, verbose=1, policy_kwargs=policy_kwargs)
model.learn(total_timesteps=50000, callback=simple_callback)

#result:
np.array(rewards).mean()

But instead of a reward difference, I got almost the same results for np.array(rewards).mean():

#reward mean with future data:    run 1: 0.22615711, run 2: 0.19079535
#reward mean without future data: run 1: 0.20926535, run 2: 0.2303483

I think this is important because if the agent can't even learn on cheat data, then what is the point of trying to find useful indicators? What do you think, and how can I fix it?

AminHP commented 2 years ago

Hi @gendrelom ,

Have you overridden the _process_data method? You should inherit from the StocksEnv class and override this method in order to add your new features.
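Roughly like this (a minimal sketch; MyStocksEnv is a placeholder name, and the future_* columns stand for whatever extra feature columns you add to your dataframe):

import numpy as np
from gym_anytrading.envs import StocksEnv

class MyStocksEnv(StocksEnv):
    def _process_data(self):
        # same price slicing as the base class
        prices = self.df.loc[:, 'Close'].to_numpy()
        prices = prices[self.frame_bound[0] - self.window_size:self.frame_bound[1]]
        diff = np.insert(np.diff(prices), 0, 0)

        # extra feature columns, sliced to the same range as the prices
        extra = self.df.loc[:, ['future_0', 'future_1']].fillna(-1).to_numpy()
        extra = extra[self.frame_bound[0] - self.window_size:self.frame_bound[1]]

        signal_features = np.column_stack((prices, diff, extra))
        return prices, signal_features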

gendrelom commented 2 years ago

@AminHP thanks, I tried this script:

import numpy as np
import pandas as pd

import gym
import gym_anytrading
import quantstats as qs

from stable_baselines import A2C
from stable_baselines.common.vec_env import DummyVecEnv

import matplotlib.pyplot as plt
from gym_anytrading.envs import StocksEnv

from gym.envs.registration import register
from copy import deepcopy

def my_process_data(self):
    prices = self.df.loc[:, 'Close'].to_numpy()

    prices[self.frame_bound[0] - self.window_size]  # validate index (TODO: Improve validation)
    prices = prices[self.frame_bound[0]-self.window_size:self.frame_bound[1]]

    diff = np.insert(np.diff(prices), 0, 0)
    signal_features = np.column_stack((prices, diff))

    df2 = pd.DataFrame()
    for i in range(10):
        df2[f'future_{i}'] = self.df['Close'].shift(-i - 1)
    df2.fillna(-1, inplace=True)
    signal_features = np.concatenate((df2.to_numpy(), signal_features), axis=1)

    print(type(signal_features))
    print(signal_features.shape)
    print(signal_features)
    return prices, signal_features

class StocksEnv2(StocksEnv):
    _process_data = my_process_data

df = gym_anytrading.datasets.STOCKS_GOOGL.copy()

window_size = 10
start_index = window_size
end_index = len(df)

env_maker = lambda: StocksEnv2(
    df = df,
    window_size = window_size,
    frame_bound = (start_index, end_index)
)

env = DummyVecEnv([env_maker])

rewards = []

def simple_callback(_locals, _globals):
    rewards.append(_locals['rewards'])

policy_kwargs = dict(net_arch=[64, 'lstm', dict(vf=[128, 128, 128], pi=[64, 64])])
model = A2C('MlpLstmPolicy', env, verbose=1, policy_kwargs=policy_kwargs)
model.learn(total_timesteps=50000, callback=simple_callback)
np.array(rewards).mean()

But I again got a small mean reward (0.22072971).

AminHP commented 2 years ago

I don't think the mean reward is a good metric here. The agent is trying to explore the space, causing it to achieve a low reward in most steps, and we don't know whether it is converging. Note that according to the reward function implemented here, the step reward is 0 unless a trade happens. So, there are a few non-zero rewards among many zero rewards, which leads to small changes in mean reward.
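As a quick sanity check (assuming the rewards list collected by your callback), you can look only at the non-zero step rewards:

import numpy as np

rewards_arr = np.array(rewards).flatten()
trade_rewards = rewards_arr[rewards_arr != 0]  # steps where a trade actually happened
print(len(trade_rewards), "non-zero step rewards out of", len(rewards_arr))
print("mean reward over trades only:", trade_rewards.mean())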

Using the code below, I checked max(rewards) and env.envs[0].history['total_reward'][-1]. They are different with k=0 and k=20: (157, 33) for k=20 and (139, 16) for k=0.

Anyway, if you want to reach better results, you should consider improving the signal features and the reward function. The current features might not be enough, even though we have provided information from the future.

import numpy as np
import pandas as pd

import gym
import gym_anytrading

from stable_baselines3 import A2C
from stable_baselines3.common.vec_env import DummyVecEnv

import matplotlib.pyplot as plt
from gym_anytrading.envs import StocksEnv

from gym.envs.registration import register
from copy import deepcopy

k = 0  # number of future ticks visible in each observation (compare k=0 with k=20)

class StocksEnv2(StocksEnv):

    def __init__(self, df, window_size, frame_bound):
        window_size += k
        super().__init__(df, window_size, frame_bound)
        self._start_tick -= k
        self._end_tick -= k

    def _get_observation(self):
        # shift the observation window k ticks into the future
        return self.signal_features[(self._current_tick-self.window_size+k):self._current_tick+k]

df = gym_anytrading.datasets.STOCKS_GOOGL.copy()

window_size = 10
start_index = window_size + k
end_index = len(df)

env_maker = lambda: StocksEnv2(
    df = df,
    window_size = window_size,
    frame_bound = (start_index, end_index)
)

env = DummyVecEnv([env_maker])

rewards = []

def simple_callback(_locals, _globals):
    rewards.append(_locals['rewards'])

model = A2C('MlpPolicy', env, verbose=0)
model.learn(total_timesteps=200000, callback=simple_callback)

print(np.array(rewards).mean())
print(max(rewards))
print(env.envs[0].history['total_reward'][-1])

gendrelom commented 2 years ago

@AminHP thank you very much for your response! I tried your code, running it 3 times with k=0 and 3 times with k=30; here are the results:

k = 0:

# 0.195403
# [154.96997]
# 39.704663999999894

# 0.21608569
# [153.64996]
# 16.501539000000008

# 0.20421642
# [127.890015]
# 54.2843059999999

# averages for k = 0:
# np.array(rewards).mean(): 0.20523503666666665
# max(rewards): [145.503315]
# env.envs[0].history['total_reward'][-1]: 36.8301696666666

k = 30:

# 0.1883287
# [134.69]
# 75.48052900000005

# 0.20450637
# [154.96997]
# 19.829958000000204

# 0.21548672
# [157.91998]
# -34.09904899999992

# averages for k = 30:
# np.array(rewards).mean(): 0.20277393
# max(rewards): [149.19331666666668]
# env.envs[0].history['total_reward'][-1]: 20.403812666666777

So, the mean is again about the same for both, the max reward is a bit higher for k=30, but the total_reward is a lot bigger for k=0, which I think is the more statistically meaningful metric. Why is that, and how can I fix it?

AminHP commented 2 years ago

My numbers for the two other metrics were actually averages over multiple runs, so they are somewhat statistically meaningful.

But as I said earlier, the overall problem is related to the provided features and the reward function. This library is just a simple codebase to get started with. If you want to produce satisfactory results, you should work on it and improve its components; for instance, add some indicators to the signal features. The current implementation may not achieve excellent results in all situations. I was able to reach good outcomes with a Q-Learning agent that I implemented myself. However, it wasn't possible with PPO without many more modifications. Besides, passing future data as features to the network cannot help if the model is not able to capture the relations properly.
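As one illustration (a rough, untuned sketch, not the library's default), the reward function can be swapped out the same way as the features, by overriding _calculate_reward. The variant below rewards every step with the price move in the direction of the current position instead of only rewarding closed trades:

from gym_anytrading.envs import StocksEnv, Positions

class DenseRewardStocksEnv(StocksEnv):
    def _calculate_reward(self, action):
        # reward each step with the one-tick price change in the direction
        # of the current position, instead of only when a trade is closed
        price_diff = self.prices[self._current_tick] - self.prices[self._current_tick - 1]
        if self._position == Positions.Long:
            return price_diff
        return -price_diff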

gendrelom commented 2 years ago

@AminHP

My numbers for the two other metrics were actually averages over multiple runs, so they are somewhat statistically meaningful.

For the max(rewards) and env.envs[0].history['total_reward'][-1]? Got it, how many runs did you use?

Also, I did this because I want to understand a method for comparing two models that gives an unambiguous result. For example, I tried to compare the total_reward of two models after one run each: one model with future data and one without. total_reward after a single run gave a bad result: the agent that clearly has an advantage (because it used future data) got a worse result than the other agent. So, what do you do to determine whether an indicator actually makes the model better?

AminHP commented 2 years ago

Three times, but apparently that is not enough.

Generally, the dataframe should be split into two parts in order to compare different models: one for training and the other for testing. Train each agent on the training dataset and then test it on the testing dataset. The total reward and profit earned by the agents on the testing data can serve as metrics for comparison. The mean reward you were working with is just a measure of the agent's learning curve over episodes; it is not a metric for comparing different models.
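A minimal sketch of that workflow, assuming the old gym API used elsewhere in this thread (the 80/20 split and the timestep budget are just example choices):

import gym_anytrading
from gym_anytrading.envs import StocksEnv
from stable_baselines3 import A2C

df = gym_anytrading.datasets.STOCKS_GOOGL.copy()
window_size = 10
split = int(len(df) * 0.8)

# train on the first 80% of the data
train_env = StocksEnv(df=df, window_size=window_size, frame_bound=(window_size, split))
model = A2C('MlpPolicy', train_env, verbose=0)
model.learn(total_timesteps=100_000)

# evaluate on the held-out 20%
test_env = StocksEnv(df=df, window_size=window_size, frame_bound=(split, len(df)))
obs = test_env.reset()
done = False
while not done:
    action, _ = model.predict(obs)
    obs, reward, done, info = test_env.step(action)

print(info)  # total_reward and total_profit on unseen data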