DLR-RM / stable-baselines3

PyTorch version of Stable Baselines, reliable implementations of reinforcement learning algorithms.
https://stable-baselines3.readthedocs.io
MIT License

[Question] Thoughts on ideal training environment to save the best agent during training when using multiple envs? #714

Closed: windowshopr closed this issue 2 years ago

windowshopr commented 2 years ago

Question

I've made use of the following snippet:

    from stable_baselines3 import A2C
    from stable_baselines3.common.vec_env import SubprocVecEnv

    num_cpu = 6  # Number of processes to use
    # Create the vectorized environment
    env = WilKin_Stock_Trading_Environment(df, lookback_window_size=lookback_window_size)
    env = SubprocVecEnv([make_env(env, i) for i in range(num_cpu)])

    model = A2C('MlpPolicy', env, verbose=1, gamma=0.91)

With this, my understanding is that 6 agents are being trained at once, with the total_timesteps defined during training split among them, which results in 6 different test results.

How does one combine these results into 1 agent for testing? Is there a way to pick the best agent?

Or, how could one make use of a callback so that the best "agent" gets checkpointed as training goes along?

Was thinking of using this snippet for the latter:

    from stable_baselines3.common.callbacks import CallbackList, CheckpointCallback, EvalCallback

    checkpoint_callback = CheckpointCallback(save_freq=1000, save_path='./logs/')
    eval_callback = EvalCallback(eval_env, best_model_save_path='./logs/best_model',
                                 log_path='./logs/results', eval_freq=500)
    callback = CallbackList([checkpoint_callback, eval_callback])

Sort of a newbie with SB3. Thanks!

Miffyli commented 2 years ago

No, there is only one agent being trained, using six copies of the environment. This can speed up training (faster stepping of the environment) and also stabilizes A2C/PPO training because you have a larger number of samples per update.
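
For reference, a minimal sketch of that pattern, assuming the WilKin_Stock_Trading_Environment, df, and lookback_window_size from the question, with a make_env factory along the lines of the SB3 multiprocessing docs (each worker builds its own fresh environment rather than reusing one instance):

    from stable_baselines3 import A2C
    from stable_baselines3.common.utils import set_random_seed
    from stable_baselines3.common.vec_env import SubprocVecEnv

    def make_env(rank, seed=0):
        """Return a thunk that builds a fresh environment for worker `rank`."""
        def _init():
            env = WilKin_Stock_Trading_Environment(df, lookback_window_size=lookback_window_size)
            env.seed(seed + rank)  # older gym API; adjust if your env seeds differently
            return env
        set_random_seed(seed)
        return _init

    num_cpu = 6
    env = SubprocVecEnv([make_env(i) for i in range(num_cpu)])

    # Still a single agent: one A2C model collecting experience from 6 env copies
    model = A2C('MlpPolicy', env, verbose=1, gamma=0.91)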

See this example on how to use callbacks to save the best model.
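
The doc example referenced there is, roughly, a custom callback that reads episode rewards from the Monitor log and saves the model whenever the running mean improves. A hedged sketch of that idea (names and details here are illustrative, not the exact doc code; it assumes the training env was wrapped in Monitor(env, log_dir)):

    import os

    import numpy as np

    from stable_baselines3.common.callbacks import BaseCallback
    from stable_baselines3.common.results_plotter import load_results, ts2xy

    class SaveOnBestTrainingRewardCallback(BaseCallback):
        """Save the model whenever the mean training reward (from Monitor logs) improves."""

        def __init__(self, check_freq: int, log_dir: str, verbose: int = 1):
            super().__init__(verbose)
            self.check_freq = check_freq
            self.log_dir = log_dir
            self.save_path = os.path.join(log_dir, "best_model")
            self.best_mean_reward = -np.inf

        def _on_step(self) -> bool:
            if self.n_calls % self.check_freq == 0:
                # Read episode rewards written by the Monitor wrapper
                x, y = ts2xy(load_results(self.log_dir), "timesteps")
                if len(x) > 0:
                    mean_reward = np.mean(y[-100:])  # mean over the last 100 episodes
                    if mean_reward > self.best_mean_reward:
                        self.best_mean_reward = mean_reward
                        self.model.save(self.save_path)
            return True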


The following is an automated answer:

As you seem to be trying to apply RL to stock trading, I also must warn you about it. Here is a recommendation from a former professional trader:

Retail trading, retail trading with ML, and retail trading with RL are bad ideas for almost everyone to get involved with.

  • I was a quant trader at a major hedge fund for several years. I am now retired.
  • On average, traders lose money. On average, retail traders especially lose money. An excellent approximation of trading, and especially of retail trading, is 'gambling'.
  • There is a lot more bad advice on trading out there than good advice. It is extraordinarily difficult to demonstrate that any particular advice is some of the rare good advice.
  • As such, it's reasonable to treat all commentary on retail trading as an epsilon away from snake oil salesmanship. Sometimes that'll be wrong, but it's a strong rule of thumb.
  • I feel a sense of responsibility to the less world-wise members of this community - which includes plenty of highschoolers - and so I find myself unable to let a conversation about retail trading occur without interceding and warning that it's very likely snake oil.
  • I find repeatedly making these warnings and the subsequent fights to be exhausting.

windowshopr commented 2 years ago

Thanks, I'll review the callback!

araffin commented 2 years ago

See this example on how to use callbacks to save the best model.

Well, the EvalCallback is the recommended way to go (it is the default in the RL Zoo); the example in the doc is just there to demonstrate the use of callbacks.
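
A minimal sketch of that recommended setup, assuming the vectorized training env from above and an eval_env built from the same custom environment class on held-out data (test_df); EvalCallback writes best_model.zip under best_model_save_path, which can be reloaded after training:

    from stable_baselines3 import A2C
    from stable_baselines3.common.callbacks import EvalCallback
    from stable_baselines3.common.monitor import Monitor

    # Separate evaluation environment, wrapped in Monitor for episode statistics
    eval_env = Monitor(WilKin_Stock_Trading_Environment(test_df, lookback_window_size=lookback_window_size))

    eval_callback = EvalCallback(
        eval_env,
        best_model_save_path='./logs/best_model/',
        log_path='./logs/results/',
        eval_freq=10_000,        # evaluate every 10k calls of the callback
        n_eval_episodes=5,
        deterministic=True,      # see the note on deterministic evaluation at the end of the thread
    )

    model = A2C('MlpPolicy', env, verbose=1)
    model.learn(total_timesteps=1_000_000, callback=eval_callback)

    # Reload the best model found during training
    best_model = A2C.load('./logs/best_model/best_model')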

Radilx commented 2 years ago

@windowshopr How did you define your eval_env? Doesn't it have to be the same type as the training env, and therefore a SubprocVecEnv? Is it just a SubprocVecEnv with one element, or something else?

windowshopr commented 2 years ago

@Radilx I used the SubprocVecEnv for the training, and then just a make_vec_env for the eval env. This is what it all looks like:

    from stable_baselines3.common.env_util import make_vec_env
    from stable_baselines3.common.monitor import Monitor
    from stable_baselines3.common.vec_env import SubprocVecEnv

    num_cpu = 6  # Number of processes to use
    # Create the vectorized training environment
    env = WilKin_Stock_Trading_Environment(train_df, lookback_window_size=lookback_window_size)
    env = Monitor(env)
    env = SubprocVecEnv([make_env(env, i) for i in range(num_cpu)])

    # Eval env
    eval_env = WilKin_Stock_Trading_Environment(test_df, lookback_window_size=lookback_window_size)
    eval_env = make_vec_env(lambda: eval_env, n_envs=1)

And then you just run training with env and testing with eval_env.
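
For the testing step, one option (a sketch, assuming the model and eval_env from above) is SB3's evaluate_policy helper:

    from stable_baselines3.common.evaluation import evaluate_policy

    # Run a few evaluation episodes on the held-out environment
    mean_reward, std_reward = evaluate_policy(model, eval_env, n_eval_episodes=5, deterministic=False)
    print(f"Eval mean reward: {mean_reward:.2f} +/- {std_reward:.2f}")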

windowshopr commented 2 years ago

Not entirely sure I'm doing this right yet.

This is what I have for a setup right now, trying to work in the eval callback:

    from stable_baselines3 import A2C
    from stable_baselines3.common.callbacks import EvalCallback
    from stable_baselines3.common.monitor import Monitor
    from stable_baselines3.common.vec_env import SubprocVecEnv

    # Lookback window used by the environments below
    lookback_window_size = 16

    # Define the train and eval datasets
    train_df = df.iloc[0:int(len(df)*0.75), :]
    test_df = df.iloc[int(len(df)*0.75):, :]

    # Number of processes to use
    num_cpu = 3

    # Create the vectorized training environment
    env = WilKin_Stock_Trading_Environment(train_df, lookback_window_size=lookback_window_size)
    env = Monitor(env)
    env = SubprocVecEnv([make_env(env, i) for i in range(num_cpu)])

    # Create the eval environment on the test dataset
    eval_env = WilKin_Stock_Trading_Environment(test_df, lookback_window_size=lookback_window_size)
    eval_env = Monitor(eval_env)
    eval_env = SubprocVecEnv([make_env(eval_env, i) for i in range(num_cpu)])

    # Model the training dataset
    model = A2C('MlpPolicy', env, verbose=1,
                gamma=0.95,
                n_steps=32,
                learning_rate=0.0001,
                tensorboard_log="./a2c_tensorboard/")

    # Total number of training iterations
    iterations = 240

    # Calculate how many timesteps you want to do, i.e. how many times you want
    # to run through the training dataset using the iterations above
    total_timesteps = (len(train_df) - lookback_window_size) * iterations

    # Define the evaluation callback on the eval env.
    # Only need to evaluate after 1 training iteration re: eval_freq below
    eval_callback = EvalCallback(eval_env,
                                 best_model_save_path='./logs/',
                                 log_path='./logs/',
                                 eval_freq=(len(train_df) - lookback_window_size),
                                 deterministic=True,
                                 render=False)

    # Start learning
    model.learn(total_timesteps=total_timesteps,
                eval_freq=(len(train_df) - lookback_window_size),
                n_eval_episodes=2,
                callback=eval_callback)

Every time it evaluates, however, the mean reward is always 0, even though the rollout mean reward keeps going up as training goes along. My eval logs look like this:

Eval num_timesteps=33672, episode_reward=0.00 +/- 0.00
Episode length: 11223.00 +/- 0.00
-------------------------------------
| eval/                 |           |
|    mean_ep_length     | 1.12e+04  |
|    mean_reward        | 0         |
| time/                 |           |
|    total_timesteps    | 33672     |
| train/                |           |
|    entropy_loss       | -0.868    |
|    explained_variance | -0.000936 |
|    learning_rate      | 0.0001    |
|    n_updates          | 350       |
|    policy_loss        | 0.173     |
|    value_loss         | 0.386     |
-------------------------------------
New best mean reward!

# ...some ways down is the next one:

Eval num_timesteps=67344, episode_reward=0.00 +/- 0.00
Episode length: 11223.00 +/- 0.00
------------------------------------
| eval/                 |          |
|    mean_ep_length     | 1.12e+04 |
|    mean_reward        | 0        |
| time/                 |          |
|    total_timesteps    | 67344    |
| train/                |          |
|    entropy_loss       | -0.63    |
|    explained_variance | -0.0295  |
|    learning_rate      | 0.0001   |
|    n_updates          | 701      |
|    policy_loss        | 0.0327   |
|    value_loss         | 0.0697   |
------------------------------------

# ...

Eval num_timesteps=101016, episode_reward=0.00 +/- 0.00
Episode length: 11223.00 +/- 0.00
------------------------------------
| eval/                 |          |
|    mean_ep_length     | 1.12e+04 |
|    mean_reward        | 0        |
| time/                 |          |
|    total_timesteps    | 101016   |
| train/                |          |
|    entropy_loss       | -0.536   |
|    explained_variance | -0.129   |
|    learning_rate      | 0.0001   |
|    n_updates          | 1052     |
|    policy_loss        | -0.0753  |
|    value_loss         | 0.0317   |
------------------------------------

The TensorBoard looks like this (I'm not worried about the actual values, just that it doesn't seem to be evaluating properly):

[TensorBoard screenshot]

However, when I don't use the eval_callback and test the model once training is done, it acts in the eval env properly and takes actions like it should (and generates profit over the most recent 6 months of 5-minute chart trading, btw :P). So I guess I'm still not sure what this eval callback is doing; it doesn't look like it's "evaluating" anything, even though the rollout mean rewards keep going up and up as training goes on. Maybe someone can explain what's happening here? :) I'm thinking it has something to do with not evaluating often enough? Maybe?
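
For reference, a rough sketch of that kind of manual test, using the model and the single-env vectorized eval_env from the earlier snippet:

    # eval_env here is a VecEnv with a single environment
    obs = eval_env.reset()
    done = [False]
    total_reward = 0.0
    while not done[0]:
        action, _states = model.predict(obs, deterministic=False)
        obs, rewards, done, infos = eval_env.step(action)
        total_reward += float(rewards[0])
    print("Episode reward:", total_reward)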

araffin commented 2 years ago

you should probably set deterministic to False.
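
A plausible reason this helps: with deterministic=True, evaluation always takes the argmax action, which in a trading env can collapse to a single action such as always holding, giving a constant reward of 0, whereas training rollouts sample actions stochastically. Applied to the EvalCallback snippet above:

    eval_callback = EvalCallback(eval_env,
                                 best_model_save_path='./logs/',
                                 log_path='./logs/',
                                 eval_freq=(len(train_df) - lookback_window_size),
                                 deterministic=False,  # sample actions during evaluation
                                 render=False)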

windowshopr commented 2 years ago

you should probably set deterministic to False.

That helped! Thanks!