Hey.
Thanks for trying RLMatrix.
Sorry for the late reply, I was actually on holiday. I have some more features that I will be pumping into RLMatrix in the next couple of weeks.
Could you show some more examples of how you're using it so I understand better what features to add?
If I understand correctly: you want to be able to poll the environment every now and then for an observation? Or perhaps you want to have a buffer of observations & actions coming in from realtime?
Some code would be great!
Hi Adrian,
I am implementing RLMatrix in the NinjaTrader trading platform. NinjaTrader provides a C# environment for writing indicators and trading strategies, and it is event driven: for example, a strategy can buy or sell a security in OnBarUpdate(). Even though buying and selling may sound simple, it is not, and I would not risk trying to simulate it. So execution has to advance one price change at a time. For that I modified your code a little to process each observation. I am using the previous release of RLMatrix, since I could not get the latest release to work.
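For context, the environment side is wired up roughly like this. This is only a simplified sketch, not my exact code: the class name, SetObservation() and the reward calculation are placeholders, and the members are approximated from how the agents below call the environment.

// Simplified sketch of an event-driven environment wrapper. The strategy's
// OnBarUpdate() pushes the newest price features in, and the agent is then
// trained on that single observation.
public class NinjaTraderEnv : IEnvironment<float[]>
{
    private float[] currentState;

    public bool isDone { get; private set; }

    public void Reset()
    {
        isDone = false;
    }

    public float[] GetCurrentState() => currentState;

    // Called from the strategy on every price change / bar update.
    public void SetObservation(float[] observation, bool endOfSession)
    {
        currentState = observation;
        isDone = endOfSession;
    }

    // Executes the chosen action (e.g. buy / sell / hold) through NinjaTrader
    // and returns the reward, e.g. realized PnL since the previous step.
    public float Step(int[] discreteActions, float[] continuousActions)
    {
        // ... place or close orders here and compute the reward ...
        return 0f;
    }
}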
PPOAgent:
T state;
List<Transition<T>> transitionsInEpisode;
bool initial = true;
float cumulativeReward;

public void TrainObservation()
{
    if (initial)
    {
        initial = false;
        episodeCounter++;
        // Initialize the environment and get its state
        myEnvironment.Reset();
        state = DeepCopy(myEnvironment.GetCurrentState());
        cumulativeReward = 0;
        transitionsInEpisode = new List<Transition<T>>();
    }

    // Select an action based on the policy
    (int[], float[]) action = SelectAction(state);
    // Take a step using the selected action
    float reward = myEnvironment.Step(action.Item1, action.Item2);
    // Check if the episode is done
    var done = myEnvironment.isDone;

    T nextState;
    if (done)
    {
        // If done, there is no next state
        nextState = default;
        initial = true;
    }
    else
    {
        // If not done, get the next state
        nextState = DeepCopy(myEnvironment.GetCurrentState());
    }

    if (state == null)
        Console.WriteLine("state is null");

    // Store the transition in temporary memory
    transitionsInEpisode.Add(new Transition<T>(state, action.Item1, action.Item2, reward, nextState));
    cumulativeReward += reward;

    // If not done, move to the next state
    if (!done)
    {
        state = nextState;
    }
    else
    {
        foreach (var item in transitionsInEpisode)
        {
            myReplayBuffer.Push(item);
        }
        OptimizeModel();

        //TODO: hardcoded chart
        episodeRewards.Add(cumulativeReward);
        if (myOptions.DisplayPlot != null)
        {
            myOptions.DisplayPlot.CreateOrUpdateChart(episodeRewards);
        }
    }
}
DQNAgent:
T state;
bool initial = true;
float cumulativeReward;

public void TrainObservation()
{
    if (initial)
    {
        initial = false;
        episodeCounter++;
        // Initialize the environment and get its state
        myEnvironment.Reset();
        state = DeepCopy(myEnvironment.GetCurrentState());
        cumulativeReward = 0;
    }

    // Select an action based on the policy
    var action = SelectAction(state);
    // Take a step using the selected action
    var reward = myEnvironment.Step(action);
    // Check if the episode is done
    var done = myEnvironment.isDone;

    T nextState;
    if (done)
    {
        // If done, there is no next state
        nextState = default;
        initial = true;
    }
    else
    {
        // If not done, get the next state
        nextState = DeepCopy(myEnvironment.GetCurrentState());
    }

    if (state == null)
        Console.WriteLine("state is null");

    // Store the transition in temporary memory
    myReplayBuffer.Push(new Transition<T>(state, action, null, reward, nextState));
    cumulativeReward += reward;

    // If not done, move to the next state
    if (!done)
    {
        state = nextState;
    }

    // Perform one step of the optimization (on the policy network)
    OptimizeModel();
    // Soft update of the target network's weights: θ′ ← τ·θ + (1 − τ)·θ′
    SoftUpdateTargetNetwork();

    if (done)
    {
        episodeRewards.Add(cumulativeReward);
        if (myOptions.DisplayPlot != null)
        {
            myOptions.DisplayPlot.CreateOrUpdateChart(episodeRewards);
        }
    }
}
It all runs, but the results are random. First of all, running on the CPU uses only 2 threads, and the GPU is even slower. DQN is too slow: on average a test may have 30K events over 200K observations. Reloading a previously trained agent and running training on the same data produces random results, with no improvement at all. PPO tends to quit very early, and I have tried different parameters. In general, neither DQN nor PPO responds to feedback, and PPO gets stuck trying the same action.
Thank you
Okay, so I've updated the NuGet packages and the GitHub repository to the newest version of RLMatrix.
This time it's like you required: one step of the environment at a time, and better yet, we can step any number of environments simultaneously. I've updated the examples so you can have a look at the code there. It doesn't change much.
var envppo = new List<IEnvironment<float[]>> { new CartPole(), new CartPole() };
var myAgentppo = new PPOAgent<float[]>(optsppo, envppo);

for (int i = 0; i < 10000; i++)
{
    myAgentppo.Step();
}
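For an event-driven host like yours the same API should map naturally: just call Step() once per platform event instead of in a loop. Here is a rough sketch under assumptions (NinjaTraderEnv and SetObservation() refer to the wrapper you sketched above, and BuildLatestFeatures() and the PPOAgentOptions name are placeholders for your own code):

// Sketch only: one agent step per platform event instead of a for-loop.
public class TradingStrategy
{
    private readonly NinjaTraderEnv env = new NinjaTraderEnv();
    private PPOAgent<float[]> agent;

    public void Initialize(PPOAgentOptions opts)
    {
        agent = new PPOAgent<float[]>(opts, new List<IEnvironment<float[]>> { env });
    }

    // Called by the platform on every price update (e.g. OnBarUpdate()).
    public void OnBarUpdate()
    {
        env.SetObservation(BuildLatestFeatures(), endOfSession: false);
        agent.Step();   // advances the agent and its environment by exactly one step
    }

    private float[] BuildLatestFeatures() => new float[0]; // placeholder for real features
}

Either way, Step() now advances a single environment step at a time, so the custom TrainObservation() shouldn't be needed anymore.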
Let me know if this works for you :)
On another note, it's great that you are trying to use deep reinforcement learning for stock trading. I know many academics are working on this difficult task, and my own adventure with deep learning also started with trying to use it for crypto trading. Keep in mind this is going to be a daunting task; I would suggest you first have a look at examples where reinforcement learning was used successfully to win at poker.
Thank you for the update and the article.
I already have a GBM (on top of other tools) that forecasts the market at over 80% accuracy. The RL layer on top is there to make the trades, a task for which it is more suitable.
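Conceptually the GBM output just becomes one more feature in the observation the RL agent sees, something like this (illustrative only, the names are made up):

// Illustrative only: the GBM forecast is treated as just another observation feature.
private float[] BuildObservation(float gbmForecast, float currentPrice, float positionSize)
{
    return new float[] { gbmForecast, currentPrice, positionSize };
}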
In that case it sounds like a good use case. Hope it works out for you :) I will close the issue now, but feel free to contact me or open a new one anytime you need help setting something up with RLMatrix, I am happy to help.
Thank you for RLMatrix.
I have a complex environment that cannot be simulated, in which interdependent observations are generated one at a time, so running training on an entire episode at once is not an option. I had to add a TrainObservation() based on your TrainEpisode() code. Please add this minor fix.
Thanks