Hey.
Thanks for trying RLMatrix.
Sorry for the late reply, I was actually on holiday. I have some more features that I will be pumping into RLMatrix in the next couple of weeks.
Could you show some more examples of how you're using it so I understand better what features to add?
If I understand correctly: you want to be able to poll the environment every now and then for an observation? Or perhaps you want to have a buffer of observations & actions coming in from realtime?
Some code would be great!
Hi Adrian,
I am implementing RLMatrix in the NinjaTrader trading platform. NinjaTrader provides a C# environment for writing indicators and trading strategies, and it is event driven: for example, a strategy can buy or sell a security in OnBarUpdate(). Even though buying and selling may sound simple, it is not, and I would not risk trying to simulate it. So execution has to advance one price change at a time. For that I modified your code a little to process each observation. I am using the previous release of RLMatrix, since I could not get the latest release to work.
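For context, the environment side is wired up roughly like this. This is only a simplified sketch, not my exact code: the class name, SetObservation() and the reward calculation are placeholders, and the members are approximated from how the agents below call the environment.

// Simplified sketch of an event-driven environment wrapper. The strategy's
// OnBarUpdate() pushes the newest price features in, and the agent is then
// trained on that single observation.
public class NinjaTraderEnv : IEnvironment<float[]>
{
    private float[] currentState;

    public bool isDone { get; private set; }

    public void Reset()
    {
        isDone = false;
    }

    public float[] GetCurrentState() => currentState;

    // Called from the strategy on every price change / bar update.
    public void SetObservation(float[] observation, bool endOfSession)
    {
        currentState = observation;
        isDone = endOfSession;
    }

    // Executes the chosen action (e.g. buy / sell / hold) through NinjaTrader
    // and returns the reward, e.g. realized PnL since the previous step.
    public float Step(int[] discreteActions, float[] continuousActions)
    {
        // ... place or close orders here and compute the reward ...
        return 0f;
    }
}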
PPOAgent:
T state;
List<Transition<T>> transitionsInEpisode;
bool initial = true;
float cumulativeReward;

public void TrainObservation()
{
    if (initial)
    {
        initial = false;
        episodeCounter++;
        // Initialize the environment and get its state
        myEnvironment.Reset();
        state = DeepCopy(myEnvironment.GetCurrentState());
        cumulativeReward = 0;
        transitionsInEpisode = new List<Transition<T>>();
    }

    // Select an action based on the policy
    (int[], float[]) action = SelectAction(state);
    // Take a step using the selected action
    float reward = myEnvironment.Step(action.Item1, action.Item2);
    // Check if the episode is done
    var done = myEnvironment.isDone;

    T nextState;
    if (done)
    {
        // If done, there is no next state
        nextState = default;
        initial = true;
    }
    else
    {
        // If not done, get the next state
        nextState = DeepCopy(myEnvironment.GetCurrentState());
    }

    if (state == null)
        Console.WriteLine("state is null");

    // Store the transition in temporary memory
    transitionsInEpisode.Add(new Transition<T>(state, action.Item1, action.Item2, reward, nextState));
    cumulativeReward += reward;

    // If not done, move to the next state
    if (!done)
    {
        state = nextState;
    }
    else
    {
        foreach (var item in transitionsInEpisode)
        {
            myReplayBuffer.Push(item);
        }
        OptimizeModel();

        //TODO: hardcoded chart
        episodeRewards.Add(cumulativeReward);
        if (myOptions.DisplayPlot != null)
        {
            myOptions.DisplayPlot.CreateOrUpdateChart(episodeRewards);
        }
    }
}
DQNAgent:
T state;
bool initial = true;
float cumulativeReward;

public void TrainObservation()
{
    if (initial)
    {
        initial = false;
        episodeCounter++;
        // Initialize the environment and get its state
        myEnvironment.Reset();
        state = DeepCopy(myEnvironment.GetCurrentState());
        cumulativeReward = 0;
    }

    // Select an action based on the policy
    var action = SelectAction(state);
    // Take a step using the selected action
    var reward = myEnvironment.Step(action);
    // Check if the episode is done
    var done = myEnvironment.isDone;

    T nextState;
    if (done)
    {
        // If done, there is no next state
        nextState = default;
        initial = true;
    }
    else
    {
        // If not done, get the next state
        nextState = DeepCopy(myEnvironment.GetCurrentState());
    }

    if (state == null)
        Console.WriteLine("state is null");

    // Store the transition in temporary memory
    myReplayBuffer.Push(new Transition<T>(state, action, null, reward, nextState));
    cumulativeReward += reward;

    // If not done, move to the next state
    if (!done)
    {
        state = nextState;
    }

    // Perform one step of the optimization (on the policy network)
    OptimizeModel();
    // Soft update of the target network's weights: θ′ ← τ·θ + (1 − τ)·θ′
    SoftUpdateTargetNetwork();

    if (done)
    {
        episodeRewards.Add(cumulativeReward);
        if (myOptions.DisplayPlot != null)
        {
            myOptions.DisplayPlot.CreateOrUpdateChart(episodeRewards);
        }
    }
}
It all runs, but the results are random. First of all, running on the CPU uses only 2 threads, and the GPU is even slower. DQN is too slow: on average a test may have 30K events over 200K observations. Reloading a previously trained agent and running training on the same data produces random results, with no improvement at all. PPO tends to quit very early, and I have tried different parameters. In general, neither DQN nor PPO responds to feedback, and PPO gets stuck trying the same action.
Thank you
Okay, so I've updated the NuGet packages and the GitHub repository to the newest version of RLMatrix.
This time it's like you required: one step of the environment at a time, and better yet, we can step any number of environments simultaneously. I've updated the examples so you can have a look at the code there. It doesn't change much.
var envppo = new List<IEnvironment<float[]>> { new CartPole(), new CartPole() };
var myAgentppo = new PPOAgent<float[]>(optsppo, envppo);

for (int i = 0; i < 10000; i++)
{
    myAgentppo.Step();
}
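For an event-driven host like yours the same API should map naturally: just call Step() once per platform event instead of in a loop. Here is a rough sketch under assumptions (NinjaTraderEnv and SetObservation() refer to the wrapper you sketched above, and BuildLatestFeatures() and the PPOAgentOptions name are placeholders for your own code):

// Sketch only: one agent step per platform event instead of a for-loop.
public class TradingStrategy
{
    private readonly NinjaTraderEnv env = new NinjaTraderEnv();
    private PPOAgent<float[]> agent;

    public void Initialize(PPOAgentOptions opts)
    {
        agent = new PPOAgent<float[]>(opts, new List<IEnvironment<float[]>> { env });
    }

    // Called by the platform on every price update (e.g. OnBarUpdate()).
    public void OnBarUpdate()
    {
        env.SetObservation(BuildLatestFeatures(), endOfSession: false);
        agent.Step();   // advances the agent and its environment by exactly one step
    }

    private float[] BuildLatestFeatures() => new float[0]; // placeholder for real features
}

Either way, Step() now advances a single environment step at a time, so the custom TrainObservation() shouldn't be needed anymore.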
Let me know if this works for you :)
On another note, it's great that you are trying to use deep reinforcement learning for stock trading. I know many academics are working on this difficult task, and my own adventure with deep learning also started with trying to use it for crypto trading. Keep in mind this is going to be a daunting task; I would suggest you first have a look at examples where reinforcement learning was used successfully to win at poker.
Thank you for the update and the article.
I already have a GBM (on top of other tools) that forecasts the market at over 80% accuracy. The RL layer on top is there to make the trades, a task for which it is more suitable.
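Conceptually the GBM output just becomes one more feature in the observation the RL agent sees, something like this (illustrative only, the names are made up):

// Illustrative only: the GBM forecast is treated as just another observation feature.
private float[] BuildObservation(float gbmForecast, float currentPrice, float positionSize)
{
    return new float[] { gbmForecast, currentPrice, positionSize };
}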
In that case it sounds like a good use case. Hope it works out for you :) I will close the issue now, but feel free to contact me or open a new one anytime you need help setting something up with RLMatrix, I am happy to help.
Thank you for RLMatrix.
I have a complex environment that cannot be simulated, in which interdependent observations are generated one at a time, so running training on an entire episode at once is not an option. I had to add a TrainObservation() based on your TrainEpisode() code. Please add this minor fix.
Thanks