ds5678 closed this issue 7 months ago
Hey. There are a couple of issues with how you use the library:
You should end the episode at some point; maxSteps should be somewhere between 1 and a not-too-large int. For infinite tasks you must learn to spoof them as finite episodes. Your agent never learns because the episode is never finished: set the "done" flag to true so that Reset will be called.
For PPO, it's best to use some batch size.
private readonly float[] state = [RandomValue()]; means that the network and your "Step" method will see different state, so the reward is assigned randomly. Better to assign it in Reset or on a per-step basis.
Having a "0f" value for observations can lead to issues; better to start at 1.
I'm not sure your "trivial" env is very trivial: you have single-step episodes, and it may take a lot of hyperparameter tuning for the network to get something useful out of the signals.
I gave it a go and hit some issues myself; maybe I broke something without realising. Give me a day.
Here, this is the correct way to set up the env, with the tips above applied.
Be sure to grab the newest NuGet: 0.2.1. I may or may not have broken some things in the previous version :)
public class TrivialEnvironment : IEnvironment<float[]>
{
    public const float CorrectAnswerReward = 1;
    public const float WrongAnswerPenalty = -1;

    public float[] state;

    public static int RandomValue()
    {
        return Random.Shared.Next(2);
    }

    public int stepCounter { get; set; }
    public int maxSteps { get; set; } = 10;
    public bool isDone { get; set; }
    public OneOf<int, (int, int)> stateSize { get; set; } = 1;
    public int[] actionSize { get; set; } = [2];

    public float[] GetCurrentState() => state;

    public TrivialEnvironment() => Initialise();

    public void Initialise() => Reset();

    public void Reset()
    {
        // Fresh random observation; step counter and done flag cleared.
        state = new float[1] { RandomValue() };
        stepCounter = 0;
        isDone = false;
    }

    public float Step(int[] actionsIds)
    {
        float input = state[0];
        float output = actionsIds[0];
        state[0] = RandomValue();

        // Spoof the infinite task as a finite episode: truncate after maxSteps.
        if (stepCounter++ >= maxSteps)
        {
            isDone = true;
        }

        return input == output ? CorrectAnswerReward : WrongAnswerPenalty;
    }
}
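Before wiring the env to an agent, it can help to sanity-check it with a plain rollout loop. This is my own sketch, not part of the library's API: the rng.Next(2) call is a random stand-in for the agent's policy, and the loop just confirms that episodes actually terminate via isDone and that Reset starts each one fresh.

```csharp
using System;

// Sanity-check driver for the TrivialEnvironment class above.
var env = new TrivialEnvironment();
var rng = new Random();

for (int episode = 0; episode < 3; episode++)
{
    env.Reset();
    float totalReward = 0f;

    while (!env.isDone)
    {
        int action = rng.Next(2);                  // random policy stand-in
        totalReward += env.Step(new[] { action }); // reward for this step
    }

    // With maxSteps = 10, each episode ends after a bounded number of steps.
    Console.WriteLine($"episode {episode}: steps = {env.stepCounter}, total reward = {totalReward}");
}
```

If the while loop never exits, the done flag is broken and the agent will never see a finished episode, which is exactly the failure mode described in the tips above.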
Context
I was having difficulty using reinforcement learning for more complex problems, so I made a simple project to test the learning potential. In the code below, I am trying to teach the Boolean identity function to the agent, but the agent gives random answers before and after training.
Code
Am I using the library incorrectly?