dimitri-rusin / oll_onemax


Visualising and analysis #11

Open · ndangtt opened this issue 7 months ago

ndangtt commented 7 months ago

Two metrics for quantifying the performance of an RL training process:

(1) #hitting times: during training, we evaluate the currently trained policy every k (=2000) time steps. A "good" policy is defined as one whose expected performance is within mean + 25% std, where mean and std are the mean and standard deviation of the performance of the optimal policy. The #hitting times is the number of evaluation points at which we reach a "good" policy during training.

A normalised version of this is the hitting ratio (in [0,1]), defined as #hitting times / # policy evaluations.
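A rough sketch of how these two numbers could be computed (assuming performance is a cost, i.e. lower is better, and that `eval_perfs` holds one performance value per evaluation point; all names here are placeholders, not code from the repo):

```python
import numpy as np

def hitting_stats(eval_perfs, optimal_mean, optimal_std):
    """#hitting times and hitting ratio, assuming lower performance values are better."""
    eval_perfs = np.asarray(eval_perfs, dtype=float)
    threshold = optimal_mean + 0.25 * optimal_std   # "good" = within mean + 25% std of the optimal policy
    hits = int(np.sum(eval_perfs <= threshold))     # #hitting times
    hitting_ratio = hits / len(eval_perfs)          # normalised to [0, 1]
    return hits, hitting_ratio
```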

(2) Area under the learning curve: the area under the learning curve of the RL training process (using the evaluation performance at each evaluation point). This is complementary to #hitting times, as it gives an idea of the convergence speed of the learning and of whether there is a lot of fluctuation.
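A corresponding sketch for the AUC, again with placeholder names; normalising by the training-horizon length is just one optional way to make runs of different lengths comparable:

```python
import numpy as np

def learning_curve_auc(eval_steps, eval_perfs, normalise=True):
    """Area under the evaluation-performance curve, integrated over training time steps."""
    eval_steps = np.asarray(eval_steps, dtype=float)
    eval_perfs = np.asarray(eval_perfs, dtype=float)
    # trapezoidal rule between consecutive evaluation points
    auc = float(np.sum(np.diff(eval_steps) * (eval_perfs[1:] + eval_perfs[:-1]) / 2.0))
    if normalise:
        auc /= eval_steps[-1] - eval_steps[0]       # average performance over the horizon
    return auc
```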

dimitri-rusin commented 7 months ago

I don't think we want the hitting times to be a large number for our RL training process. That would mean that we frequently jump away from the theory-derived policy and back towards it.

Instead, ideally we never want to leave the proximity of the theory-derived policy once we get within mean + 25% std of it.

What we really want to measure, I think, is the best policy ever seen so far at each policy-evaluation time step: what is the best policy we have found up to this time step? This is not necessarily the most recently evaluated one. By definition, this gives a very smooth, monotonically decreasing curve.
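For concreteness, a tiny sketch of the curve I mean (assuming lower performance values are better; `eval_perfs` and its values are made-up placeholders):

```python
import numpy as np

# One performance value per evaluation point (placeholder data; lower is better).
eval_perfs = np.array([120.0, 95.0, 110.0, 80.0, 85.0, 78.0])

# Best-policy-so-far curve: running minimum, non-increasing by construction.
best_so_far = np.minimum.accumulate(eval_perfs)
print(best_so_far)  # [120.  95.  95.  80.  80.  78.]
```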

And if we're making a diagram containing info about all experiment settings, I propose to plot just three things per setting: the best policy ever found, the mean performance over all evaluated policies, and the starting policy, arranged as three points stacked vertically, like in a boxplot.
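Something along these lines, as a rough matplotlib sketch (the `summary_per_setting` structure and the colour choices are made up):

```python
import matplotlib.pyplot as plt

# `summary_per_setting`: experiment-setting label -> (best policy found, mean over
# evaluated policies, starting policy). Purely illustrative structure.
def plot_setting_summaries(summary_per_setting):
    fig, ax = plt.subplots()
    labels = list(summary_per_setting)
    for x, label in enumerate(labels):
        best, mean, start = summary_per_setting[label]
        # three points stacked vertically per setting
        ax.scatter([x, x, x], [best, mean, start], c=["tab:green", "tab:blue", "tab:red"])
    ax.set_xticks(range(len(labels)))
    ax.set_xticklabels(labels, rotation=45, ha="right")
    ax.set_ylabel("policy performance")
    fig.tight_layout()
    return fig
```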

What do you think?

ndangtt commented 7 months ago

I don't think we want the hitting times to be a large number for our RL training process. That would mean that we frequently jump away from the theory-derived policy and back towards it.

--> It does not necessarily mean that we jump back and forth. If the RL training converges to the optimal policy during training, every evaluation point after convergence is counted as a hit, and that is what we want.

What we really want to measure, I think, is the best policy ever seen so far at each policy-evaluation time step: what is the best policy we have found up to this time step? This is not necessarily the most recently evaluated one. By definition, this gives a very smooth, monotonically decreasing curve.

--> I think we want to monitor the training progress based on the currently trained policy at each evaluation point rather than the best policy so far. That tells us more about whether the learning is stable or fluctuates a lot.

And if we're making a diagram containing info about all experiment settings, I propose to plot just three things per setting: the best policy ever found, the mean performance over all evaluated policies, and the starting policy, arranged as three points stacked vertically, like in a boxplot.

--> Instead of doing that, we could plot the performance of all policies evaluated during an RL training run as a boxplot (or violin plot); see the sketch below. Note that while the best policy ever found gives us some information about the learning, it is not a practical choice. In reality, we may not want to spend the environment-sampling budget on frequent evaluation of the current policy (which is needed to identify the best one). It is more sample-efficient to simply take the policy at the end of the training horizon. Therefore, we want an RL algorithm with effective and robust behaviour, i.e., one that has converged by the end of training.
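Roughly what I have in mind, as a sketch (the `perfs_per_setting` dict, mapping an experiment-setting label to the performances of all policies evaluated during that setting's training run, is a made-up structure):

```python
import matplotlib.pyplot as plt

def plot_all_evaluations(perfs_per_setting, violin=False):
    """One box (or violin) per experiment setting, over all evaluated policies."""
    labels = list(perfs_per_setting)
    data = [perfs_per_setting[label] for label in labels]
    fig, ax = plt.subplots()
    if violin:
        ax.violinplot(data, showmedians=True)
    else:
        ax.boxplot(data)
    ax.set_xticks(range(1, len(labels) + 1))       # boxplot/violinplot positions default to 1..N
    ax.set_xticklabels(labels, rotation=45, ha="right")
    ax.set_ylabel("evaluation performance")
    fig.tight_layout()
    return fig
```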