Closed · detlefarend closed this issue 2 years ago
I think this is better for general use cases. At least it's more readable.
Mean reward alone should be enough, because the reward is calculated per cycle. So if the reward is high, the cycle count is low (in most cases). But in CartPole, a high reward means a high cycle count. So the mean reward should be enough. And I think there is no need to multiply it by eval_num_done, because if there is no done, there is no reward. The environment terminates either on reaching the goal or on the cycle limit, and returns a done state either way.
Oh, so it will always return done? If that is so, this is basically the mean reward. You are right.
It should be. That's how the training works. I don't know about the evaluation.
Question: can we be sure that a reward is ALWAYS >=0 ?
I don't think so..
Then the question is how we reward an evaluation where no episode terminated in a done state.
Didn't we agree to just use the mean? If there's no done, it won't make it into the list.
In my understanding we agreed that we compute the mean for every episode, and only if an episode terminated in a done state is this mean value summed up into the overall reward of the entire evaluation.
Yes, then what's the problem?
What is the overall reward of the evaluation if no episode terminated in a done state?
Maybe it should be 0 then
It doesn't matter. With or without done, the reward itself already represents everything. "Done" just means something extra on top of the reward. In the reward function itself, if the agent reaches the goal, there will be an extra reward. So it shouldn't matter.
If you want, just add it to the calculation: if it's done, give an extra score.
That means we just compute the mean reward over all eval cycles, independently of whether episodes terminated in a done state or not?
So, basically mean([sum(reward_cycle_epi_1), sum(reward_cycle_epi_2), sum(reward_cycle_epi_3), ...])
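As a quick sketch, that scoring could look like this in Python (a minimal illustration only; the function and variable names are hypothetical, assuming each episode's per-cycle rewards are collected in a list):

```python
from statistics import mean

def score_evaluation(episode_rewards):
    # Score = mean over episodes of the per-episode reward sum.
    # episode_rewards: one list of cycle rewards per evaluation episode.
    return mean(sum(rewards) for rewards in episode_rewards)

# Example: three evaluation episodes with their per-cycle rewards
episodes = [[1, 2, 1], [0, 1, 1, 0, 1], [2, 2]]
print(score_evaluation(episodes))  # (4 + 3 + 4) / 3 = 3.666...
```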
Currently I'm thinking of something more general, because if we base the score only on the reward, then if the user changes the reward function the score will change and won't be comparable with the score under the previous reward function.
And why not just mean of all cycle rewards?
Why not? Because an environment could have a discrete reward. For example, CartPole has a discrete reward. If you take the mean of all cycles, the score will always be 1.
> Currently I'm thinking of something more general, because if we base the score only on the reward, then if the user changes the reward function the score will change and won't be comparable with the score under the previous reward function.

I think we can skip this. We will restrict it to the same reward function; otherwise it gets complicated. So, forget this.
> Why not? Because an environment could have a discrete reward. For example, CartPole has a discrete reward. If you take the mean of all cycles, the score will always be 1.

Öhm, no. Say we have an episode with 5 cycles and env rewards 0, 1, 1, 0, 1; then the score is 3/5 = 0.6.
No, CartPole cannot have a 0 reward, only 1. Please check: https://github.com/openai/gym/blob/master/gym/envs/classic_control/cartpole.py
Also, it doesn't make sense at all if we take only the mean over all cycles. You need the reward of the whole episode, not per cycle. So it should be the mean of the episode rewards.
> No, CartPole cannot have a 0 reward, only 1. Please check: https://github.com/openai/gym/blob/master/gym/envs/classic_control/cartpole.py

Ok, wait: I am in a state s1 and carry out an action a that brings me to a subsequent state s2. The reward is by definition a function (s1, s2) -> r. And in the case of CartPole it's always 1???
Yes, you can check, and run the code if you want.
> Also, it doesn't make sense at all if we take only the mean over all cycles. You need the reward of the whole episode, not per cycle. So it should be the mean of the episode rewards.

For example, in an Atari game you don't take the score from every cycle or every action you take; the score comes at the end of the game, after game over.
That makes no sense, and they describe it differently here: https://gym.openai.com/envs/CartPole-v0/
> That makes no sense, and they describe it differently here: https://gym.openai.com/envs/CartPole-v0/

Yeah, anyway, please read this logic.
> Also, it doesn't make sense at all if we take only the mean over all cycles. You need the reward of the whole episode, not per cycle. So it should be the mean of the episode rewards.
> For example, in an Atari game you don't take the score from every cycle or every action you take; the score comes at the end of the game, after game over.
> That makes no sense, and they describe it differently here: https://gym.openai.com/envs/CartPole-v0/
This is old, should be CartPole-v1
The description of v0 is right. The reward is 1 only if the pole is within tolerance. See the original problem definition by Barto, Sutton and Anderson, 1983 (doi 10.1109/tsmc.1983.6313077).
Ok, my mistake! The reward is 1 as long as the pole is within tolerance. Otherwise the episode terminates.
Again, another analogy: if you play an arcade game, do you take the high score from every action you take, or at the end of the game?
Ok, back to sub-scoring every evaluation episode.
Currently I think the evaluation strategy depends on the env.
Cartpole: Try to keep the pole vertical. The longer, the better. Means: more cycles are better.
General control loop: Try to eliminate the control error as fast as possible. Means: more cycles are worse.
So for the scoring, if you take only the reward, you don't have to worry about which env it is anymore. The reward itself already represents everything.
So I will stick with the mean of the episode rewards.
Ok, we sum up the rewards of an episode, and that's its score.
An episode ends after the predefined cycle limit or if the env has broken (broken=True). We don't end the episode if the env goal has been reached (done=True).
Then the score of an evaluation could be the mean of the episode scores, or even just the sum. Not sure which is better.
Actually, I'm no longer sure about the separation of training and evaluation. Evaluation takes time, and we can score training episodes in the same way. Progress/stagnation can be detected from the scores of the training episodes as well. We can track the mean score and its rate of increase. Not as easy as tracking it on separate evaluation episodes, but much more performant.
Currently I think separate evaluation episodes don't really make sense. Do you have strong arguments for them, or shall we give up this approach?
See also #222
To be honest, having a separate evaluation is always good, as it isolates the information. The training log is good for a quick glance, but the result it shows isn't "stable": it might just be that the state has a high reward by chance (noise).
I would still go for what we have discussed. The problem is deciding on a universal way of scoring, as different environments also have different rewarding methods.
Personally I'd say the most general way to see this is to use the mean of the reward.
Yes, sum up the rewards of an episode. And if we have more than one episode, take the mean over them.
It doesn't make sense if you only sum up the reward, since an evaluation consists of more than one episode.
Evaluation is important when you have different datasets. In supervised learning, you divide the dataset into train and test. The same applies to reinforcement learning. For example, it can have a train objective and a test objective, or even different seeds: we separate train seeds and test seeds, so we have different sets of states for train and test.
But evaluation is not strictly necessary, since training itself is already enough. Still, it's a great feature.
Hi colleagues,
to avoid misunderstandings in the future, I think it is better to replace the word done with success in the context of mlpro, where success is an indicator that the env goal has been reached. It doesn't end an episode; it's just an indicator.
Sorry, it's an incompatible change, but I really want to make clear that our success is NOT the same as Gym's done. Gym's done is either an indicator for a broken env or that the maximum number of cycles/steps has been reached.
Careful, if it doesn't end an episode, it could result in overfitting for certain environments.
If we end an episode on success, then we lower the episodical reward sum. And I guess that Gym doesn't end an episode on success; otherwise CartPole would always terminate after 1 step, because CartPole only knows success or broken.
On the wrapper level, using just the interface of class gym.Env, we can't differentiate whether an env has run into a broken or a success state. If the env terminates before reaching the cycle limit per episode, we must assume that the env has broken.
Do you agree?
As I said, in my experience it could result in overfitting. I'm not talking about any specific environment. If you want to implement it, you still can; I'm not saying you cannot implement it.
As long as it's not hard-wired to never end an episode on success, it's ok for me.
> And I guess that Gym doesn't end an episode on success.

Please check: https://github.com/openai/gym/blob/master/gym/envs/classic_control/continuous_mountain_car.py
> If we end an episode on success, then we lower the episodical reward sum.

That depends on the reward structure. If you're talking about CartPole, yes, since every reward is positive. But there are environments that give a penalty for each step.
I think what I said before strayed too far into neural network training. But if you hard-wire it to never end an episode on success, then first, the neural network could overfit, and second, not all environments are like that.
So, in my opinion, keep implementing that feature, but don't make it always-on. Also, a success status is a great feature to have; it can be useful for the reports later.
Suggestion: we add a further parameter p_end_epi_on_success (True/False) to class RLTraining.
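A minimal sketch of how such a flag could gate the episode loop (ToyEnv and run_episode are hypothetical stand-ins, not mlpro API; the env's step() here separates success from broken, as discussed above):

```python
class ToyEnv:
    """Hypothetical stand-in env: reward 1 per step, success on step 3."""
    def __init__(self):
        self._t = 0
    def reset(self):
        self._t = 0
    def step(self, action):
        self._t += 1
        success = (self._t == 3)
        broken = False
        return 1.0, success, broken

def run_episode(env, p_cycle_limit, p_end_epi_on_success=False):
    # Collect cycle rewards; end on broken or on the cycle limit,
    # and optionally (per the suggested flag) on success.
    rewards = []
    env.reset()
    for _ in range(p_cycle_limit):
        reward, success, broken = env.step(None)
        rewards.append(reward)
        if broken or (success and p_end_epi_on_success):
            break
    return rewards

print(len(run_episode(ToyEnv(), 5, p_end_epi_on_success=True)))   # 3
print(len(run_episode(ToyEnv(), 5, p_end_epi_on_success=False)))  # 5
```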
But one problem remains: it can lower the score even when the agent becomes better.
Example:
Episode 1: 5 cycles (no success, no broken), rewards 1, 2, 1, 2, 1, sum 7
Episode 2: 3 cycles (then success), rewards 1, 2, 3, sum 6
How can we deal with this?
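The numbers above can be checked directly; normalizing by cycle count is one possible remedy (an assumption sketched here, not something already decided in this thread):

```python
epi_1 = [1, 2, 1, 2, 1]  # 5 cycles, no success
epi_2 = [1, 2, 3]        # success after 3 cycles

# Plain reward sums: the better agent (epi_2) scores LOWER.
print(sum(epi_1), sum(epi_2))  # 7 6

# Mean reward per cycle instead: the better agent scores higher.
print(sum(epi_1) / len(epi_1), sum(epi_2) / len(epi_2))  # 1.4 2.0
```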
Then your reward structure is bad.
In RL, the reward structure is also an important thing. If the user doesn't structure it well, then either the agent can't learn or it just results in a bad report.
So, basically, it is not our problem.
The first approach
score = max( self._eval_num_done - self._eval_num_limit - self._eval_num_broken, 0) * mean(self._eval_max_reward) * self._eval_factor / self._eval_num_cycles
is a little bit complicated and carries the risk of premature stagnation at the beginning of the training because of the first term.
After a detailed team discussion we agreed on the following simplified approach:
score = sum( mean_reward_epi )
where mean_reward_epi is the mean reward over all cycles of an evaluation episode. It is set to 0 if the env goal was not reached. See method RLTraining._close_evaluation().
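Read as Python, the agreed scoring might be sketched like this (an illustration only; the actual logic lives in RLTraining._close_evaluation(), and the episode representation used here is an assumption):

```python
from statistics import mean

def eval_score(episodes):
    # episodes: list of (cycle_rewards, goal_reached) pairs.
    # Each episode contributes the mean reward over its cycles,
    # or 0 if the env goal was not reached; the score is the sum.
    return sum(mean(rewards) if goal_reached else 0.0
               for rewards, goal_reached in episodes)

episodes = [
    ([1, 2, 3], True),      # goal reached -> contributes mean 2.0
    ([1, 1, 1, 1], False),  # goal not reached -> contributes 0.0
]
print(eval_score(episodes))  # 2.0
```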