Closed · detlefarend closed this issue 2 years ago
I think this is better for general use cases. At least it's more readable.
Mean reward alone should be enough, because the reward is calculated per cycle. So if the reward is high, the cycle count is low (in most cases). But in CartPole, a high reward means a high cycle count. So the mean reward should be enough. And I think there is no need to multiply it by eval_num_done, because if there is no done, there is no reward. The environment terminates either on reaching the goal or on the cycle limit, and returns a done state either way.
Oh, so it will always return done? If that is so, this is basically the mean reward. You are right.
It should be. That's how the training works. I don't know about the evaluation.
Question: can we be sure that a reward is ALWAYS >=0 ?
I don't think so..
Then the question is how we reward an evaluation where no episode terminated in a done state.
Didn't we agree to just use the mean? If there's no done, it won't make it into the list.
In my understanding we agreed that we compute the mean for every episode, and only if an episode terminated in a done state is this mean value summed up into the overall reward of the entire evaluation.
Yes, then what's the problem?
What is the overall reward of the evaluation if no episode terminated in a done state?
Maybe it should be 0 then
It doesn't matter. With or without done, the reward itself already represents everything. "Done" just means something extra on top of the reward. In the reward function itself, if the agent reaches the goal, there will be an extra reward. So it shouldn't matter.
If you want, just add it to the calculation: if it's done, give an extra score.
That means we just compute the mean reward over all eval cycles, independently of whether episodes terminated in a done state or not?
So, basically mean([sum(reward_cycle_epi_1), sum(reward_cycle_epi_2), sum(reward_cycle_epi_3), ...])
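As a quick sketch, that scoring could look like this in Python (a minimal illustration only; the function and variable names are hypothetical, assuming each episode's per-cycle rewards are collected in a list):

```python
from statistics import mean

def score_evaluation(episode_rewards):
    # Score = mean over episodes of the per-episode reward sum.
    # episode_rewards: one list of cycle rewards per evaluation episode.
    return mean(sum(rewards) for rewards in episode_rewards)

# Example: three evaluation episodes with their per-cycle rewards
episodes = [[1, 2, 1], [0, 1, 1, 0, 1], [2, 2]]
print(score_evaluation(episodes))  # (4 + 3 + 4) / 3 = 3.666...
```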
Currently I'm thinking of something more general, because if we base the score only on the reward, then if the user changes the reward function the score will change and won't be comparable with the score under the previous reward function.
And why not just mean of all cycle rewards?
Why not? Because an environment could have a discrete reward. For example, CartPole has a discrete reward. If you take the mean of all cycles, the score will always be 1.
> Currently I'm thinking of something more general, because if we base the score only on the reward, then if the user changes the reward function the score will change and won't be comparable with the score under the previous reward function.

I think we can skip this. We will restrict it to the same reward function; otherwise it gets complicated. So, forget this.
> Why not? Because an environment could have a discrete reward. For example, CartPole has a discrete reward. If you take the mean of all cycles, the score will always be 1.

Öhm, no. Say we have an episode with 5 cycles and env rewards 0, 1, 1, 0, 1; then the score is 3/5 = 0.6.
No, CartPole cannot have a 0 reward, only 1. Please check: https://github.com/openai/gym/blob/master/gym/envs/classic_control/cartpole.py
Also, it doesn't make sense at all if we take only the mean over all cycles. You need the reward of the whole episode, not per cycle. So it should be the mean of the episode rewards.
> No, CartPole cannot have a 0 reward, only 1. Please check: https://github.com/openai/gym/blob/master/gym/envs/classic_control/cartpole.py

Ok, wait: I am in a state s1 and carry out an action a that brings me to a subsequent state s2. The reward is by definition a function (s1, s2) -> r. And in the case of CartPole it's always 1???
Yes, you can check, and run the code if you want.
> Also, it doesn't make sense at all if we take only the mean over all cycles. You need the reward of the whole episode, not per cycle. So it should be the mean of the episode rewards.

For example, in an Atari game you don't take the score from every cycle or every action you take; the score comes at the end of the game, after game over.
That makes no sense, and they describe it differently here: https://gym.openai.com/envs/CartPole-v0/
> That makes no sense, and they describe it differently here: https://gym.openai.com/envs/CartPole-v0/

Yeah, anyway, please read this logic.
> Also, it doesn't make sense at all if we take only the mean over all cycles. You need the reward of the whole episode, not per cycle. So it should be the mean of the episode rewards.
> For example, in an Atari game you don't take the score from every cycle or every action you take; the score comes at the end of the game, after game over.
> That makes no sense, and they describe it differently here: https://gym.openai.com/envs/CartPole-v0/
This is old, should be CartPole-v1
The description of v0 is right. The reward is 1 only if the pole is within tolerance. See the original problem definition by Barto, Sutton and Anderson, 1983 (doi 10.1109/tsmc.1983.6313077).
Ok, my mistake! The reward is 1 as long as the pole is within tolerance. Otherwise the episode terminates.
Again, another analogy: if you play an arcade game, do you take the high score from every action you take, or at the end of the game?
Ok, back to sub-scoring every evaluation episode.
Currently I think the evaluation strategy depends on the env.
Cartpole: Try to keep the pole vertical. The longer, the better. Means: more cycles are better.
General control loop: Try to eliminate the control error as fast as possible. Means: more cycles are worse.
So for the scoring, if you take only the reward, you don't have to worry about which env it is anymore. The reward itself already represents everything.
So I will stick with the mean of the episode rewards.
Ok, we sum up the rewards of an episode, and that's its score.
An episode ends after the predefined cycle limit or if the env has broken (broken=True). We don't end the episode if the env goal has been reached (done=True).
Then the score of an evaluation could be the mean of the episode scores, or even just the sum. Not sure which is better.
Actually, I'm no longer sure about the separation of training and evaluation. Evaluation takes time, and we can score training episodes in the same way. Progress/stagnation can be detected from the scores of the training episodes as well. We can track the mean score and its rate of increase. Not as easy as tracking it on separate evaluation episodes, but much more performant.
Currently I think separate evaluation episodes don't really make sense. Do you have strong arguments for them, or shall we give up this approach?
See also #222
To be honest, having a separate evaluation is always good, as it isolates the information. The training log is good for a quick glance, but the result it shows isn't "stable": it might just be that the state has a high reward by chance (noise).
I would still go for what we have discussed. The problem is deciding on a universal way of scoring, as different environments also have different rewarding methods.
Personally I'd say the most general way to see this is to use the mean of the reward.
Yes, sum up the rewards of an episode. And if we have more than one episode, take the mean over them.
It doesn't make sense if you only sum up the reward, since an evaluation consists of more than one episode.
Evaluation is important when you have different datasets. In supervised learning, you divide the dataset into train and test. The same applies to reinforcement learning. For example, it can have a train objective and a test objective, or even different seeds: we separate train seeds and test seeds, so we have different sets of states for train and test.
But evaluation is not strictly necessary, since training itself is already enough. Still, it's a great feature.
Hi colleagues,
to avoid misunderstandings in the future, I think it is better to replace the word done with success in the context of mlpro, where success is an indicator that the env goal has been reached. It doesn't end an episode; it's just an indicator.
Sorry, it's an incompatible change, but I really want to make clear that our success is NOT the same as Gym's done. Gym's done is either an indicator for a broken env or that the maximum number of cycles/steps has been reached.
Careful, if it doesn't end an episode, it could result in overfitting for certain environments.
If we end an episode on success, then we lower the episodical reward sum. And I guess that Gym doesn't end an episode on success; otherwise CartPole would always terminate after 1 step, because CartPole only knows success or broken.
On the wrapper level, using just the interface of class gym.Env, we can't differentiate whether an env has run into a broken or a success state. If the env terminates before reaching the cycle limit per episode, we must assume that the env has broken.
Do you agree?
As I said, in my experience it could result in overfitting. I'm not talking about any specific environment. If you want to implement it, you still can; I'm not saying you cannot implement it.
As long as it's not hard-wired to never end an episode on success, it's ok for me.
> And I guess that Gym doesn't end an episode on success.

Please check: https://github.com/openai/gym/blob/master/gym/envs/classic_control/continuous_mountain_car.py
> If we end an episode on success, then we lower the episodical reward sum.

That depends on the reward structure. If you're talking about CartPole, yes, since every reward is positive. But there are environments that give a penalty for each step.
I think what I said before strayed too far into neural network training. But if you hard-wire it to never end an episode on success, then first, the neural network could overfit, and second, not all environments are like that.
So, in my opinion, keep implementing that feature, but don't make it always-on. Also, a success status is a great feature to have; it can be useful for the reports later.
Suggestion: we add a further parameter p_end_epi_on_success (True/False) to class RLTraining.
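A minimal sketch of how such a flag could gate the episode loop (ToyEnv and run_episode are hypothetical stand-ins, not mlpro API; the env's step() here separates success from broken, as discussed above):

```python
class ToyEnv:
    """Hypothetical stand-in env: reward 1 per step, success on step 3."""
    def __init__(self):
        self._t = 0
    def reset(self):
        self._t = 0
    def step(self, action):
        self._t += 1
        success = (self._t == 3)
        broken = False
        return 1.0, success, broken

def run_episode(env, p_cycle_limit, p_end_epi_on_success=False):
    # Collect cycle rewards; end on broken or on the cycle limit,
    # and optionally (per the suggested flag) on success.
    rewards = []
    env.reset()
    for _ in range(p_cycle_limit):
        reward, success, broken = env.step(None)
        rewards.append(reward)
        if broken or (success and p_end_epi_on_success):
            break
    return rewards

print(len(run_episode(ToyEnv(), 5, p_end_epi_on_success=True)))   # 3
print(len(run_episode(ToyEnv(), 5, p_end_epi_on_success=False)))  # 5
```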
But one problem remains: it can lower the score even when the agent becomes better.
Example:
Episode 1: 5 cycles (no success, no broken), rewards 1, 2, 1, 2, 1, sum 7
Episode 2: 3 cycles (then success), rewards 1, 2, 3, sum 6
How can we deal with this?
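The numbers above can be checked directly; normalizing by cycle count is one possible remedy (an assumption sketched here, not something already decided in this thread):

```python
epi_1 = [1, 2, 1, 2, 1]  # 5 cycles, no success
epi_2 = [1, 2, 3]        # success after 3 cycles

# Plain reward sums: the better agent (epi_2) scores LOWER.
print(sum(epi_1), sum(epi_2))  # 7 6

# Mean reward per cycle instead: the better agent scores higher.
print(sum(epi_1) / len(epi_1), sum(epi_2) / len(epi_2))  # 1.4 2.0
```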
Then your reward structure is bad.
In RL, the reward structure is also an important thing. If the user doesn't structure it well, then either the agent can't learn or it just results in a bad report.
So, basically, it is not our problem.
The first approach
score = max( self._eval_num_done - self._eval_num_limit - self._eval_num_broken, 0) * mean(self._eval_max_reward) * self._eval_factor / self._eval_num_cycles
is a little bit complicated and carries the risk of premature stagnation at the beginning of the training because of the first term.
After a detailed team discussion we agreed on the following simplified approach:
score = sum( mean_reward_epi )
where mean_reward_epi is the mean reward over all cycles of an evaluation episode. It is set to 0 if the env goal was not reached. See method RLTraining._close_evaluation().
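Read as Python, the agreed scoring might be sketched like this (an illustration only; the actual logic lives in RLTraining._close_evaluation(), and the episode representation used here is an assumption):

```python
from statistics import mean

def eval_score(episodes):
    # episodes: list of (cycle_rewards, goal_reached) pairs.
    # Each episode contributes the mean reward over its cycles,
    # or 0 if the env goal was not reached; the score is the sum.
    return sum(mean(rewards) if goal_reached else 0.0
               for rewards, goal_reached in episodes)

episodes = [
    ([1, 2, 3], True),      # goal reached -> contributes mean 2.0
    ([1, 1, 1, 1], False),  # goal not reached -> contributes 0.0
]
print(eval_score(episodes))  # 2.0
```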