Pechckin / MountainCar

Solution

Implementation details #1

Closed sramakrishnan247 closed 3 years ago

sramakrishnan247 commented 3 years ago

Thanks for sharing this implementation. I have a question regarding the reward update in Q-learning. Why do you use a modified reward here: https://github.com/Pechckin/MountainCar/blob/6754a33eba78cacd1881f00737ae841aa279292e/MountainCarContinuous-v0.py#L69

Also, why do I see a (1 - alpha) factor in the Q update here: https://github.com/Pechckin/MountainCar/blob/6754a33eba78cacd1881f00737ae841aa279292e/MountainCarContinuous-v0.py#L72

I learned Q-learning from Sutton's RL textbook.

Pechckin commented 3 years ago

Hi!

1) The modified reward is there to speed up learning: it motivates the agent to accelerate. The speed difference is added to the usual reward, and the larger it is, the better. This is a well-known modification for this particular environment.

2) You can find that form of the update on Wikipedia. There is another version of the formula, which you probably used, and it is also correct.
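For what it's worth, the two forms of the update are the same rule rearranged. A minimal check with made-up numbers (not taken from the repo):

```python
import numpy as np

# Hypothetical numbers, just to show the two update forms agree.
alpha, gamma = 0.1, 0.99
q_sa = 2.5                        # current Q(s, a)
target = 1.0 + gamma * 3.0        # r + gamma * max_a' Q(s', a')

wiki_form = (1 - alpha) * q_sa + alpha * target     # "(1 - alpha)" form
sutton_form = q_sa + alpha * (target - q_sa)        # Sutton's incremental form

assert np.isclose(wiki_form, sutton_form)           # identical up to rounding
```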


sramakrishnan247 commented 3 years ago

@Pechckin Can you provide a link to the formula? I couldn't find it on Wikipedia. The update that I use does not converge, but this one does, so I'm trying to understand what the reason might be.

Pechckin commented 3 years ago

https://medium.com/analytics-vidhya/q-learning-is-the-most-basic-form-of-reinforcement-learning-which-doesnt-take-advantage-of-any-8944e02570c5

You could also use the standard incremental form, Q(s, a) = Q(s, a) + alpha * (r + gamma * max_a' Q(s', a') - Q(s, a)).

It should work.


Pechckin commented 3 years ago

I just dropped the (1 - alpha) factor and it still works: `self.Q[state, action] = self.Q[state, action] + self.alpha * (modified_reward + self.gamma * np.max(self.Q[next_state]) - self.Q[state, action])`
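For context, here is a rough, self-contained sketch of how that update sits in a training loop. The bin counts, action set, and hyperparameters are illustrative guesses, not the exact values in MountainCarContinuous-v0.py, and it assumes the classic gym API (4-tuple step):

```python
import gym
import numpy as np

env = gym.make('MountainCarContinuous-v0')

# Illustrative discretization (the repo uses its own bins and hyperparameters).
pos_bins = np.linspace(-1.2, 0.6, 20)
vel_bins = np.linspace(-0.07, 0.07, 20)
actions = np.array([-1.0, 0.0, 1.0])                 # a few discretized forces

def discretize(obs):
    # map continuous (position, velocity) to a pair of bin indices
    return np.digitize(obs[0], pos_bins), np.digitize(obs[1], vel_bins)

Q = np.zeros((len(pos_bins) + 1, len(vel_bins) + 1, len(actions)))
alpha, gamma, eps = 0.1, 0.99, 0.1

for episode in range(50):
    state = discretize(env.reset())
    done = False
    while not done:
        a = np.random.randint(len(actions)) if np.random.rand() < eps else np.argmax(Q[state])
        obs, reward, done, info = env.step([actions[a]])
        next_state = discretize(obs)

        modified_reward = reward + 10 * abs(obs[1])        # shaping: reward gaining speed

        td_target = modified_reward + gamma * np.max(Q[next_state])
        Q[state][a] += alpha * (td_target - Q[state][a])   # no (1 - alpha) factor needed
        state = next_state
```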


sramakrishnan247 commented 3 years ago

@Pechckin What about the modified reward? It doesn't work without it. Did you get a chance to try that?

Pechckin commented 3 years ago

In reinforcement learning it is possible to modify the reward as long as it does not change the task itself. As I wrote in the README, there is a link to the article: https://people.eecs.berkeley.edu/~pabbeel/cs287-fa09/readings/NgHaradaRussell-shaping-ICML1999.pdf

I also know an article on this topic, but it is in Russian: https://habr.com/ru/company/hsespb/blog/444428/ Here is a quote from it (translated): "Our intuition tells us that to drive up the hill you need to build up speed. The higher the speed, the closer the agent is to solving the problem. You can tell it about this, for example, by adding the absolute value of the velocity, scaled by some coefficient, to the reward: modified_reward = reward + 10 * abs(new_state[1])"

"By changing the reward function we have changed the problem itself. Will the solution we find for the new problem be any good for the old one?

To begin with, let's understand what 'good' means in our case. In solving the problem we are trying to find the optimal policy, the one that maximizes the total reward per episode, so we can replace the word 'good' with the word 'optimal', because that is what we are looking for. We also optimistically hope that sooner or later our DQN will find an optimal solution to the modified problem rather than getting stuck in a local maximum. So the question can be reformulated as follows: if we have changed the reward function, and with it the problem itself, will the optimal solution we find for the new problem also be optimal for the old one?

As it turns out, we cannot give such a guarantee in the general case. The answer depends on how exactly we changed the reward function, how it was defined before, and how the environment itself is structured. Fortunately, there is a paper whose authors investigated how changing the reward function affects the optimality of the solution found."

"First, they found a whole class of 'safe' changes, those based on the potential method: the shaping term has the form F(s, a, s') = gamma * phi(s') - phi(s), where phi is a potential that depends only on the state. For such functions the authors were able to prove that if a solution is optimal for the new problem, it is also optimal for the old one.

Second, the authors showed that for any shaping term not of this form there exist an environment, a reward function R, and an optimal solution to the modified problem that is not optimal for the original problem. This means that we cannot guarantee the quality of the solution we have found if we use a change that is not based on the potential method.

Thus, using potential functions to modify the reward can only change the convergence rate of the algorithm; it does not affect the final solution.

Now that we know how to change the reward safely, let's try to modify the problem again, using the potential method instead of the naive heuristic."
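Concretely, a potential-based version of the heuristic above could look like this. This is only a sketch; the choice phi(s) = 10 * |velocity| just mirrors the naive term and is not necessarily what the article uses:

```python
# Potential-based reward shaping (Ng, Harada & Russell, 1999):
#   F(s, s') = gamma * phi(s') - phi(s), with phi depending only on the state.
gamma = 0.99

def phi(obs):
    # illustrative potential: proportional to the car's absolute velocity
    return 10 * abs(obs[1])

def shaped_reward(reward, obs, next_obs):
    # original environment reward plus the potential-based shaping term
    return reward + gamma * phi(next_obs) - phi(obs)
```

With a term of this form, the theorem quoted above says the shaping can change how fast learning converges, but not which policies are optimal.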


sramakrishnan247 commented 3 years ago

@Pechckin Thanks a lot. I will look into these resources.