PhilippeMorere / BasicReinforcementLearning

Simple Reinforcement learning example, based on the Q-function.

Selection Policy question #1

Closed eLRuLL closed 9 years ago

eLRuLL commented 9 years ago

Hi, which selection policy do you use: greedy, epsilon-greedy, or softmax? And where is that part in your code? For example, where should I look if I wanted to change the selection method?

Thank you.

PhilippeMorere commented 9 years ago

Hi, I'm using a Q-learning type of reinforcement learning. The policy is P(s) = argmax{a}(Q(s,a)). It's a greedy policy, so no random actions are taken. If you want to have a look at the core logic, check the "while True" loop of the run function.
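
For illustration, here is a minimal sketch of what that greedy selection could look like, plus an epsilon-greedy variant if you want exploration. The dict-based Q-table and function names are assumptions for the example, not the repo's actual code:

```python
import random

def greedy_action(Q, state, actions):
    # Greedy policy: P(s) = argmax_a Q(s, a)
    return max(actions, key=lambda a: Q.get((state, a), 0.0))

def epsilon_greedy_action(Q, state, actions, epsilon=0.1):
    # Epsilon-greedy variant: with probability epsilon, pick a random action.
    if random.random() < epsilon:
        return random.choice(actions)
    return greedy_action(Q, state, actions)
```

Swapping greedy_action for epsilon_greedy_action in the action-selection step would be one way to change the policy.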

eLRuLL commented 9 years ago

Thank you for your fast answer. So your training update is the one in inc_Q and your policy is in max_Q, right?

PhilippeMorere commented 9 years ago

inc_Q corresponds to the operation Q(s, a) <- Q(s, a) * (1 - alpha) + alpha * increment (note that I forgot the factor (1 - alpha) in my code), where alpha is the learning rate and increment is computed as r + discount * max_val. Here r is the reward, discount is the discount factor (sometimes called gamma), and max_val is max{a'}(Q(s',a')).
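
As a rough sketch of that update (the dict-based Q-table and the helper below are illustrative assumptions, not the exact repo code):

```python
def inc_Q(Q, state, action, alpha, increment):
    # Q(s, a) <- Q(s, a) * (1 - alpha) + alpha * increment
    key = (state, action)
    Q[key] = Q.get(key, 0.0) * (1 - alpha) + alpha * increment

def q_learning_step(Q, state, action, reward, next_state, actions, alpha, discount):
    # increment = r + discount * max_{a'} Q(s', a')
    max_val = max(Q.get((next_state, a), 0.0) for a in actions)
    inc_Q(Q, state, action, alpha, reward + discount * max_val)
```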

All in all, I am just solving the classic Q-learning equations. If you're looking for a lecture about it, check the one from Littman on Udacity.

eLRuLL commented 9 years ago

So, for example, checking the Q-learning algorithm here (Q-Learning), I see that the formula is:

Q(s, a) <- Q(s, a) + alpha[r + discount * max{a'}(Q(s',a')) - Q(s, a)]

but I see that your formula is:

Q(s, a) <- Q(s, a) + alpha[r + discount * max{a'}(Q(s',a'))]

So you omit the subtraction of the current value, Q(s, a). Could you explain why, and how this affects the results?

PhilippeMorere commented 9 years ago

Exactly, I didn't subtract the Q(s, a) component, as mentioned in my previous comment. The formula Q(s, a) <- Q(s, a) * (1 - alpha) + alpha * [r + discount * max{a'}(Q(s',a'))] would also be valid. This is an error on my part and it actually has a bad effect on the algorithm: if you print the Q values after many iterations, you'll see that they don't converge. I should fix it.
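
For reference, the (1 - alpha) form and the standard TD form are the same update written differently: expanding Q(s, a) * (1 - alpha) + alpha * target gives Q(s, a) + alpha * (target - Q(s, a)). A quick numeric check with made-up values:

```python
alpha, discount = 0.5, 0.9
q_sa, reward, max_next = 2.0, 1.0, 3.0        # illustrative values only

target = reward + discount * max_next          # r + discount * max_{a'} Q(s', a')
form_1 = q_sa * (1 - alpha) + alpha * target   # convex-combination form
form_2 = q_sa + alpha * (target - q_sa)        # standard TD form
assert abs(form_1 - form_2) < 1e-12            # both give 2.85
```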

EDIT: It's fixed!