Hi, I'm using a Q-learning type of reinforcement learning. The policy is P(s) = argmax{a}(Q(s, a)); it's a greedy algorithm, so no random actions are taken. If you want to have a look at the core function, check the "while True" loop of the run function.
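For readers who want that greedy rule in code form, here is a minimal sketch; the names (greedy_policy, q_table, actions) are hypothetical and not taken from the repo:

```python
# A minimal sketch of the greedy policy P(s) = argmax{a}(Q(s, a)).
# Hypothetical names: `q_table` is assumed to be a dict mapping
# (state, action) pairs to floats, and `actions` is the action set.
def greedy_policy(q_table, state, actions):
    # Pick the action with the highest Q value; max() breaks ties by
    # keeping the first action it encounters.
    return max(actions, key=lambda a: q_table.get((state, a), 0.0))
```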
Thank you for your fast answer. So your training algorithm is the one in inc_Q and your policy is in max_Q, right?
inc_Q corresponds to the operation Q(s, a) <- Q(s, a) * (1 - alpha) + alpha * increment. (Note that I forgot to put the factor (1 - alpha) in my code.) Here alpha is the learning rate, and increment is computed as r + discount * max_val, where r is the reward, discount is the discount factor (sometimes called gamma), and max_val is max{a'}(Q(s', a')).
All in all, I am just solving the classic Q-learning equations. If you're looking for a lecture about it, check the one from Littman on Udacity.
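For concreteness, here is a minimal sketch of that update, assuming a dict-based Q-table where missing entries default to 0.0; the names (inc_q, q_table, actions) are hypothetical, not the repo's actual code:

```python
# Sketch of Q(s, a) <- Q(s, a) * (1 - alpha) + alpha * increment,
# with increment = r + discount * max{a'}(Q(s', a')).
# `q_table` maps (state, action) pairs to floats.
def inc_q(q_table, s, a, r, s_next, actions, alpha=0.1, discount=0.9):
    # max_val = max{a'}(Q(s', a')): best estimated value of the next state.
    max_val = max(q_table.get((s_next, a2), 0.0) for a2 in actions)
    increment = r + discount * max_val
    q_table[(s, a)] = (1 - alpha) * q_table.get((s, a), 0.0) + alpha * increment
```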
So for example, checking the Q-learning algorithm here: Q-Learning, I see that the formula is:
Q(s, a) <- Q(s, a) + alpha[r + discount * max{a'}(Q(s',a')) - Q(s, a)]
but I checked that your formula is:
Q(s, a) <- Q(s, a) + alpha[r + discount * max{a'}(Q(s',a'))]
So you omit the subtraction of the current estimate, - Q(s, a). Could you explain why, and how this affects the results?
Exactly, I didn't subtract the Q(s, a) component, as mentioned in my previous comment. The formula Q(s, a) <- Q(s, a) * (1 - alpha) + alpha[r + discount * max{a'}(Q(s',a'))] would also be valid: expanding the standard update Q(s, a) + alpha[r + discount * max{a'}(Q(s',a')) - Q(s, a)] and collecting the Q(s, a) terms gives exactly this (1 - alpha) form, so the two are algebraically identical.
This is an error on my part, and it actually has a bad effect on the algorithm: if you print the Q values after many iterations, you'll see that they don't converge.
I should fix it.
EDIT: It's fixed!
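To see that non-convergence concretely, one cheap check is the largest per-entry change in the Q-table between training sweeps; a sketch with hypothetical names (max_q_delta, q_before, q_after):

```python
# Largest absolute change between two snapshots of the Q-table.
# With the buggy update (no -Q(s, a) / (1 - alpha) term) the Q values
# grow without bound, so this delta never shrinks toward zero.
def max_q_delta(q_before, q_after):
    keys = set(q_before) | set(q_after)
    return max(abs(q_after.get(k, 0.0) - q_before.get(k, 0.0)) for k in keys)
```

Printing this delta every few hundred updates makes the divergence described above easy to spot.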
Hi, which selection policy do you use: greedy, e-greedy, or softmax? And where is that part in your code? For example, where should I look to change that method?
Thank you.