Closed justin545 closed 5 years ago
I think it has been over 1.5 years since I worked on this. But as far as I remember, the states are basically images of the game screen. The neural network estimates Q-values from those images, which are then used to take actions. Depending on the action taken and the resulting scores, the Q-values are updated. You can get different outcomes from the same state if the environment or the later actions are stochastic. When we batch-train the neural network on many samples whose input values (states) are identical but whose output values (Q-values) differ, the network probably learns an average of those Q-values. I don't see a way to do it differently. How would you identify which Q-value is correct (maybe the max)? And how would you implement this filtering of the replay memory efficiently?
I'm closing this issue but you're very welcome to experiment with the code and report back what you found out, in case someone in the future has a similar question.
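To illustrate the "learns an average" point above, here is a minimal sketch (not from the tutorial code): minimizing mean-squared error over several different targets for one repeated input drives the prediction toward the mean of those targets. The numbers are made up for the demo.

```python
import numpy as np

# A single scalar prediction y_hat stands in for the network's output
# for one repeated state; `targets` are the differing Q-value targets
# observed for that same state. Gradient descent on the MSE converges
# to the mean of the targets.
targets = np.array([1.0, 3.0, 8.0])  # Q-value targets for one repeated state
y_hat = 0.0                          # the model's single output for that state
lr = 0.1
for _ in range(200):
    grad = 2.0 * np.mean(y_hat - targets)  # d/dy_hat of mean((y_hat - y)^2)
    y_hat -= lr * grad

print(round(y_hat, 4))  # converges to the mean of the targets: 4.0
```

This is why batch-training on identical states with conflicting Q-value targets tends toward their average rather than picking one of them.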
As far as I know, `reinforcement_learning.py` keeps a history of observed states in class `ReplayMemory`'s member `states[]` and keeps the corresponding Q-values in member `q_values[]` of the same class. Besides, once the game has been played long enough, there is a chance that equal states are observed at different times, which means the same state could appear at multiple indices of `states[]`.

I found two lines in the code that update the Q-values:
Suppose the three states `self.states[2]`, `self.states[5]` and `self.states[7]` are equal (verified with `np.array_equal()`). Because equal states should refer to the same Q-values, `self.q_values[2, action]`, `self.q_values[5, action]` and `self.q_values[7, action]` should hold the same values after the update. But it seems that the two lines of code only update `self.q_values[7, action]` and leave `self.q_values[2, action]` and `self.q_values[5, action]` intact when `k=7`.

So do I misunderstand the code, and does it matter?