Hvass-Labs / TensorFlow-Tutorials

TensorFlow Tutorials with YouTube Videos

Tutorial 16: Should equal states in the replay memory share the same Q values? #92

Closed. justin545 closed this issue 5 years ago.

justin545 commented 5 years ago

As far as I know, reinforcement_learning.py keeps a history of observed states in the states[] member of class ReplayMemory and keeps the corresponding Q values in the q_values[] member of the same class. Besides, I think that once the game has been played enough, there is a chance that equal states are observed at different times, which means the same state could appear multiple times at different indices of states[].

I found the two lines in the code that update the Q values:

```python
action_value = reward + self.discount_factor * np.max(self.q_values[k + 1])
self.q_values[k, action] = action_value
```
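
For context, my understanding of the surrounding update loop (a rough paraphrase of what I recall being called update_all_q_values(), not the exact repo code; the standalone function below is just my sketch) is:

```python
import numpy as np

def sweep_q_values(q_values, actions, rewards, end_episode, discount_factor):
    """Backward sweep over the replay memory (paraphrased sketch).

    Each index k is updated in isolation; duplicates of states[k] stored at
    other indices are not touched, which is exactly what my question is about.
    """
    num_used = len(rewards)
    for k in reversed(range(num_used - 1)):
        action = actions[k]
        reward = rewards[k]

        if end_episode[k]:
            # No future reward once the episode has ended.
            action_value = reward
        else:
            # Bellman update: reward plus discounted max Q value of the next state.
            action_value = reward + discount_factor * np.max(q_values[k + 1])

        q_values[k, action] = action_value
```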

Suppose three states self.states[2], self.states[5] and self.states[7] are equal (as verified by np.array_equal()). Because equal states should refer to the same Q values, self.q_values[2, action], self.q_values[5, action] and self.q_values[7, action] should all hold the same value after the update. But it seems these two lines only update self.q_values[7, action] when k=7 and leave self.q_values[2, action] and self.q_values[5, action] intact.
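
To make the concern concrete, here is the kind of extra step I have in mind; find_duplicate_indices is a hypothetical helper of mine, not something in the repo:

```python
import numpy as np

def find_duplicate_indices(states, k):
    """Hypothetical helper: indices i < k holding a state equal to states[k]."""
    return [i for i in range(k) if np.array_equal(states[i], states[k])]

# After writing q_values[k, action], one could in principle propagate the new
# value to every earlier occurrence of the same state, e.g.:
#     for i in find_duplicate_indices(states, k):
#         q_values[i, action] = q_values[k, action]
```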

So do I misunderstand the code, and does it matter?

Hvass-Labs commented 5 years ago

I think it has been over 1.5 years since I worked on this. But as far as I remember, the states are basically images from the game-screen. The neural network is used to estimate q-values from those images, which are then used to take actions. Depending on the action taken and the resulting scores, the q-values are updated. You could have different outcomes from the same state if the environment or the later actions are stochastic.

When we batch-train the neural network on many samples whose input-values (states) are identical but whose output-values (q-values) are different, the network probably learns an average of those q-values. I don't see a way to do it differently. How would you identify which q-value is correct (maybe the max)? And how would you implement this filtering of the replay memory efficiently?
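
If someone does want to experiment with this, one possible (untested) approach would be to hash the stored states and group identical ones, then decide how to merge the Q-values within each group. Everything below is just a sketch; none of these names exist in the tutorial code:

```python
import hashlib
from collections import defaultdict

import numpy as np

def group_equal_states(states, num_used):
    """Sketch only: group indices whose stored states are byte-identical."""
    groups = defaultdict(list)
    for k in range(num_used):
        key = hashlib.sha1(np.ascontiguousarray(states[k]).tobytes()).hexdigest()
        groups[key].append(k)
    # Keep only groups that actually contain duplicates.
    return [idx for idx in groups.values() if len(idx) > 1]

# One could then average (or take the max of) the q_values rows in each group,
# although batch-training the network on the raw memory already tends to
# average the targets for identical input states implicitly.
```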

I'm closing this issue but you're very welcome to experiment with the code and report back what you found out, in case someone in the future has a similar question.