[Closed] GarfieldF closed this issue 4 years ago
```python
td_error = self.greedy[i] * (
    self.estimations[states[i + 1]] - self.estimations[state]
)
```

```python
if np.random.rand() < self.epsilon:
    action = next_positions[np.random.randint(len(next_positions))]
    action.append(self.symbol)
    self.greedy[-1] = False
```

This operation would make exploration meaningless, wouldn't it?
No, because the TD errors for the transitions after the exploration step are not zero.
In

```python
td_error = self.greedy[i] * (
    self.estimations[states[i + 1]] - self.estimations[state]
)
```

`greedy` is `False` only for the step where exploration happened, so only that step's TD error is zeroed; the updates for the episode's other (greedy) transitions still go through.
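A minimal sketch of this point (hypothetical, simplified from the tic-tac-toe example — the state names, values, and `step_size` here are made up for illustration): in a backward pass over one episode, the step marked non-greedy contributes a zero TD error, but the transition after it is still backed up normally.

```python
step_size = 0.1

# Hypothetical state values; "s3" is a terminal winning state.
estimations = {"s0": 0.5, "s1": 0.5, "s2": 0.5, "s3": 1.0}

# One episode's trajectory; the move out of "s1" was exploratory.
states = ["s0", "s1", "s2", "s3"]
greedy = [True, False, True]

# Backup loop mirroring td_error = greedy[i] * (V(s_{i+1}) - V(s_i)).
# Multiplying by the boolean zeroes the exploratory step's update only.
for i in reversed(range(len(states) - 1)):
    state = states[i]
    td_error = greedy[i] * (estimations[states[i + 1]] - estimations[state])
    estimations[state] += step_size * td_error

# "s2" (after the exploration step) is still updated toward the win;
# "s1" is skipped this episode, but its improved successor estimate
# will feed back through future greedy episodes.
print(estimations)
```

So exploration costs one update on one state for one episode; it does not silence learning for the rest of the trajectory.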