2017-fall-DL-training-program / Reinforcement_Learning

MTK Deep Learning (RL part)

Why is "experience replay" a reasonable approach? #6

Closed · bcpenggh closed this issue 6 years ago

bcpenggh commented 6 years ago

Dear TA,

Here is the pseudocode from reference [1]:

[screenshot: deep Q-learning with experience replay pseudocode from [1]]
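For readability, here are the key lines of that pseudocode written out (paraphrased from [1]; the colours refer to the underlines I mention below):

```latex
% red underline:   sample a random minibatch of transitions (\phi_j, a_j, r_j, \phi_{j+1}) from D
% green underline: the \max_{a'} Q term inside the target
y_j =
\begin{cases}
  r_j                                              & \text{if } \phi_{j+1} \text{ is terminal} \\
  r_j + \gamma \max_{a'} Q(\phi_{j+1}, a'; \theta) & \text{otherwise}
\end{cases}
% blue underline:  gradient descent step on the loss
L_j = \bigl( y_j - Q(\phi_j, a_j; \theta) \bigr)^2
```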

y_j is the estimated target action value, and the equation marked with the blue underline looks reasonable. However, if I understand correctly, y_j is the sum of the current reward and the maximum reward estimated for the future. The term with the green underline should therefore be the sum of future rewards, but it is computed from a transition chosen at random from past experience, as marked by the red underline. In that case, is y_j still an estimate of the sum of future rewards? [2] says this helps the training converge, but why is y_j in this case still a reasonable target for the action value?

BIGBALLON commented 6 years ago

Hello, @bcpenggh

First, we need to know that reinforcement learning has two kinds of methods: on-policy and off-policy. Q-learning is an off-policy algorithm, and so is DQN.

On-policy methods attempt to evaluate or improve the policy that is used to make decisions, whereas off-policy methods evaluate or improve a policy different from that used to generate the data.

Why experience replay?

For example, if the maximizing action is to move left then the training samples will be dominated by samples from the left-hand side; if the maximizing action then switches to the right then the training distribution will also switch. It is easy to see how unwanted feedback loops may arise and the parameters could get stuck in a poor local minimum, or even diverge catastrophically. By using experience replay the behavior distribution is averaged over many of its previous states, smoothing out learning and avoiding oscillations or divergence in the parameters. Note that when learning by experience replay, it is necessary to learn off-policy (because our current parameters are different to those used to generate the sample), which motivates the choice of Q-learning.
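To make the mechanism concrete, here is a minimal replay-memory sketch (class and parameter names are just for illustration, not the course code):

```python
import random
from collections import deque

class ReplayMemory:
    """Fixed-size buffer D of transitions (s, a, r, s_next, done)."""

    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)  # oldest transitions fall out automatically

    def push(self, s, a, r, s_next, done):
        # The whole transition is stored as ONE tuple, so s_next stays
        # paired with the action a that actually produced it.
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        # Uniform random sampling breaks the correlation between consecutive
        # samples, which is exactly what experience replay is for.
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```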

In that case, is y_j still an estimate of the sum of future rewards? [2] says this helps the training converge, but why is y_j in this case still a reasonable target for the action value?

Yes, y_j is still an estimate of the sum of future rewards. The red underline samples a transition at random from D, the green underline then takes the max over Q at the next state of that transition to build the target y_j, and the blue underline is the loss (y_j - Q(s_j, a_j))^2, which is reasonable.

The loop is:

sample a transition from D -> use the NN to estimate the target -> update the weights -> the NN gets stronger ->
sample a transition from D -> use the NN to estimate the target -> update the weights -> the NN gets stronger -> ...
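One iteration of that loop, with a tiny linear Q-function in NumPy (the network, state size, and hyper-parameters here are made up, only to show the shape of the update):

```python
import numpy as np

n_state_dims, n_actions = 4, 2                 # assumed sizes, for illustration
gamma, lr = 0.99, 0.01                         # discount factor, learning rate
theta = np.zeros((n_state_dims, n_actions))    # linear Q: Q(s, a) = s @ theta[:, a]

def q_values(s):
    """Q(s, a) for every action a, given a state vector s of length n_state_dims."""
    return s @ theta

def train_step(batch):
    """One 'sample from D -> estimate -> update weights' step of the loop above."""
    for s, a, r, s_next, done in batch:
        # green underline: bootstrap target y_j = r + gamma * max_a' Q(s', a')
        y = r if done else r + gamma * np.max(q_values(s_next))
        # blue underline: squared error (y_j - Q(s_j, a_j))^2, minimised by SGD
        td_error = y - q_values(s)[a]
        theta[:, a] += lr * td_error * s       # semi-gradient step for the linear Q
```

In a real DQN the linear Q is replaced by a deep network and the batch is sampled uniformly from the replay memory, but the loop has exactly this shape.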

So, why do you think it's not reasonable?

bcpenggh commented 6 years ago

Hi @BIGBALLON,

As far as I know, the sum of future rewards is based on a sequence of states and actions like this:

s_t --> a_t --> s_t+1 --> a_t+1 --> s_t+2 --> a_t+2 --> ...

If the transitions are randomly chosen from D, do the chosen states and actions still form a sequence? For example, let's say we are at state s_t and we chose action a_t. It seems we cannot guarantee that a randomly chosen s_t+1 is exactly the result of the chosen a_t. Do I misunderstand something?

BIGBALLON commented 6 years ago

hello, @bcpenggh

If the transitions are randomly chosen from D, do the chosen states and actions still form a sequence?

No, they do not need to form a sequence. Each transition (s_j, a_j, r_j, s_{j+1}) is stored in D as one tuple, so inside a sampled transition s_{j+1} is still the result of a_j. We only use that single step to estimate Q and compute the target; we do not care about the order of the transitions in the original sequence.
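A toy check of that point (made-up environment and numbers, only to illustrate the pairing):

```python
import random

# Pretend trajectory on a 1-D line: at each step, move left or right.
trajectory = []
s = 0
for t in range(10):
    a = random.choice([-1, +1])              # action taken in state s
    s_next = s + a                           # state the environment actually returns
    r = 1.0 if s_next == 5 else 0.0          # toy reward
    trajectory.append((s, a, r, s_next))     # stored together as ONE tuple in D
    s = s_next

s_j, a_j, r_j, s_j_next = random.choice(trajectory)  # random replay sample
assert s_j_next == s_j + a_j   # s_{j+1} is still exactly the result of a_j
```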

ghbcpeng commented 6 years ago

Hi @BIGBALLON, based on the definition of the Q value, it is the weighted (discounted) sum of all the rewards until the end of the episode. And Q* is the optimal Q we can get by always choosing the action with the maximum reward in each state. So I think it is a deterministic process given the current environment, as long as it satisfies the Markov property. Also, the Q function can be defined recursively, that is, the current Q depends on the next Q. Anyway, it seems the sequence matters. However, "experience replay" randomly chooses past transitions, so the sequence no longer matters. I cannot figure out how to approximate the recursive Q using "experience replay".

BIGBALLON commented 6 years ago

@bcpenggh @ghbcpeng sorry for the late reply,

The Q function can be defined recursively

Of course. We talked about MDPs in class, and then about using dynamic programming to solve them; we can solve them recursively like traditional DP problems. But now we are talking about Q-learning. The full name of Q-learning is one-step Q-learning, which means we only consider the current state and the next state, so we cannot, and do not need to, recurse any further (that is what you said, "the current Q depends on the next Q", but we only consider one step). Additionally, there is also N-step Q-learning, but it still considers only N steps, not a recursion all the way to the terminal state.
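Written out, the two targets being compared are the standard one-step and N-step Q-learning targets:

```latex
% one-step Q-learning target: bootstrap after a single real reward
y_j^{(1)} = r_j + \gamma \max_{a'} Q(s_{j+1}, a'; \theta)

% N-step Q-learning target: N real rewards, then bootstrap
y_j^{(N)} = r_j + \gamma r_{j+1} + \dots + \gamma^{N-1} r_{j+N-1}
            + \gamma^{N} \max_{a'} Q(s_{j+N}, a'; \theta)
```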

So "experience replay" is reasonable for Q-learning(here is DQN). Consider the current state(Q_s) & the next state(r+Q_s') Then calculate loss. update by SGD.

See q-learning-tutorial or q_learning_demo.cpp for more details if you have time.

thanks.

bcpenggh commented 6 years ago

Hi @BIGBALLON, let me repeat it in my own words. We can define Q recursively (N-step Q-learning), but it is also reasonable to define it with single steps (one-step Q-learning). So they are different approaches, and both of them work. One-step Q is not an approximation of N-step Q. Actually, one-step Q-learning (with experience replay) is preferred in references [1] and [2] because it is easier to converge. Do I understand correctly?

BIGBALLON commented 6 years ago

Let me repeat it in my own words. We can define Q recursively (N-step Q-learning), but it is also reasonable to define it with single steps (one-step Q-learning). So they are different approaches, and both of them work. One-step Q is not an approximation of N-step Q.

Yes, I think this part is correct.

Actually, one-step Q-learning (with experience replay) is preferred in references [1] and [2] because it is easier to converge.

Yes, maybe one-step Q-learning is easier to converge. But Q-learning is actually NOT the best method. In other words, Q-learning is a value-based RL approach, and there are other RL approaches (e.g. policy-based, Actor-Critic, etc.). You can see the introduction of one-step Q-learning, N-step Q-learning, and A3C in Asynchronous Methods for Deep Reinforcement Learning.

Anyway, your description is correct.

bcpenggh commented 6 years ago

Thanks for your help :+1: