GuessWhatGame / guesswhat

GuessWhat?! Baselines

I want pure REINFORCE based code. #22

Closed yellowjs0304 closed 6 years ago

yellowjs0304 commented 6 years ago

Hi, I would like to experiment with my QGen training model on your code. When I read your code "train_qgen_reinforce.py" and the report, I realized you use a baseline / Q function to reduce the variance. It seems to be used to improve performance.

But this doesn't look like the pure REINFORCE algorithm. I would like to experiment with my model using a pure REINFORCE-based model.

If you have it, could you give me pure REINFORCE-based code? (no baseline, no Q function)

Thank you.

harm-devries commented 6 years ago

Hi,

We indeed use a state-value baseline (trained with mean squared error) to reduce the variance. If you want pure REINFORCE, you can simply comment out the parts of the code that calculate the baseline. You will need to remove the baseline subtraction in this line, and probably also comment out the other parts that involve baseline calculations.
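Schematically, the change looks like this (a minimal sketch, not the repository's actual code; names such as `log_probs`, `returns`, and `baseline_values` are placeholders):

```python
import numpy as np

def reinforce_loss(log_probs, returns, baseline_values=None):
    """Monte-Carlo policy-gradient loss for one episode.

    log_probs       -- log pi(a_t | s_t) for the sampled actions, shape (T,)
    returns         -- Monte-Carlo return from each step, shape (T,)
    baseline_values -- optional baseline b(s_t), shape (T,); None = pure REINFORCE
    """
    advantages = returns.copy()
    if baseline_values is not None:
        # This is the subtraction you would comment out for pure REINFORCE.
        advantages = advantages - baseline_values
    # Maximise E[log pi * (return - baseline)]  <=>  minimise the negative.
    return -np.sum(log_probs * advantages)

# Illustrative values: a reward of 1 at the end of the dialogue, gamma = 1.
log_probs = np.log(np.array([0.4, 0.6, 0.5]))
returns = np.array([1.0, 1.0, 1.0])
baseline = np.array([0.5, 0.5, 0.5])

pure_loss = reinforce_loss(log_probs, returns)                 # no baseline, no Q function
baselined_loss = reinforce_loss(log_probs, returns, baseline)  # variance-reduced variant
```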

yellowjs0304 commented 6 years ago

Thank you for the reply.

fstrub95 commented 6 years ago

As a side note, the code is pure REINFORCE (even with the baseline).

The policy gradient theorem is the following:

sum_t sum_a pi(a|s_t) grad(log pi(a|s_t)) (Q(s_t, a) - b(s_t))

The REINFORCE estimator is the following (note that the sum over actions is replaced by the sampled action a_t):

sum_t grad(log pi(a_t|s_t)) (Q(s_t, a_t) - b(s_t))

where Q is evaluated with Monte Carlo methods. Besides, REINFORCE is independent of the way you evaluate b!
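Spelled out a bit more explicitly (a sketch of the standard derivation with notation added here, not taken from the report):

```latex
% Policy gradient theorem: an expectation (sum) over all actions under \pi
\nabla_\theta J(\theta) \;\propto\; \sum_t \sum_a
    \pi_\theta(a \mid s_t)\, \nabla_\theta \log \pi_\theta(a \mid s_t)\,
    \big( Q(s_t, a) - b(s_t) \big)

% REINFORCE: the sum over actions is replaced by the sampled action
% a_t \sim \pi_\theta, and Q by the Monte-Carlo return G_t
\widehat{\nabla_\theta J}(\theta) \;=\; \sum_t
    \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, \big( G_t - b(s_t) \big)
```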

I agree that if you start evaluating b as a value function (or advantage function), then it is not exactly REINFORCE. From an RL point of view, b is a value function if and only if it is the expected cumulative reward. In our case, the baseline is computed with a mean squared error loss, not with a TD error, so there is no guarantee that the baseline equals the expected reward of your policy (gamma=0). It therefore cannot be considered a value function, and thus we are using REINFORCE ;)

The boundary between Monte Carlo methods and actor-critics can be thin (cf. A2C), but there are a few rules of thumb you can check. One of them is the following: do you use the TD error to update your estimators? If you do use a TD error, then it is an actor-critic; if you do not, it is unlikely to be an actor-critic. (You can have other cases such as the Bellman error, LSTD, etc., but that is another story!)
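As a concrete illustration of that rule of thumb (a schematic sketch with hypothetical names and a linear function approximator, not the repository's code):

```python
import numpy as np

def mc_baseline_update(w, features, returns, lr=0.01):
    """REINFORCE-style baseline: regress b(s_t) = w . phi(s_t) onto the
    Monte-Carlo return G_t with a mean-squared-error loss (no TD error)."""
    for phi, G in zip(features, returns):
        residual = G - w @ phi              # regression error against the full return
        w = w + lr * residual * phi
    return w

def td_critic_update(w, features, rewards, gamma=0.99, lr=0.01):
    """Actor-critic-style critic: bootstrap from the next state's value.
    The TD error r_t + gamma * V(s_{t+1}) - V(s_t) drives the update
    (terminal-state handling omitted for brevity)."""
    for t in range(len(rewards) - 1):
        td_error = rewards[t] + gamma * (w @ features[t + 1]) - w @ features[t]
        w = w + lr * td_error * features[t]
    return w
```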