Hello Michael, thanks for looking into this.
My main motivation here was that the entire message is a single action taken by the Sender, and it is the entire message that is assigned a reward. I can, however, imagine situations where it might make sense to give partial rewards. Is that what you had in mind?
If so, I am also not sure about averaging in the log-space. Could you point to the page/equation of Sutton & Barto you had in mind?
Hey @mnoukhov, are there any updates on this?
Closing this. Feel free to reopen if needed.
This change adds the correct scaling factor to the REINFORCE estimate, aligning the loss of the sender with the loss of the receiver.
Description
The REINFORCE estimator is based on the policy gradient theorem, which in its episodic form (Sutton & Barto, 2018, Ch. 13) states that
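(written here in the book's standard notation, with $\mu$ the on-policy state distribution and $q_\pi$ the action-value function)

$$
\nabla J(\theta) \;\propto\; \sum_{s} \mu(s) \sum_{a} q_\pi(s, a)\, \nabla \pi(a \mid s, \theta)
$$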
From there, a Monte Carlo estimate gives the REINFORCE estimator:
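Sampling the action $A_t \sim \pi$ and using the return $G_t$ in place of $q_\pi$:

$$
\nabla J(\theta) \;\propto\; \mathbb{E}_\pi\!\left[\, G_t \, \nabla \ln \pi(A_t \mid S_t, \theta) \,\right]
$$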
What is important to note, however, is the scale of the proportionality in the policy gradient equation: it is equivalent to the length of the episode for a finite-length MDP (Sutton and Barto, 2018). In the case of emergent communication, the length of the episode corresponds to the length of the message. Since we divide the receiver's loss by the output length (by using `.mean()`), we should also divide the sender's loss by its message length so that the gradients are on the same scale (see the sketch under Motivation and Context below).

Related Issue (if any)
Motivation and Context
From a theoretical perspective, this should align the magnitude of the gradients for the sender and the receiver when using variable-length messages and a REINFORCE estimator.
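As a rough, illustrative sketch of the proposed normalization (hypothetical function and argument names, plain PyTorch rather than the toolkit's actual interfaces), the sender's loss could look like this:

```python
import torch  # inputs below are assumed to be torch.Tensors


def sender_reinforce_loss(log_probs, mask, message_lengths, advantage):
    """Length-normalized REINFORCE loss for the sender (illustrative sketch only).

    log_probs:       (batch, max_len) log-probabilities of the sampled symbols
    mask:            (batch, max_len) 1.0 for real symbols, 0.0 for padding after EOS
    message_lengths: (batch,) number of non-padded symbols in each message
    advantage:       (batch,) reward minus baseline for each whole message
    """
    # Sum the per-symbol log-probabilities over the sampled message ...
    summed_log_probs = (log_probs * mask).sum(dim=1)
    # ... then divide by the message length, mirroring the receiver's .mean(),
    # so both agents' gradients end up on the same scale.
    normalized_log_probs = summed_log_probs / message_lengths.clamp(min=1).float()
    # REINFORCE: minimize -advantage * log-prob; the advantage must not carry gradient.
    return -(advantage.detach() * normalized_log_probs).mean()
```

Without the division by `message_lengths`, a longer message contributes a proportionally larger REINFORCE term while the receiver's loss stays averaged, which is the mismatch this change is meant to remove.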
How Has This Been Tested?
I haven't run any tests yet. I would be interested in hearing suggestions about which zoo/paper experiments I should replicate with the correction to see if I get different/better results.