Hello Michael, thanks for looking into this.
My main motivation here was that the entire message is a single action taken by the Sender, and it is the entire message that is assigned a reward. I can, however, imagine situations where it might make sense to give partial rewards. Is that what you had in mind?
If so, I am also not sure about averaging in the log-space. Could you point to the page/equation of Sutton & Barto you had in mind?
Hey @mnoukhov, are there any updates on this?
Closing this. Feel free to reopen if needed.
This change adds the correct scaling factor to the REINFORCE estimate, aligning the loss of the sender with the loss of the receiver.
Description
The REINFORCE estimator is based on the policy gradient theorem, which in its episodic form (Sutton & Barto, 2018, Ch. 13) states that
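(written here in the book's standard notation, with $\mu$ the on-policy state distribution and $q_\pi$ the action-value function)

$$
\nabla J(\theta) \;\propto\; \sum_{s} \mu(s) \sum_{a} q_\pi(s, a)\, \nabla \pi(a \mid s, \theta)
$$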
From there, a Monte Carlo estimate gives the REINFORCE estimator:
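Sampling the action $A_t \sim \pi$ and using the return $G_t$ in place of $q_\pi$:

$$
\nabla J(\theta) \;\propto\; \mathbb{E}_\pi\!\left[\, G_t \, \nabla \ln \pi(A_t \mid S_t, \theta) \,\right]
$$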
What is important to note, however, is the scale of the proportionality in the policy gradient equation: it is equivalent to the length of the episode for a finite-length MDP (Sutton and Barto, 2018). In the case of emergent communication, the length of the episode corresponds to the length of the message. Since we divide the receiver's loss by the output length (by using `.mean()`), we should also divide the sender's loss by its message length so that the gradients are on the same scale (see the sketch under Motivation and Context below).

Related Issue (if any)
Motivation and Context
From a theoretical perspective, this should align the magnitude of the gradients for the sender and the receiver when using variable-length messages and a REINFORCE estimator.
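As a rough, illustrative sketch of the proposed normalization (hypothetical function and argument names, plain PyTorch rather than the toolkit's actual interfaces), the sender's loss could look like this:

```python
import torch  # inputs below are assumed to be torch.Tensors


def sender_reinforce_loss(log_probs, mask, message_lengths, advantage):
    """Length-normalized REINFORCE loss for the sender (illustrative sketch only).

    log_probs:       (batch, max_len) log-probabilities of the sampled symbols
    mask:            (batch, max_len) 1.0 for real symbols, 0.0 for padding after EOS
    message_lengths: (batch,) number of non-padded symbols in each message
    advantage:       (batch,) reward minus baseline for each whole message
    """
    # Sum the per-symbol log-probabilities over the sampled message ...
    summed_log_probs = (log_probs * mask).sum(dim=1)
    # ... then divide by the message length, mirroring the receiver's .mean(),
    # so both agents' gradients end up on the same scale.
    normalized_log_probs = summed_log_probs / message_lengths.clamp(min=1).float()
    # REINFORCE: minimize -advantage * log-prob; the advantage must not carry gradient.
    return -(advantage.detach() * normalized_log_probs).mean()
```

Without the division by `message_lengths`, a longer message contributes a proportionally larger REINFORCE term while the receiver's loss stays averaged, which is the mismatch this change is meant to remove.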
How Has This Been Tested?
I haven't run any tests yet. I would be interested in hearing suggestions about which zoo/paper experiments I should replicate with the correction to see if I get different/better results.