Implementation of some expectation is different from paper

Hi, you are right, there is a small difference between the paper derivation and the actual implementation to compute A.

In the paper, the advantage A(sg | s, g) = E_{sg_hat ~ piH(.|s, g)}[C(sg_hat | s, g)] - C(sg | s, g) is the difference of two terms, the first being an expectation with respect to piH(.|s, g). In principle, we should approximate this expectation by averaging C(sg_hat | s, g) over multiple subgoals sg_hat sampled from piH(.|s, g), which can be computationally expensive depending on the number of samples.
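For reference, the sampling-based approximation described here can be written as a Monte Carlo estimate. The sample count N and the hat on A are notation introduced only for this illustration; they are not symbols from the paper.

```latex
\hat{A}(s_g \mid s, g)
  = \frac{1}{N} \sum_{i=1}^{N} C\big(\hat{s}_g^{(i)} \mid s, g\big) - C(s_g \mid s, g),
  \qquad \hat{s}_g^{(i)} \sim \pi^H(\cdot \mid s, g)
```

Each update would then need N evaluations of C for the first term, which is where the extra cost comes from.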

In practice, we found that evaluating C at the mean of piH(.|s, g), rather than at sampled subgoals sg_hat, was simpler, faster, and more stable than a sampling-based approximation, although I haven’t looked into this in depth.
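To make the difference concrete, here is a minimal sketch of the two ways to compute the first term of A. It is not the code from the repository: the 1-D Gaussian policy, the toy `cost` function, and all function names are assumptions chosen only for illustration.

```python
import random


def cost(subgoal, state, goal):
    # Hypothetical stand-in for C(sg | s, g) on a 1-D toy problem:
    # the longer of the two legs state -> subgoal and subgoal -> goal.
    return max(abs(subgoal - state), abs(goal - subgoal))


def expected_cost_sampled(mean, std, state, goal, n_samples, rng):
    # Monte Carlo estimate of E[C(sg_hat | s, g)], assuming a Gaussian piH.
    samples = [rng.gauss(mean, std) for _ in range(n_samples)]
    return sum(cost(sg, state, goal) for sg in samples) / n_samples


def expected_cost_at_mean(mean, state, goal):
    # Plug-in shortcut: evaluate C once, at the mean of piH.
    return cost(mean, state, goal)


def advantage(baseline, subgoal, state, goal):
    # A(sg | s, g) = baseline - C(sg | s, g), where baseline is either estimate.
    return baseline - cost(subgoal, state, goal)


rng = random.Random(0)
state, goal, mean, std = 0.0, 1.0, 0.5, 0.1

sampled = expected_cost_sampled(mean, std, state, goal, 1000, rng)
plug_in = expected_cost_at_mean(mean, state, goal)

print("sampled baseline:", sampled)
print("mean baseline:   ", plug_in)
print("advantage of sg=0.4 (mean baseline):", advantage(plug_in, 0.4, state, goal))
```

The plug-in version needs one evaluation of C instead of `n_samples`, and it has no sampling noise. The two baselines are not equal in general when C is nonlinear in the subgoal, so this is an approximation of the expectation rather than an exact replacement.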

elliotchanesane31 / RIS
