It is clear that the controller falls into a local optimum when REINFORCE cannot find better actions. I think the unknown constant `c` in `c/valid ppl`, the moving average baseline, and the temperature of the logits are what need to be fixed. See more details (especially the `TODO`s) in 497c2e717dc0087fea52d4f196d30543e4fb7512.
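To make the three knobs concrete, here is a minimal, standard-library-only sketch of where each one enters a REINFORCE update. It is not the code from the commit above. All names (`reward_from_ppl`, `MovingAverageBaseline`, `softmax_with_temperature`, `reinforce_loss`) and all default values (`c=1.0`, `decay=0.95`, `temperature=1.0`) are hypothetical placeholders chosen for illustration only.

```python
import math


def reward_from_ppl(valid_ppl, c=1.0):
    """Reward of the form c / valid_ppl. The right value of c is unknown here;
    the default is a placeholder, not a recommendation."""
    return c / valid_ppl


class MovingAverageBaseline:
    """Exponential moving average of past rewards, used to reduce the
    variance of the REINFORCE gradient estimate."""

    def __init__(self, decay=0.95):  # decay is an assumed placeholder
        self.decay = decay
        self.value = None

    def update(self, reward):
        if self.value is None:
            # Initialise with the first observed reward.
            self.value = reward
        else:
            self.value = self.decay * self.value + (1.0 - self.decay) * reward
        return self.value


def softmax_with_temperature(logits, temperature=1.0):
    """Softmax over logits / temperature. A temperature above 1 flattens the
    distribution, so the controller samples more diverse actions."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_loss(log_prob, reward, baseline):
    """Per-sample REINFORCE surrogate loss: -log pi(a) * (R - b)."""
    return -log_prob * (reward - baseline)
```

If the reward scale set by `c` is too small relative to the baseline, or the logits are too sharp, the advantage `reward - baseline` stays near zero for the actions the controller already prefers, which is one plausible way to end up stuck in a local optimum.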