No. I couldn't reproduce the results of the paper. As far as I experimented, training of ENAS was very unstable with this code, and I haven't figured out the problem yet. Below are the things I'm not sure about:
config.py (marked with TODO)
I can comment on 5: the loss of REINFORCE is not always negative. The total loss, however, is almost always negative, because the negative of the policy entropy is added to the total loss (in order to maximize the entropy of the policy's logits), and the entropy is always positive. I also have a fix for the entropy calculation in my fork.
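For concreteness, here is a minimal sketch of what I mean (illustrative names, not the repo's exact code): the policy-gradient term can have either sign depending on the advantage, but subtracting the (always non-negative) entropy pulls the total loss downward:

```python
import torch
import torch.nn.functional as F

def controller_loss(logits, actions, reward, baseline, entropy_weight=1e-4):
    # logits: (num_steps, num_choices) raw controller outputs
    # actions: (num_steps,) sampled architecture decisions (long tensor)
    log_probs = F.log_softmax(logits, dim=-1)
    probs = log_probs.exp()

    selected_log_prob = log_probs.gather(1, actions.unsqueeze(1)).sum()

    # REINFORCE term: its sign depends on (reward - baseline),
    # so it is NOT always negative.
    advantage = reward - baseline
    pg_loss = -selected_log_prob * advantage

    # Entropy is always >= 0; adding its negative (to encourage exploration
    # by maximizing entropy) is what tends to push the total loss below zero.
    entropy = -(probs * log_probs).sum()
    total_loss = pg_loss - entropy_weight * entropy
    return total_loss, pg_loss, entropy
```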
@carpedm20 As far as I know, E[Reward(m, omega)] should be computed as an actual expectation, i.e. you are supposed to sample several models and average their rewards at each step of controller training. But your code samples only one model when computing Reward(). (I'm not quite sure about this.)
As the author said, while training the child model, M=1 works fine to estimate E[Loss(m, omega)]. But "we needed at least M=10 to training the policy π". You can find this sentence at https://openreview.net/forum?id=ByQZjx-0-&noteId=BkrqNswgf; it's nearly the last sentence.
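To make the idea concrete, here is a rough sketch (controller.sample() and get_reward() are placeholder names I'm assuming, not this repo's actual API) of averaging the reward over M sampled architectures for a single controller update:

```python
import torch

def controller_step(controller, get_reward, optimizer, M=10, baseline=None, bl_decay=0.95):
    optimizer.zero_grad()
    log_probs, rewards = [], []
    for _ in range(M):
        arch, log_prob = controller.sample()   # assumed: (architecture, sum of log-probs)
        log_probs.append(log_prob)
        rewards.append(get_reward(arch))       # e.g. validation accuracy of the child model

    rewards = torch.tensor(rewards)

    # Exponential moving-average baseline to reduce variance.
    batch_mean = rewards.mean().item()
    baseline = batch_mean if baseline is None else bl_decay * baseline + (1 - bl_decay) * batch_mean

    # Monte Carlo estimate of the policy gradient, averaged over the M samples
    # instead of relying on a single sampled model.
    loss = -(torch.stack(log_probs) * (rewards - baseline)).mean()
    loss.backward()
    optimizer.step()
    return baseline
```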
@Howal Thanks for pointing this out. I did think it was weird to update the policy network with only one sample... this seems like an important issue, and fixing it should improve the stability of REINFORCE training.
Hello @carpedm20,
Thanks a lot for this nice implementation of the ENAS paper. Did you manage to reproduce their results by retraining the model from scratch?
Thanks, Best