No. I couldn't reproduce the results of the paper. As far as I experimented, training of ENAS was very unstable with this code, and I haven't figured out the problem yet. Below are the things I'm not sure about:
config.py (marked with TODO)
I can comment on 5: the loss of REINFORCE is not always negative. The total loss, however, is almost always negative, because the negative of the policy entropy is added to the total loss (in order to maximize the entropy of the policy's logits), and the entropy is always positive. I also have a fix for the entropy calculation in my fork.
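For concreteness, here is a minimal sketch of what I mean (illustrative names, not the repo's exact code): the policy-gradient term can have either sign depending on the advantage, but subtracting the (always non-negative) entropy pulls the total loss downward:

```python
import torch
import torch.nn.functional as F

def controller_loss(logits, actions, reward, baseline, entropy_weight=1e-4):
    # logits: (num_steps, num_choices) raw controller outputs
    # actions: (num_steps,) sampled architecture decisions (long tensor)
    log_probs = F.log_softmax(logits, dim=-1)
    probs = log_probs.exp()

    selected_log_prob = log_probs.gather(1, actions.unsqueeze(1)).sum()

    # REINFORCE term: its sign depends on (reward - baseline),
    # so it is NOT always negative.
    advantage = reward - baseline
    pg_loss = -selected_log_prob * advantage

    # Entropy is always >= 0; adding its negative (to encourage exploration
    # by maximizing entropy) is what tends to push the total loss below zero.
    entropy = -(probs * log_probs).sum()
    total_loss = pg_loss - entropy_weight * entropy
    return total_loss, pg_loss, entropy
```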
@carpedm20 As far as I know, E[Reward(m, omega)] should be computed as an actual expectation, i.e. you are supposed to sample several models and average their rewards at each step of controller training. But your code samples only one model when computing Reward(). (I'm not quite sure about this.)
As the author said, while training the child model, M=1 works fine to estimate E[Loss(m, omega)]. But "we needed at least M=10 to training the policy π". You can find this sentence at https://openreview.net/forum?id=ByQZjx-0-&noteId=BkrqNswgf; it's nearly the last sentence.
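To make the idea concrete, here is a rough sketch (controller.sample() and get_reward() are placeholder names I'm assuming, not this repo's actual API) of averaging the reward over M sampled architectures for a single controller update:

```python
import torch

def controller_step(controller, get_reward, optimizer, M=10, baseline=None, bl_decay=0.95):
    optimizer.zero_grad()
    log_probs, rewards = [], []
    for _ in range(M):
        arch, log_prob = controller.sample()   # assumed: (architecture, sum of log-probs)
        log_probs.append(log_prob)
        rewards.append(get_reward(arch))       # e.g. validation accuracy of the child model

    rewards = torch.tensor(rewards)

    # Exponential moving-average baseline to reduce variance.
    batch_mean = rewards.mean().item()
    baseline = batch_mean if baseline is None else bl_decay * baseline + (1 - bl_decay) * batch_mean

    # Monte Carlo estimate of the policy gradient, averaged over the M samples
    # instead of relying on a single sampled model.
    loss = -(torch.stack(log_probs) * (rewards - baseline)).mean()
    loss.backward()
    optimizer.step()
    return baseline
```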
@Howal Thanks for pointing this out. I did think it was weird to update the policy network with only one sample... this seems like an important issue, and fixing it should improve the stability of REINFORCE training.
Hello @carpedm20,
Thanks a lot for this nice implementation of the ENAS paper. Did you manage to reproduce their results by retraining the model from scratch?
Thanks, Best