carpedm20 / ENAS-pytorch

PyTorch implementation of "Efficient Neural Architecture Search via Parameters Sharing"

Question about perplexity when testing the RNN cell from ENAS #47

Open shuotian17 opened 5 years ago

shuotian17 commented 5 years ago

Hello, Kim. I have been testing the RNN architecture shown in Figure 6 of the paper. However, the perplexity I get is about 84 at around epoch 41, which does not match the 55.8 reported in Table 1, Section 3.2 of the ENAS paper. The details of my test experiment are as follows. In the code, I use the "single" mode in config.py to train the architecture from Figure 6 of the paper. The DAG used is {-1: [Node(id=0, name='tanh')], -2: [Node(id=0, name='tanh')], 0: [Node(id=1, name='tanh')], 1: [Node(id=2, name='ReLU'), Node(id=3, name='tanh')], 2: [Node(id=4, name='ReLU'), Node(id=5, name='tanh'), Node(id=6, name='tanh')], 6: [Node(id=7, name='ReLU')], 7: [Node(id=8, name='ReLU')], 8: [Node(id=9, name='ReLU'), Node(id=10, name='ReLU'), Node(id=11, name='ReLU')], 3: [Node(id=12, name='avg')], 4: [Node(id=12, name='avg')], 5: [Node(id=12, name='avg')], 9: [Node(id=12, name='avg')], 10: [Node(id=12, name='avg')], 11: [Node(id=12, name='avg')], 12: [Node(id=13, name='h[t]')]}.
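
For concreteness, below is roughly how I write that DAG down in Python. Node here is a simple (id, name) namedtuple, which I believe matches the one in utils.py of this repo, and I take -1 and -2 to be the two cell inputs x[t] and h[t-1]; if the repo's actual definitions differ, this is just my own transcription.

```python
import collections

# (id, name) namedtuple; substitute the definition from utils.py if it differs.
Node = collections.namedtuple('Node', ['id', 'name'])

# Fixed cell from Figure 6 of the ENAS paper, written as {parent: [children]}.
# -1 and -2 are the cell inputs (x[t] and h[t-1]), node 12 averages the
# loose ends, and node 13 is the cell output h[t].
dag = {
    -1: [Node(0, 'tanh')],
    -2: [Node(0, 'tanh')],
     0: [Node(1, 'tanh')],
     1: [Node(2, 'ReLU'), Node(3, 'tanh')],
     2: [Node(4, 'ReLU'), Node(5, 'tanh'), Node(6, 'tanh')],
     6: [Node(7, 'ReLU')],
     7: [Node(8, 'ReLU')],
     8: [Node(9, 'ReLU'), Node(10, 'ReLU'), Node(11, 'ReLU')],
     3: [Node(12, 'avg')],
     4: [Node(12, 'avg')],
     5: [Node(12, 'avg')],
     9: [Node(12, 'avg')],
    10: [Node(12, 'avg')],
    11: [Node(12, 'avg')],
    12: [Node(13, 'h[t]')],
}
```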

The data set used is PTB. For the Penn Treebank experiments, ω is trained for about 400 steps, each on a minibatch of 64 examples, where the gradient ∇ω is computed using back-propagation through time, truncated at 35 time steps. I evaluate perplexity over the entire validation set (batch size = 1).
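
My validation pass is a plain full-sequence perplexity computation; a minimal sketch of what I do is below. The model interface model(inputs, hidden) -> (logits, hidden) is my own simplification and may not be the exact signature of the shared RNN in this repo.

```python
import math
import torch
import torch.nn.functional as F

def evaluate_ppl(model, data, bptt=35):
    """Perplexity over one long token sequence (batch size 1).
    `data` is a 1-D LongTensor of token ids; the model is assumed to
    return (logits, hidden), which is my own simplification."""
    model.eval()
    total_loss, total_tokens = 0.0, 0
    hidden = None
    with torch.no_grad():
        for i in range(0, data.size(0) - 1, bptt):
            seq_len = min(bptt, data.size(0) - 1 - i)
            inputs = data[i:i + seq_len].unsqueeze(1)            # (seq_len, 1)
            targets = data[i + 1:i + 1 + seq_len].unsqueeze(1)   # next tokens
            logits, hidden = model(inputs, hidden)
            loss = F.cross_entropy(logits.view(-1, logits.size(-1)),
                                   targets.view(-1), reduction='sum')
            total_loss += loss.item()
            total_tokens += targets.numel()
    return math.exp(total_loss / total_tokens)
```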

The weights were trained with SGD, using an initial learning rate of 20 that is decayed by a factor of 0.96 after epoch 15. A total of 150 epochs were run. The hidden size is 1000, the embedding size is 1000, and the number of activation function blocks is 12, so the total number of parameters is (1000 + 1000) × 1000 × 12 = 24M.
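
As a sanity check on the schedule and the model size, this is roughly what my optimizer setup looks like. The model below is a placeholder, and decaying by 0.96 every epoch after epoch 15 is my own reading of the schedule.

```python
import torch

# Placeholder model; in my runs this is the shared RNN built from the DAG above.
model = torch.nn.Linear(10, 10)
optimizer = torch.optim.SGD(model.parameters(), lr=20.0)

for epoch in range(150):
    # ... one epoch of training with truncated BPTT (35 steps) goes here ...
    if epoch >= 15:
        for group in optimizer.param_groups:
            group['lr'] *= 0.96  # decay by 0.96 each epoch after epoch 15

# Rough parameter count of the recurrent cell:
# (embedding 1000 + hidden 1000) * hidden 1000 * 12 blocks = 24,000,000 ≈ 24M
```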

As for regularization techniques, dropout = 0.5, and I set activation_regularization, temporal_activation_regularization, and temporal_activation_regularization_amount to True in config.py to enable the weight-penalty techniques in the code. Weight tying is also used in the code. Additionally, I augment the simple transformations between nodes in the constructed recurrent cell with highway connections (Zilly et al., 2017).
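
In case it matters, this is how I understand the activation penalties those flags turn on: an activation regularization (AR) term and a temporal activation regularization (TAR) term added to the language-model loss. The sketch below is my own, and the alpha/beta weights are placeholders rather than the repo's defaults.

```python
import torch

def regularized_loss(nll, hiddens, alpha=2.0, beta=1.0):
    """Add activation regularization (AR) and temporal activation
    regularization (TAR) to the base loss. `hiddens` is the
    (seq_len, batch, hidden_size) tensor of RNN outputs; alpha and beta
    are placeholder weights, not the values used in config.py."""
    ar = alpha * hiddens.pow(2).mean()                        # penalize large activations
    tar = beta * (hiddens[1:] - hiddens[:-1]).pow(2).mean()   # penalize fast temporal changes
    return nll + ar + tar
```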

Could you please tell me whether I have done something wrong with the learning rate or any other configuration when testing the DAG from Figure 6 of the paper? I am looking forward to your reply.