carpedm20 / ENAS-pytorch

PyTorch implementation of "Efficient Neural Architecture Search via Parameters Sharing"
Apache License 2.0

Question about INF network parameters #6

Closed NewGod closed 6 years ago

NewGod commented 6 years ago

When I was running your code, I found that some network parameters go to INF. Is there any suggestion to solve this problem? Thanks. By the way, I am curious about the test ppl after training.

bkj commented 6 years ago

Try reducing the learning rate -- that might help with the inf problem.

dukebw commented 6 years ago

The reason is that for the RNN cell, with the default settings, the controller learns to put a squashing non-linearity (in this case, always tanh) in the path to node 11, which outputs the hidden state for timestep t, i.e., h^{(t)}.

After the shared model has spent a while learning weights that work well when the hidden-to-hidden path has a squashing non-linearity, the controller will randomly decide to make a cell that has no squashing non-linearity, e.g., this cell: https://ibb.co/cQmeA7. Note that the network drawing in this repo is out of sync with the code: only node 11's output becomes h^{(t)}.

By this point, the weights of the nodes that were trained with tanh have likely grown such that, when there is no squashing, they expand their input; over the 35-step sequence the activations explode on the forward pass, causing NaNs.
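To see the mechanism in isolation, here is a minimal toy sketch (not the repo's shared cell; the weight scale and hidden size are illustrative assumptions): iterating a hidden state for 35 steps through a hidden-to-hidden map whose gain is above 1 stays bounded with tanh, but blows up when the squashing non-linearity is removed.

```python
import torch

# Toy iteration (not the repo's shared cell): 35 recurrent steps through a
# fixed hidden-to-hidden matrix whose gain is above 1, as can happen once the
# weights were trained assuming a tanh on the h^{(t)} path.
torch.manual_seed(0)
hidden_size, seq_len = 64, 35
w_hh = torch.randn(hidden_size, hidden_size) * 0.3  # gain well above 1

h_tanh = torch.randn(hidden_size)
h_relu = h_tanh.clone()
for _ in range(seq_len):
    h_tanh = torch.tanh(w_hh @ h_tanh)  # squashing keeps ||h|| bounded
    h_relu = torch.relu(w_hh @ h_relu)  # no squashing: ||h|| grows every step

print(f"tanh path ||h||: {h_tanh.norm().item():.2e}")  # stays O(1)
print(f"relu path ||h||: {h_relu.norm().item():.2e}")  # explodes after 35 steps
```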

I fixed the issue by clipping the norm of h^{(t)} in my fork of the repo here: https://github.com/dukebw/ENAS-pytorch. I tried a number of other "soft" solutions such as regularizing activations, hidden-state norm stabilization, etc., but the same problem would randomly crop up anyway.
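For anyone who wants the gist without reading the fork, a minimal sketch of clipping the hidden-state norm could look like the following (the `clip_hidden_norm` helper and the `max_norm` value are illustrative assumptions, not the fork's exact code):

```python
import torch

def clip_hidden_norm(h, max_norm=25.0):
    """Rescale h^{(t)} whenever its per-example L2 norm exceeds max_norm.

    Illustrative helper (the threshold is an assumption, not the fork's
    setting); applied to the cell output at each timestep, it keeps the
    forward pass finite even when the sampled cell has no squashing
    non-linearity on the h^{(t)} path.
    """
    norm = h.norm(dim=-1, keepdim=True)
    scale = (max_norm / (norm + 1e-6)).clamp(max=1.0)
    return h * scale

# Sketch of how it would slot into the unrolled 35-step loop:
# for step in range(seq_len):
#     logits, h = shared_cell(inputs[step], h, dag)
#     h = clip_hidden_norm(h)
```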