Figure out why simple model does not learn while using hidden layers

It seems the problem is related to problematic learning of RNNs. One idea that comes to my mind is that using such large network with multiple hidden RNN steps might cause gradient vanishing and be problematic in case of learning overall. Also, we have figured out that when we decreased the number of hidden time steps and increased the learning rate the simple model starts learning which might be the proof of the correct implementation (the model is just not good enough to learn).

On the other hand, this problem does not affect the complex network that much because there is a DNN after each RNN step applied which might increase the model performance in multiple hidden steps.

dbeinhauer / mcs-source

Figure out why simple model does not learn while using hidden layers #34