dbeinhauer opened 2 days ago
It seems the problem is related to difficulties in training the RNNs. One idea that comes to mind is that such a large network with multiple hidden RNN time steps might suffer from vanishing gradients, which would make learning difficult overall. We have also found that when we decrease the number of hidden time steps and increase the learning rate, the simple model starts learning, which suggests the implementation is correct (the model is just not expressive enough to learn in the harder setting).
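One way to check the vanishing-gradient hypothesis directly would be to log per-parameter gradient norms after the backward pass. Below is a minimal sketch assuming a PyTorch training loop; `model`, `loss_fn`, `inputs`, and `targets` are placeholders for whatever the project actually uses:

```python
import torch

def gradient_norms(model: torch.nn.Module) -> dict[str, float]:
    """Return the L2 norm of the gradient of every parameter that has one."""
    return {
        name: param.grad.norm().item()
        for name, param in model.named_parameters()
        if param.grad is not None
    }

# Usage inside a training step (assumed loop structure):
# loss = loss_fn(model(inputs), targets)
# loss.backward()
# for name, norm in gradient_norms(model).items():
#     # Norms close to zero in the earlier RNN steps would point to vanishing gradients.
#     print(f"{name}: {norm:.3e}")
```

If the norms shrink sharply as we move toward earlier time steps, that would support the vanishing-gradient explanation rather than a bug.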
On the other hand, this problem does not affect the complex network as much, because a DNN is applied after each RNN step, which might improve the model's performance across multiple hidden steps.
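For reference, here is a rough sketch of how I understand the two variants (the layer sizes and class names are illustrative, not the actual implementation). The "complex" model applies a small DNN to the hidden state after every RNN step, which may re-scale activations and ease gradient flow across steps:

```python
import torch
import torch.nn as nn

class SimpleRNN(nn.Module):
    def __init__(self, input_size: int, hidden_size: int, hidden_steps: int):
        super().__init__()
        self.cell = nn.RNNCell(input_size, hidden_size)
        self.hidden_steps = hidden_steps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = torch.zeros(x.size(0), self.cell.hidden_size, device=x.device)
        for _ in range(self.hidden_steps):  # repeated hidden steps on the same input
            h = self.cell(x, h)
        return h

class ComplexRNN(SimpleRNN):
    def __init__(self, input_size: int, hidden_size: int, hidden_steps: int):
        super().__init__(input_size, hidden_size, hidden_steps)
        self.dnn = nn.Sequential(  # small DNN applied after each RNN step
            nn.Linear(hidden_size, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, hidden_size),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = torch.zeros(x.size(0), self.cell.hidden_size, device=x.device)
        for _ in range(self.hidden_steps):
            h = self.dnn(self.cell(x, h))  # extra transform between steps
        return h
```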
When training the simple model with multiple hidden time steps, it does not learn. The task is to find out whether the problem comes from a poor hyperparameter selection or from an incorrect implementation of the model.
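A common sanity check that separates these two hypotheses is trying to overfit a single small batch: a correctly implemented model should drive the loss toward zero on one batch, regardless of how well it generalizes. Persistent failure here points at the implementation rather than the hyperparameters. A sketch under the same assumptions as above (`model`, `inputs`, `targets` are placeholders, and MSE is just an example loss):

```python
import torch

def can_overfit_single_batch(model, inputs, targets, lr=1e-2, steps=500):
    """Train on one fixed batch; a near-zero final loss suggests the
    implementation is sound and the issue lies in the hyperparameters."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = torch.nn.MSELoss()
    for _ in range(steps):
        optimizer.zero_grad()
        loss = loss_fn(model(inputs), targets)
        loss.backward()
        optimizer.step()
    return loss.item()
```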