Gradient becomes 0 in backpropagation for attention mechanism.
Consider:
[x] torch.clamp values while calculating softmax of attention coefficients (see the sketch after this list)
[x] Negative slope of LeakyReLU
[x] Initializer for W and a parameters
[x] Normalization
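A minimal sketch covering the checklist items above, assuming a GAT-style attention layer where `W` is the linear projection and `a` is the attention vector (only the names `W` and `a` come from the notes; the rest of the layer is illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionSketch(nn.Module):
    # Illustrative GAT-style attention; W and a follow the names in the checklist.
    def __init__(self, in_dim, out_dim, negative_slope=0.2):
        super().__init__()
        self.W = nn.Linear(in_dim, out_dim, bias=False)
        self.a = nn.Parameter(torch.empty(2 * out_dim, 1))
        # Checklist item: initializer for W and a (Xavier used here as one option).
        nn.init.xavier_uniform_(self.W.weight)
        nn.init.xavier_uniform_(self.a)
        # Checklist item: negative slope of LeakyReLU.
        self.leaky_relu = nn.LeakyReLU(negative_slope)

    def forward(self, h):
        # h: (batch, nodes, in_dim)
        Wh = self.W(h)                                        # (batch, nodes, out_dim)
        n = Wh.size(1)
        pairs = torch.cat(
            [Wh.unsqueeze(2).expand(-1, -1, n, -1),
             Wh.unsqueeze(1).expand(-1, n, -1, -1)], dim=-1)  # all node pairs
        scores = self.leaky_relu(pairs @ self.a).squeeze(-1)  # (batch, nodes, nodes)
        # Checklist item: clamp raw scores before softmax so exp() cannot blow up.
        scores = torch.clamp(scores, min=-10.0, max=10.0)
        coeffs = F.softmax(scores, dim=-1)
        return coeffs @ Wh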
Looks like the problem originates in the LSTM/GRU layer. Its gradients become zero very quickly, and since the attention layer sits behind the recurrent layer in the backward path, the attention parameters receive almost no gradient and are not updated properly.
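One quick way to confirm this, as a sketch: print per-parameter gradient norms right after `backward()` and watch whether the recurrent and attention parameters collapse toward zero (`model`, `loss`, and the parameter names are placeholders):

```python
# Hypothetical diagnostic: inspect gradient magnitudes after the backward pass.
loss.backward()
for name, param in model.named_parameters():
    if param.grad is not None:
        print(f"{name}: grad norm = {param.grad.norm().item():.3e}")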
Applying a PowerTransformer to the dataset and decreasing the learning_rate of RMSprop seem to have helped the gradients a little.
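A sketch of those two changes, assuming `X` is a NumPy feature matrix and `model` already exists; the specific learning rate value is illustrative, not the one actually used:

```python
from sklearn.preprocessing import PowerTransformer
import torch

# Make feature distributions more Gaussian-like before feeding the network.
pt = PowerTransformer()                # Yeo-Johnson by default
X_transformed = pt.fit_transform(X)    # X: (n_samples, n_features) NumPy array

# Lower RMSprop learning rate (value here is just an example).
optimizer = torch.optim.RMSprop(model.parameters(), lr=1e-4)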
Increasing num_epochs to 50 and setting the gradient clip value to 5 solved the issue.
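A sketch of the training step with that clipping, using per-element value clipping (`clip_grad_value_`), which matches "clip value" most directly; `loader`, `criterion`, and `optimizer` are placeholders:

```python
import torch

clip_value = 5.0
for epoch in range(50):               # num_epochs = 50
    for x_batch, y_batch in loader:   # loader is a placeholder DataLoader
        optimizer.zero_grad()
        loss = criterion(model(x_batch), y_batch)
        loss.backward()
        # Clip each gradient element to [-clip_value, clip_value] before the step.
        torch.nn.utils.clip_grad_value_(model.parameters(), clip_value)
        optimizer.step()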
Decreasing the clip value further could be beneficial [will experiment during hyperparameter tuning].