Gradient becomes 0 in backpropagation for attention mechanism.
Consider:
[x] torch.clamp values while calculating softmax of attention coefficients (see the sketch after this list)
[x] Negative slope of LeakyReLU
[x] Initializer for W and a parameters
[x] Normalization
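A minimal sketch covering the checklist items above, assuming a GAT-style attention layer where `W` is the linear projection and `a` is the attention vector (only the names `W` and `a` come from the notes; the rest of the layer is illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionSketch(nn.Module):
    # Illustrative GAT-style attention; W and a follow the names in the checklist.
    def __init__(self, in_dim, out_dim, negative_slope=0.2):
        super().__init__()
        self.W = nn.Linear(in_dim, out_dim, bias=False)
        self.a = nn.Parameter(torch.empty(2 * out_dim, 1))
        # Checklist item: initializer for W and a (Xavier used here as one option).
        nn.init.xavier_uniform_(self.W.weight)
        nn.init.xavier_uniform_(self.a)
        # Checklist item: negative slope of LeakyReLU.
        self.leaky_relu = nn.LeakyReLU(negative_slope)

    def forward(self, h):
        # h: (batch, nodes, in_dim)
        Wh = self.W(h)                                        # (batch, nodes, out_dim)
        n = Wh.size(1)
        pairs = torch.cat(
            [Wh.unsqueeze(2).expand(-1, -1, n, -1),
             Wh.unsqueeze(1).expand(-1, n, -1, -1)], dim=-1)  # all node pairs
        scores = self.leaky_relu(pairs @ self.a).squeeze(-1)  # (batch, nodes, nodes)
        # Checklist item: clamp raw scores before softmax so exp() cannot blow up.
        scores = torch.clamp(scores, min=-10.0, max=10.0)
        coeffs = F.softmax(scores, dim=-1)
        return coeffs @ Wh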
Looks like the problem originates in the LSTM/GRU layer. Its gradients become zero very quickly, and since the attention layer sits behind the recurrent layer in the backward path, the attention parameters receive almost no gradient and are not updated properly.
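One quick way to confirm this, as a sketch: print per-parameter gradient norms right after `backward()` and watch whether the recurrent and attention parameters collapse toward zero (`model`, `loss`, and the parameter names are placeholders):

```python
# Hypothetical diagnostic: inspect gradient magnitudes after the backward pass.
loss.backward()
for name, param in model.named_parameters():
    if param.grad is not None:
        print(f"{name}: grad norm = {param.grad.norm().item():.3e}")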
Applying a PowerTransformer to the dataset and decreasing the learning_rate of RMSprop seem to have helped the gradients a little.
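A sketch of those two changes, assuming `X` is a NumPy feature matrix and `model` already exists; the specific learning rate value is illustrative, not the one actually used:

```python
from sklearn.preprocessing import PowerTransformer
import torch

# Make feature distributions more Gaussian-like before feeding the network.
pt = PowerTransformer()                # Yeo-Johnson by default
X_transformed = pt.fit_transform(X)    # X: (n_samples, n_features) NumPy array

# Lower RMSprop learning rate (value here is just an example).
optimizer = torch.optim.RMSprop(model.parameters(), lr=1e-4)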
Increasing num_epochs to 50 and setting the gradient clip value to 5 solved the issue.
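A sketch of the training step with that clipping, using per-element value clipping (`clip_grad_value_`), which matches "clip value" most directly; `loader`, `criterion`, and `optimizer` are placeholders:

```python
import torch

clip_value = 5.0
for epoch in range(50):               # num_epochs = 50
    for x_batch, y_batch in loader:   # loader is a placeholder DataLoader
        optimizer.zero_grad()
        loss = criterion(model(x_batch), y_batch)
        loss.backward()
        # Clip each gradient element to [-clip_value, clip_value] before the step.
        torch.nn.utils.clip_grad_value_(model.parameters(), clip_value)
        optimizer.step()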
Decreasing the clip value further could be beneficial [will experiment during hyperparameter tuning].