👐 OpenHands: Making Sign Language Recognition Accessible. | **NOTE:** No longer actively maintained. If you are interested in taking ownership and carrying this forward, please raise an issue.
I read your paper titled "OpenHands: Making Sign Language Recognition Accessible with Pose-based Pretrained Models across Languages" on https://arxiv.org/abs/2110.05877.
On page 4, you state:
"For the RNN model, we use a 4-layered bidirectional LSTM with a hidden layer dimension of 128, which takes as input the frame-wise pose representation of 27 keypoints with 2 coordinates each, resulting in a vector of 54 points per frame. We also use a temporal attention layer to weight the most effective frames for classification."
However, I couldn't find a definition of "temporal attention" as used in your method. Could you please explain it?
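For context on what I am asking: one common formulation of a "temporal attention" layer is soft attention over time steps, where each frame's feature vector gets a scalar score, the scores are softmax-normalized into weights, and the weighted sum of frame features is used for classification. Below is a minimal NumPy sketch of that generic formulation; the function name `temporal_attention`, the learned score vector `w`, and the shapes are my own illustrative assumptions, not necessarily the paper's exact layer.

```python
import numpy as np

def temporal_attention(H, w):
    """Generic soft temporal attention (illustrative, not the paper's definition).

    H : (T, D) array of per-frame features, e.g. BiLSTM outputs.
    w : (D,)  learned scoring vector (assumed here for illustration).
    Returns the per-frame weights (T,) and the attended context vector (D,).
    """
    scores = H @ w                       # one scalar score per frame
    scores = scores - scores.max()       # shift for softmax numerical stability
    alpha = np.exp(scores) / np.exp(scores).sum()  # weights sum to 1
    context = alpha @ H                  # weighted sum over time
    return alpha, context

# Toy example: 5 frames of 256-dim features (2 directions x 128 hidden units).
rng = np.random.default_rng(0)
H = rng.normal(size=(5, 256))
w = rng.normal(size=256)
alpha, context = temporal_attention(H, w)
```

Is this roughly what the paper means by weighting "the most effective frames", or is a different mechanism (e.g. multiplicative or multi-head attention) used?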