👐 OpenHands: Making Sign Language Recognition Accessible. | **NOTE:** No longer actively maintained. If you are interested in taking ownership and carrying this forward, please raise an issue.
I read your paper titled "OpenHands: Making Sign Language Recognition Accessible with Pose-based Pretrained Models across Languages" on https://arxiv.org/abs/2110.05877.
On page 4, you state:
"For the RNN model, we use a 4-layered bidirectional LSTM with a hidden layer dimension of 128, which takes as input the frame-wise pose representation of 27 keypoints with 2 coordinates each, resulting in a vector of 54 points per frame. We also use a temporal attention layer to weight the most effective frames for classification."
However, I couldn't find a definition of "temporal attention" as used in your method. Could you please explain it?
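For context on what I am asking: one common formulation of a "temporal attention" layer is soft attention over time steps, where each frame's feature vector gets a scalar score, the scores are softmax-normalized into weights, and the weighted sum of frame features is used for classification. Below is a minimal NumPy sketch of that generic formulation; the function name `temporal_attention`, the learned score vector `w`, and the shapes are my own illustrative assumptions, not necessarily the paper's exact layer.

```python
import numpy as np

def temporal_attention(H, w):
    """Generic soft temporal attention (illustrative, not the paper's definition).

    H : (T, D) array of per-frame features, e.g. BiLSTM outputs.
    w : (D,)  learned scoring vector (assumed here for illustration).
    Returns the per-frame weights (T,) and the attended context vector (D,).
    """
    scores = H @ w                       # one scalar score per frame
    scores = scores - scores.max()       # shift for softmax numerical stability
    alpha = np.exp(scores) / np.exp(scores).sum()  # weights sum to 1
    context = alpha @ H                  # weighted sum over time
    return alpha, context

# Toy example: 5 frames of 256-dim features (2 directions x 128 hidden units).
rng = np.random.default_rng(0)
H = rng.normal(size=(5, 256))
w = rng.normal(size=256)
alpha, context = temporal_attention(H, w)
```

Is this roughly what the paper means by weighting "the most effective frames", or is a different mechanism (e.g. multiplicative or multi-head attention) used?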