This work develops a Deep Learning model that takes the optical flow produced by the gestures of a signer performing a specific ASL sign and predicts the correct gloss. It explores models that use both spatial and temporal information. The dataset contains a total of 20 glosses.
cv2: sudo apt-get install python-opencv
numpy: sudo pip install numpy
tensorflow: pip install --upgrade tensorflow
scipy: sudo pip install scipy
scikit-learn: sudo pip install scikit-learn
pillow: sudo pip install pillow
h5py: sudo pip install h5py
keras: sudo pip install keras
matplotlib: sudo pip install matplotlib
Execute the following command: python experiments.py agent
where agent can be any of these 3 values: {random, bias, conv}
Random: It chooses a gloss uniformly at random.
Bias: It always chooses the most frequent gloss.
Convolutional Model: A convolutional network trained with supervised learning. The input to the network is the cumulative optical flow of the video of the person performing a sign; the output is the predicted gloss. The model was trained with a categorical cross-entropy loss, and the weights were optimized with the Adam optimizer.
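The convolutional agent described above could be sketched in Keras roughly as follows. The architecture, layer sizes, and input resolution are assumptions for illustration; only the 20-class softmax output, the categorical cross-entropy loss, and the Adam optimizer come from the description:

```python
from keras.models import Sequential
from keras.layers import Input, Conv2D, MaxPooling2D, Flatten, Dense, Dropout

NUM_GLOSSES = 20  # number of glosses in the dataset

def build_conv_model(input_shape=(224, 224, 2)):
    """Small CNN over a 2-channel cumulative-optical-flow image (hypothetical layout)."""
    model = Sequential([
        Input(shape=input_shape),
        Conv2D(32, (3, 3), activation='relu'),
        MaxPooling2D((2, 2)),
        Conv2D(64, (3, 3), activation='relu'),
        MaxPooling2D((2, 2)),
        Flatten(),
        Dense(128, activation='relu'),
        Dropout(0.5),
        Dense(NUM_GLOSSES, activation='softmax'),  # one probability per gloss
    ])
    model.compile(loss='categorical_crossentropy',  # loss used in the project
                  optimizer='adam',                 # optimizer used in the project
                  metrics=['accuracy'])
    return model
```

The two input channels hold the accumulated horizontal and vertical flow components.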
Currently there is no way to pass parameters such as the number of epochs, the dataset split fractions, the pooling and kernel sizes, or the model name through the command line. To change them, open experiments.py, go to the function of the specific agent, and edit the parameters there.
The final total loss.
The top 4 accuracy.
A confusion matrix.
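The top-4 accuracy reported above can be computed from the predicted class probabilities. A minimal NumPy sketch (the function name is illustrative, not the repository's code):

```python
import numpy as np

def top_k_accuracy(y_true, y_prob, k=4):
    """Fraction of samples whose true label is among the k highest-scoring classes.

    y_true: (N,) integer class labels
    y_prob: (N, C) predicted class scores or probabilities
    """
    # Indices of the k largest scores for each sample.
    topk = np.argsort(y_prob, axis=1)[:, -k:]
    hits = [y_true[i] in topk[i] for i in range(len(y_true))]
    return float(np.mean(hits))
```

With k=1 this reduces to ordinary (rank-1) accuracy.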
It saves some visualisations of the training in the results directory, including accuracy per epoch, loss per epoch, and a confusion matrix image. It also saves the trained model and its weights.
Rank1 Accuracy: 55.4% Rank2 Accuracy: 66.2% Rank3 Accuracy: 72.2% Rank4 Accuracy: 77.4%
Confusion Matrix
Top 4 Accuracy Per Epoch
Validation and Training Loss
t-SNE applied to the feature vectors produced by the trained convolutional model for each video of the signer. This shows in 2D how the model learns to create similar representations (small Euclidean distance) for the same and similar classes, and different representations for different classes.