This work develops a Deep Learning model that takes the optical flow produced by the gestures of a signer performing a specific ASL sign and predicts the correct gloss. It explores models that use both spatial and temporal information. The dataset contains a total of 20 glosses.
cv2: sudo apt-get install python-opencv
numpy: sudo pip install numpy
tensorflow: pip install --upgrade tensorflow
scipy: sudo pip install scipy
scikit-learn: sudo pip install scikit-learn
pillow: sudo pip install pillow
h5py: sudo pip install h5py
keras: sudo pip install keras
matplotlib: sudo pip install matplotlib
Execute the following command: python experiments.py agent
where agent can be any of these 3 values: {random, bias, conv}
Random: It chooses a gloss uniformly at random.
Bias: It always chooses the most frequent gloss.
Convolutional Model: A convolutional network trained with supervised learning. The input to the network is the cumulative optical flow of the video of the person performing a sign; the output is the predicted gloss. The model was trained with a categorical cross-entropy loss, and the weights were optimized with the Adam optimizer.
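The convolutional agent described above could be sketched in Keras roughly as follows. The architecture, layer sizes, and input resolution are assumptions for illustration; only the 20-class softmax output, the categorical cross-entropy loss, and the Adam optimizer come from the description:

```python
from keras.models import Sequential
from keras.layers import Input, Conv2D, MaxPooling2D, Flatten, Dense, Dropout

NUM_GLOSSES = 20  # number of glosses in the dataset

def build_conv_model(input_shape=(224, 224, 2)):
    """Small CNN over a 2-channel cumulative-optical-flow image (hypothetical layout)."""
    model = Sequential([
        Input(shape=input_shape),
        Conv2D(32, (3, 3), activation='relu'),
        MaxPooling2D((2, 2)),
        Conv2D(64, (3, 3), activation='relu'),
        MaxPooling2D((2, 2)),
        Flatten(),
        Dense(128, activation='relu'),
        Dropout(0.5),
        Dense(NUM_GLOSSES, activation='softmax'),  # one probability per gloss
    ])
    model.compile(loss='categorical_crossentropy',  # loss used in the project
                  optimizer='adam',                 # optimizer used in the project
                  metrics=['accuracy'])
    return model
```

The two input channels hold the accumulated horizontal and vertical flow components.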
Currently there is no way to pass parameters such as the number of epochs, the dataset split fractions, the pooling and kernel sizes, or the model name through the command line. To change them, open experiments.py, go to the function of the specific agent, and edit the parameters there.
The final total loss.
The top 4 accuracy.
A confusion matrix.
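The top-4 accuracy reported above can be computed from the predicted class probabilities. A minimal NumPy sketch (the function name is illustrative, not the repository's code):

```python
import numpy as np

def top_k_accuracy(y_true, y_prob, k=4):
    """Fraction of samples whose true label is among the k highest-scoring classes.

    y_true: (N,) integer class labels
    y_prob: (N, C) predicted class scores or probabilities
    """
    # Indices of the k largest scores for each sample.
    topk = np.argsort(y_prob, axis=1)[:, -k:]
    hits = [y_true[i] in topk[i] for i in range(len(y_true))]
    return float(np.mean(hits))
```

With k=1 this reduces to ordinary (rank-1) accuracy.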
It saves some visualisations of the training in the results directory, including accuracy per epoch, loss per epoch, and a confusion matrix image. It also saves the trained model and its weights.
Rank1 Accuracy: 55.4% Rank2 Accuracy: 66.2% Rank3 Accuracy: 72.2% Rank4 Accuracy: 77.4%
Confusion Matrix
Top 4 Accuracy Per Epoch
Validation and Training Loss
t-SNE applied to the feature vectors produced by the trained convolutional model for each video of the signer. This shows in 2D how the model learns to create similar representations (small Euclidean distance) for the same and similar classes, and different representations for different classes.