Classification of the ASL Alphabet with deployment to webcam ("realtime" video) as well as an interactive online demo (images) at https://gradio.app/g/cogsci2/Sign1
The technical goal was to check the viability of using state-of-the-art techniques to translate ASL into written English characters. Based on the success of this experiment, we think it may be possible to use naive-but-modern CNN models for ASL recognition.
Our latest model uses an EfficientNet-B4 with TensorFlow-pretrained weights. We achieve 92% accuracy on a third-party dataset that was designed specifically to challenge ASL alphabet classifiers.
A slide overview of the project is available here: https://docs.google.com/presentation/d/1CGssA6PaNyEU4xf-YNqp3lbroWTFlCutIiDqM3pc6yE/edit?usp=sharing
Using a GPU with only 8 GB of RAM presented its own challenges when working with large modern CNN architectures. To overcome this limitation we used several techniques:
We wrapped pre-trained models in `nn.Sequential` with checkpoint modules. This let us add gradient checkpointing to almost any pre-loaded model organized as an `nn.Sequential`, and we successfully applied the technique to multiple ResNet variants as well as EfficientNet and DenseNet. Some notes on this franken-model:
We achieve similar results (~92%) using EfficientNet-Lite4, which is a more feasible model to deploy to mobile phones. (https://github.com/cogsci2/Sign1/blob/master/notebooks/Archive/Sign4%20-%20EfficientNetLITE.ipynb)
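The `nn.Sequential` + gradient-checkpointing approach described above can be sketched with PyTorch's built-in `checkpoint_sequential`. The toy backbone and class count below are illustrative stand-ins, not the project's exact model:

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint_sequential

# Stand-in backbone; in practice this would be a pre-trained
# ResNet/EfficientNet/DenseNet reorganized as an nn.Sequential.
backbone = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(32, 29),  # 26 letters + space/delete/nothing
)

x = torch.randn(4, 3, 64, 64, requires_grad=True)

# Run the Sequential in 2 checkpointed segments: activations inside each
# segment are recomputed during backward instead of stored, trading compute
# for a smaller GPU memory footprint.
out = checkpoint_sequential(backbone, 2, x)
out.sum().backward()
```

Because any model expressed as an `nn.Sequential` can be split into segments this way, the same few lines work across architectures.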
We used many state-of-the-art techniques and attempted over 50 training runs across 9 different architectures. In the end, we decided on a modified EfficientNet architecture.
About half of our data was obtained externally from the following sources:
https://www.kaggle.com/grassknoted/asl-alphabet
https://empslocal.ex.ac.uk/people/staff/np331/index.php?section=FingerSpellingDataset
Our primary challenge holdout set was created by Dan Rasband. This set was quite interesting: many of the signs were made at what looks to be a construction site, with varied backgrounds and occasionally strong backlighting. This could easily fool a model into thinking the background was relevant to the sign. https://www.kaggle.com/danrasband/asl-alphabet-test
We also set out to create our own data. We developed a technique to capture and label image frames from a webcam: pressing a character key while making a sign saved the frame to disk, automatically labelled with the pressed character.
Using this technique, we were able to create images at a rate of about 8 frames per second, but there were issues: the webcam interface we used didn't allow us to move from one position to another quickly enough, and any time we moved too fast, we got many images with motion blur.
Without the ability to move from position to position, the images obtained would have been even more homogeneous. Because of that, we found it more effective to video the sign with a cell phone while continuously moving and changing backgrounds, lighting, camera angles, and sign variations. We could even walk through different locales while holding the sign.
To process these videos, we developed several small utilities.
We tried hard to vary our internal data as much as we could within our means, and we really tried to challenge the models. Most of the time, the models barely blinked at our obfuscation efforts.
We developed a webcam deployment using OpenCV. This deployment allows for semi-realtime interaction with the model. (https://github.com/cogsci2/Sign1/blob/master/notebooks/OpenCV_cam_test.ipynb)
We also deployed the model to the web using Gradio. This gives anybody with a web browser (desktop, laptop, or cell phone) the ability to upload a snapshot and get the English translation back. Essentially, this is a dictionary-type reference system.
├── LICENSE
├── Makefile <- Makefile with commands like `make data` or `make train`
├── README.md <- The top-level README for developers using this project.
├── data
│ ├── external <- Data from third party sources.
│ ├── interim <- Intermediate data that has been transformed.
│ ├── processed <- The final, canonical data sets for modeling.
│ └── raw <- The original, immutable data dump.
│
├── docs <- A default Sphinx project; see sphinx-doc.org for details
│
├── models <- Trained and serialized models, model predictions, or model summaries
│
├── notebooks <- Jupyter notebook source code.
│
├── references <- Data dictionaries, manuals, and all other explanatory materials.
│
├── reports <- Generated analysis as HTML, PDF, LaTeX, etc.
│ └── figures <- Generated graphics and figures to be used in reporting
│
├── requirements.txt <- The requirements file for reproducing the analysis environment, e.g.
│ generated with `pip freeze > requirements.txt`
│
├── setup.py <- makes project pip installable (pip install -e .) so src can be imported
│
└── tox.ini <- tox file with settings for running tox; see tox.readthedocs.io