
Comments on Capstone 1: Data Wrangling #1

Open brooksandrew opened 3 years ago

brooksandrew commented 3 years ago

RE https://github.com/cogsci2/Sign1/blob/master/notebooks/Sign2%20-%20Implementing%20Mixup-20210113-best.ipynb

Overall looks good. Just a few comments:

frankfletcher commented 3 years ago

Thanks, Andrew. To your points:

I had several issues with the original dataset, which ultimately led me to supplement it.

  1. Many samples were taken at very odd angles that are unlikely to be seen by the average viewer.
  2. There were lighting and other quality issues (like visible interlacing) with most of the data. I wanted those to be represented, but I also wanted cleaner signs represented.
  3. I felt some of the signs were not correct, so I had to remove them.
  4. Much of the provided data isolated the hands, which is not the normal situation. Normally, the camera would also capture much of the person as well as the background.
  5. Most of the data was very similar, and I wanted to increase the variety, including different backgrounds, different lighting, etc.

The current proof-of-concept evaluation is pretty minimal. The static evaluation consists of a subset of the original data, which means that the same conditions that exist in training also exist in the evaluation. That's a real issue that would need to be addressed if I were to move forward with the project.
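
One way I can see to address it would be to hold out whole capture sessions (or signers) rather than random frames. A minimal sketch, assuming scikit-learn and a `session_ids` array recording which capture session each sample came from (both are placeholders, not code from the notebook):

```python
from sklearn.model_selection import GroupShuffleSplit

def session_holdout_split(samples, labels, session_ids, test_size=0.2, seed=0):
    """Split so no capture session appears in both train and test,
    keeping the evaluation conditions from simply mirroring training."""
    splitter = GroupShuffleSplit(n_splits=1, test_size=test_size, random_state=seed)
    train_idx, test_idx = next(splitter.split(samples, labels, groups=session_ids))
    return train_idx, test_idx
```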

The second part of the evaluation is the use of video analysis. This evaluation is quite different in that there is a threshold to exceed before a letter is translated. It's not just a straight argmax. Also, since one sign may be spread across 10 frames, there is more opportunity for the model to find the translation. In the static evaluation, only one frame is provided.
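
For concreteness, here is a minimal sketch of that decision rule; the threshold value and names are illustrative rather than the exact code from the notebook:

```python
import numpy as np

def translate_video(frame_probs, threshold=0.9):
    """Turn per-frame class probabilities (n_frames x n_classes) into letters.

    A frame only contributes a letter when its top probability clears the
    threshold, and consecutive repeats are collapsed so one sign held
    across ~10 frames yields a single letter rather than ten.
    """
    letters = []
    for probs in frame_probs:
        top = int(np.argmax(probs))
        if probs[top] < threshold:
            continue  # not confident enough: emit nothing for this frame
        if letters and letters[-1] == top:
            continue  # same sign still being held; don't repeat the letter
        letters.append(top)
    return letters
```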

The use of video makes it difficult to establish metrics. How does one define ground truth in a video? The only way I can think to do it is to pre-record the video and label each frame. That pretty much describes the setup for creating the images right now: as the frames are captured, they are automatically labelled (the label is predefined at the start of capture). We can then run evaluations on any new data that way. It's not the same thing as video, but it's close.
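
The capture-and-label setup is roughly the following (a sketch assuming OpenCV; the function name, paths, and frame count are placeholders):

```python
import cv2
from pathlib import Path

def capture_labelled_frames(label, out_dir="data/captured", n_frames=200):
    """Save webcam frames under a label chosen before capture starts,
    so every frame is automatically labelled as it is written to disk."""
    target = Path(out_dir) / label
    target.mkdir(parents=True, exist_ok=True)
    cap = cv2.VideoCapture(0)  # default webcam
    saved = 0
    while saved < n_frames:
        ok, frame = cap.read()
        if not ok:
            break
        cv2.imwrite(str(target / f"{label}_{saved:05d}.png"), frame)
        saved += 1
    cap.release()

# e.g. capture_labelled_frames("A") files each new frame under data/captured/A/
```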

In my current anecdotal experiments, the recognition is fairly good on both of my roommates (an older gentleman and a younger woman, both with Caucasian features). We have also experimented with slightly different lighting scenarios (indoors, outdoors, front lighting, back lighting, side lighting, etc.). For an experimental phase, the model remains fairly robust in these environments.

The model has more difficulty with noisy backgrounds, I assume because it's difficult to differentiate the hand signs from the background.

Properly creating a (non-experimental) model and testing it would require outside help and would include (but not be limited to):

  1. different gender signers
  2. different race signers
  3. different body types
  4. different ages
  5. different lighting conditions
  6. different backgrounds
  7. different cameras
  8. different heights and camera orientations
  9. different levels of "hand shake" while "holding" the camera.
  10. different body orientations (optional) - signing at a person who is not near the camera.

Given the time constraints and the proof-of-concept nature of this project, this work has not been started yet. In reality, I would redesign the model pipeline before any of that work.

It's a very rudimentary project, but it has really taught me a lot about deep learning in general: its strengths and weaknesses, how to tune models, regularization, using computer vision, video analysis, testing, and more.