
Comments on Capstone 1: Data Wrangling #1

Open brooksandrew opened 3 years ago

brooksandrew commented 3 years ago

RE https://github.com/cogsci2/Sign1/blob/master/notebooks/Sign2%20-%20Implementing%20Mixup-20210113-best.ipynb

Overall looks good. Just a few comments:

frankfletcher commented 3 years ago

Thanks, Andrew. To your points:

I had several issues with the original dataset, which ultimately led me to supplement it.

  1. Many samples were taken at very odd angles that are unlikely to be seen by the average viewer.
  2. There were lighting and other quality issues (like visible interlacing) with most of the data. I wanted those to be represented, but I also wanted cleaner signs represented.
  3. I felt some of the signs were not correct, so I had to remove them.
  4. Much of the provided data isolated the hands, which is not the normal situation. Normally, the camera would also capture much of the person as well as the background.
  5. Most of the data was very similar, and I wanted to increase the variety, including different backgrounds, different lighting, etc.

The current proof-of-concept evaluation is pretty minimal. The static evaluation consists of a subset of the original data, which means that the same conditions that exist in training also exist in the evaluation. That's a real issue that would need to be addressed if I were to move forward with the project.
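
One way I can see to address it would be to hold out whole capture sessions (or signers) rather than random frames. A minimal sketch, assuming scikit-learn and a `session_ids` array recording which capture session each sample came from (both are placeholders, not code from the notebook):

```python
from sklearn.model_selection import GroupShuffleSplit

def session_holdout_split(samples, labels, session_ids, test_size=0.2, seed=0):
    """Split so no capture session appears in both train and test,
    keeping the evaluation conditions from simply mirroring training."""
    splitter = GroupShuffleSplit(n_splits=1, test_size=test_size, random_state=seed)
    train_idx, test_idx = next(splitter.split(samples, labels, groups=session_ids))
    return train_idx, test_idx
```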

The second part of the evaluation is the use of video analysis. This evaluation is quite different in that there is a threshold to exceed before a letter is translated. It's not just a straight argmax. Also, since one sign may be spread across 10 frames, there is more opportunity for the model to find the translation. In the static evaluation, only one frame is provided.
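
For concreteness, here is a minimal sketch of that decision rule; the threshold value and names are illustrative rather than the exact code from the notebook:

```python
import numpy as np

def translate_video(frame_probs, threshold=0.9):
    """Turn per-frame class probabilities (n_frames x n_classes) into letters.

    A frame only contributes a letter when its top probability clears the
    threshold, and consecutive repeats are collapsed so one sign held
    across ~10 frames yields a single letter rather than ten.
    """
    letters = []
    for probs in frame_probs:
        top = int(np.argmax(probs))
        if probs[top] < threshold:
            continue  # not confident enough: emit nothing for this frame
        if letters and letters[-1] == top:
            continue  # same sign still being held; don't repeat the letter
        letters.append(top)
    return letters
```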

The use of video makes it difficult to establish metrics. How does one define ground truth in a video? The only way I can think to do it is to pre-record the video and label each frame. That pretty much describes the setup for creating the images right now: as the frames are captured, they are automatically labelled (the label is predefined at the start of capture). We can then run evaluations on any new data that way. It's not the same thing as video, but it's close.
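
The capture-and-label setup is roughly the following (a sketch assuming OpenCV; the function name, paths, and frame count are placeholders):

```python
import cv2
from pathlib import Path

def capture_labelled_frames(label, out_dir="data/captured", n_frames=200):
    """Save webcam frames under a label chosen before capture starts,
    so every frame is automatically labelled as it is written to disk."""
    target = Path(out_dir) / label
    target.mkdir(parents=True, exist_ok=True)
    cap = cv2.VideoCapture(0)  # default webcam
    saved = 0
    while saved < n_frames:
        ok, frame = cap.read()
        if not ok:
            break
        cv2.imwrite(str(target / f"{label}_{saved:05d}.png"), frame)
        saved += 1
    cap.release()

# e.g. capture_labelled_frames("A") files each new frame under data/captured/A/
```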

In my current anecdotal experiments, the recognition is fairly good on both of my roommates (an older gentleman and a younger woman, both with Caucasian features). We have also experimented with slightly different lighting scenarios (indoors, outdoors, front lighting, back lighting, side lighting, etc.). For an experimental phase, the model remains fairly robust in these environments.

The model has more difficulty with noisy backgrounds, I assume because it's difficult to differentiate the hand signs from the background.

Properly creating a (non-experimental) model and testing it would require outside help and would include (but not be limited to):

  1. different gender signers
  2. different race signers
  3. different body types
  4. different ages
  5. different lighting conditions
  6. different backgrounds
  7. different cameras
  8. different heights and camera orientations
  9. different levels of "hand shake" while "holding" the camera.
  10. different body orientations (optional) - signing at a person who is not near the camera.

Given the time constraints and the proof-of-concept nature of this project, this work has not been started yet. In reality, I would redesign the model pipeline before any of that work.

It's a very rudimentary project, but it has really taught me a lot about deep learning in general: its strengths and weaknesses, how to tune models, regularization, using computer vision, video analysis, testing, and more.