Overview
This design document covers the action perception component of the pipeline. After detecting and classifying hand objects in each video frame, we need to interpret an action over a range of consecutive frames. This document details how action perception will be designed and initially built. In particular, we will focus on a deliberately simple implementation of action perception so that we can run the project end to end.
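As a working assumption (the exact upstream API is not fixed here), the input to action perception can be thought of as a per-frame list of classified hand observations; the names below are illustrative only, not the actual pipeline interface.

from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class HandObservation:
    """One detected + classified hand in a single video frame (illustrative)."""
    label: str                       # classifier output, e.g. "point_foreward"
    box: Tuple[int, int, int, int]   # x_min, y_min, x_max, y_max in pixels
    score: float                     # detection/classification confidence

# Action perception consumes consecutive frames, each contributing
# zero or more observations (zero when a detection is missed).
FrameObservations = List[HandObservation]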
Context
Action perception is perhaps the hardest component. Whereas training the models has been done before and can be customized with ease, this part requires creating a unique, intuitive, yet robust user experience. For example, we want shooting a gun to be as simple as pointing a finger at the camera and pulling the trigger. However, what happens when a detection is missing, or when we misclassify the hand? What happens when two interacting hands appear in the frame? These are some of the many problems that will come up in creating a fluid AR FPS experience.
Goals
Draw a DFA of the different states a user's actions can take us to. I will insert a drawing of this DFA later.
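Until that drawing exists, here is a rough sketch of what such a DFA could look like in code; the state names and transitions are placeholders I am assuming for illustration, not the final design.

# Placeholder DFA sketch (assumed states/transitions, not the final design).
# The state is advanced one frame at a time by the hand classification label.
IDLE, AIMING, SHOT = "IDLE", "AIMING", "SHOT"

TRANSITIONS = {
    (IDLE, "point_foreward"): AIMING,     # finger pointed at the camera
    (AIMING, "point_foreward"): AIMING,   # keep aiming
    (AIMING, "point_up"): SHOT,           # flick up = trigger pulled
    (SHOT, "point_foreward"): AIMING,     # return to aiming after the shot
}

def step(state, label):
    """Advance the DFA by one frame; unrecognized labels leave the state unchanged."""
    return TRANSITIONS.get((state, label), state)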
Ensure basic actions, ignoring potential corner cases, are interpreted and output corresponding instructions to send to the keyboard + mouse. The only basic action we can interpret right now is shooting (point_foreward to point_up). To test this, we need to annotate a video alongside the corresponding actions. We can store the action inside the XML files as an action attribute. We will keep it simple initially by just writing interpretations for an array of strings, as sketched below.
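A minimal sketch of that interpretation, assuming the shoot action is exactly the point_foreward-to-point_up transition; emit_instruction is a hypothetical stand-in for whatever keyboard + mouse backend we choose later.

from typing import List

def emit_instruction(instruction: str) -> None:
    # Hypothetical stand-in for the keyboard + mouse backend
    # (e.g. a synthetic left mouse click for "shoot").
    print(f"emit: {instruction}")

def interpret(labels: List[str]) -> List[str]:
    """Interpret consecutive per-frame hand labels into game instructions.

    Corner cases (missed detections, misclassifications, two hands) are
    intentionally ignored in this first pass.
    """
    instructions = []
    for prev, curr in zip(labels, labels[1:]):
        if prev == "point_foreward" and curr == "point_up":
            instructions.append("shoot")
    for instruction in instructions:
        emit_instruction(instruction)
    return instructions

# Example with labels that would come from the action attribute of the
# annotated XML files for a test video:
assert interpret(["point_foreward", "point_foreward", "point_up"]) == ["shoot"]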
Milestones
Milestone 1 - See whether the detector + classifier is trained well enough to perceive hands; otherwise, use ground-truth boxes with the classifier.
Milestone 2 - Build basic action perception: TBD
Milestone 3 - Draw DFA: TBD