A YUY2-formatted webcam frame is captured using the MediaFoundationWebcam
class and is fed into the shader pipeline.
Convert_YUY2.comp
decodes it into a full RGB image (one shader invocation for every two pixels to be converted) and ConvertToHSV.comp
converts it to HSV format (one shader invocation per pixel).
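As a rough illustration of the decode step, the sketch below converts one YUY2 macropixel (two pixels sharing chroma) into two RGB pixels on the CPU, assuming the common BT.601 integer coefficients; the names and coefficients are illustrative, not the shader's actual code.

```cpp
#include <algorithm>
#include <cstdint>

// One YUY2 macropixel (Y0 U Y1 V) describes two horizontally adjacent pixels
// that share the same chroma (U, V) values.
struct Rgb { std::uint8_t r, g, b; };

static std::uint8_t clampToByte(int v)
{
    return static_cast<std::uint8_t>(std::min(std::max(v, 0), 255));
}

static Rgb yuvToRgb(int y, int u, int v)
{
    // Common BT.601 integer approximation; the real shader may use different maths.
    const int c = y - 16, d = u - 128, e = v - 128;
    return { clampToByte((298 * c + 409 * e + 128) >> 8),
             clampToByte((298 * c - 100 * d - 208 * e + 128) >> 8),
             clampToByte((298 * c + 516 * d + 128) >> 8) };
}

// Decodes one 4-byte macropixel into two RGB pixels, mirroring the
// "one invocation per two pixels" split used by Convert_YUY2.comp.
static void decodeMacropixel(const std::uint8_t* yuy2, Rgb& left, Rgb& right)
{
    left  = yuvToRgb(yuy2[0], yuy2[1], yuy2[3]);
    right = yuvToRgb(yuy2[2], yuy2[1], yuy2[3]);
}
```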
Threshold.comp
receives the HSV frame and an array of trackers and determines which tracker, if any, detected each pixel as the correct marker colour (one shader invocation per pixel loops over every tracker looking for a match).
Each tracker array element contains upper and lower HSV colours that define the detection range, plus the colour to assign to any pixel that tracker detects.
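The per-pixel work can be pictured with a small CPU sketch like the one below; the struct layout, the first-match priority, and the names are assumptions for illustration rather than the shader's actual code.

```cpp
#include <vector>

struct Hsv { float h, s, v; };

// Illustrative tracker record: detection range plus the colour to write for a match.
struct Tracker {
    Hsv lower, upper;     // inclusive HSV detection range
    Hsv outputColour;     // colour assigned to pixels this tracker detects
};

static bool inRange(const Hsv& p, const Hsv& lo, const Hsv& hi)
{
    return p.h >= lo.h && p.h <= hi.h &&
           p.s >= lo.s && p.s <= hi.s &&
           p.v >= lo.v && p.v <= hi.v;
}

// Sketch of the per-pixel work Threshold.comp performs: here the first tracker
// whose range contains the pixel claims it; otherwise the pixel is emptied.
static Hsv thresholdPixel(const Hsv& pixel, const std::vector<Tracker>& trackers)
{
    for (const Tracker& t : trackers)
        if (inRange(pixel, t.lower, t.upper))
            return t.outputColour;
    return Hsv{0.0f, 0.0f, 0.0f};   // "empty" pixel
}
```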
A rough form of morphological opening and closing is performed on the results to reduce false positives.
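A CPU sketch of that idea, using a simple 3x3 neighbourhood, is shown below: opening (erode then dilate) removes isolated false positives, while closing (dilate then erode) fills small holes in detected regions. The shader's exact neighbourhood and thresholds may differ.

```cpp
#include <vector>

using Mask = std::vector<std::vector<int>>;   // 1 = detected, 0 = empty

static Mask morph(const Mask& in, bool erode)
{
    const int h = static_cast<int>(in.size());
    const int w = h ? static_cast<int>(in[0].size()) : 0;
    Mask out(h, std::vector<int>(w, 0));
    for (int y = 0; y < h; ++y)
        for (int x = 0; x < w; ++x) {
            int hits = 0, total = 0;
            for (int dy = -1; dy <= 1; ++dy)
                for (int dx = -1; dx <= 1; ++dx) {
                    const int ny = y + dy, nx = x + dx;
                    if (ny < 0 || ny >= h || nx < 0 || nx >= w) continue;
                    ++total;
                    hits += in[ny][nx];
                }
            // Erosion keeps a pixel only if the whole neighbourhood is set;
            // dilation sets it if any neighbour is set.
            out[y][x] = erode ? (hits == total) : (hits > 0);
        }
    return out;
}

static Mask opening(const Mask& m) { return morph(morph(m, true), false); }
static Mask closing(const Mask& m) { return morph(morph(m, false), true); }
```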
ObjectBoundingBoxSearch_Pass1.comp
(one shader invocation per sample per tracker) works on "samples" within the image. A sample is a grid of pixels, and it counts as "detected" for a tracker if at least half of its pixels match the tracker's colour.
Due to the thresholding, all pixels will either be one of the tracker colours or be empty (the HSV value (0,0,0,0)). The shader makes a number of "passes" to determine whether adjoining samples were also detected by the same tracker and how big the entire contiguous area is.
Finally, each tracker is assigned information about the largest of the contiguous areas detected in its colour so that this information can be used in the gesture detection.
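The per-sample test can be sketched on the CPU as follows; the sample size, names, and data layout are illustrative assumptions, not taken from the shader.

```cpp
#include <vector>

// A sample is a block of pixels; it is "detected" for a tracker when at least
// half of those pixels carry that tracker's colour.
struct LabelImage {
    int width = 0, height = 0;
    std::vector<int> trackerIds;   // one entry per pixel: tracker index, or -1 for empty
};

static bool sampleDetected(const LabelImage& img, int trackerId,
                           int sampleX, int sampleY, int sampleSize)
{
    int matches = 0, total = 0;
    for (int y = sampleY; y < sampleY + sampleSize && y < img.height; ++y)
        for (int x = sampleX; x < sampleX + sampleSize && x < img.width; ++x) {
            ++total;
            if (img.trackerIds[y * img.width + x] == trackerId)
                ++matches;
        }
    return total > 0 && matches * 2 >= total;   // at least half match
}
```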
DetectGestures.comp
(one shader invocation for every rule of every gesture) receives the tracker information as well as a set of rules for each gesture. It returns whether each gesture has been detected based on the location and orientation of each tracker.
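A rough CPU sketch of how such rules might be evaluated is shown below; the rule kinds and field names are invented for illustration and do not match the real rule definitions.

```cpp
#include <vector>

struct TrackerResult { float centreX, centreY, angle; bool found; };

enum class RuleKind { TrackerFound, TrackerAbove, TrackerLeftOf };

struct Rule { RuleKind kind; int trackerA; int trackerB; };

static bool ruleHolds(const Rule& r, const std::vector<TrackerResult>& trackers)
{
    const TrackerResult& a = trackers[r.trackerA];
    const TrackerResult& b = trackers[r.trackerB];
    switch (r.kind) {
    case RuleKind::TrackerFound:  return a.found;
    case RuleKind::TrackerAbove:  return a.found && b.found && a.centreY < b.centreY;
    case RuleKind::TrackerLeftOf: return a.found && b.found && a.centreX < b.centreX;
    }
    return false;
}

// A gesture is detected only when every one of its rules holds.
static bool gestureDetected(const std::vector<Rule>& rules,
                            const std::vector<TrackerResult>& trackers)
{
    for (const Rule& r : rules)
        if (!ruleHolds(r, trackers)) return false;
    return true;
}
```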
Back on the CPU, the GestureInput
related to each gesture is passed to the Gesture
class along with whether the gesture was detected by the GPU. It uses the Windows API SendInput
function to send or cancel a direct input to whatever window has focus, depending on whether the gesture was detected this frame and whether it was detected last frame.
If the focused window is a game, the end result is that each gesture acts to control the game just like a direct keyboard, mouse, or controller input would.
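A minimal sketch of that send-or-cancel step, using the Win32 SendInput function; the helper names and key handling are illustrative, not the project's actual code.

```cpp
#include <windows.h>

// Presses or releases a key through the Win32 SendInput function.
static void sendKey(WORD virtualKey, bool press)
{
    INPUT input = {};
    input.type = INPUT_KEYBOARD;
    input.ki.wVk = virtualKey;
    input.ki.dwFlags = press ? 0 : KEYEVENTF_KEYUP;
    SendInput(1, &input, sizeof(INPUT));
}

// Called once per frame per gesture: a newly detected gesture presses its key,
// and a gesture that has just stopped being detected releases (cancels) it.
static void updateGestureKey(WORD virtualKey, bool detectedNow, bool detectedLastFrame)
{
    if (detectedNow && !detectedLastFrame)
        sendKey(virtualKey, true);
    else if (!detectedNow && detectedLastFrame)
        sendKey(virtualKey, false);
}
```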
Each instance of the Tracker
class represents one marker you want to track.
It has min and max thresholds for Hue, Saturation, and Value that represent what the colour range of the marker should be.
An OpenCV webcam instance, controlled by the "Webcam" class, provides a frame of video that is fed to the Tracker.track()
method.
This then calls Tracker.isolateColours()
, which uses OpenCV functions to isolate the pixels that fall within the tracker's colour range, leaving the marker as a white area on a black background.
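Since the exact OpenCV calls are not listed here, the following is only a plausible sketch of what isolateColours() might look like, assuming a BGR input frame and the Tracker's min/max HSV thresholds; the project's actual implementation may differ.

```cpp
#include <opencv2/imgproc.hpp>

cv::Mat isolateColours(const cv::Mat& frameBgr,
                       const cv::Scalar& hsvMin, const cv::Scalar& hsvMax)
{
    cv::Mat hsv, mask;
    cv::cvtColor(frameBgr, hsv, cv::COLOR_BGR2HSV);   // convert to HSV
    cv::inRange(hsv, hsvMin, hsvMax, mask);           // white where the colour matches
    return mask;                                      // binary image: marker shows as white
}
```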
The track method can now use the OpenCV moments()
function to work out the x and y co-ordinates of the centre of the (hopefully only) white area that represents the marker.
These co-ordinates are then reduced down into whether the marker is at the top, middle, or bottom of the screen and whether it is on the far left, middle left, centre, middle right, or far right (placing it in any one of fifteen different grid squares).
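The centroid and grid calculation might look roughly like this, using OpenCV's moments(); the 3x5 quantisation mirrors the fifteen grid squares described above, while the function and type names are illustrative.

```cpp
#include <opencv2/imgproc.hpp>

struct GridPosition { int row; int column; };   // row 0..2, column 0..4

GridPosition locateMarker(const cv::Mat& mask)
{
    const cv::Moments m = cv::moments(mask, /*binaryImage=*/true);
    if (m.m00 == 0.0) return { -1, -1 };               // nothing detected
    const double cx = m.m10 / m.m00;                   // centroid x
    const double cy = m.m01 / m.m00;                   // centroid y
    return { static_cast<int>(cy * 3.0 / mask.rows),   // top, middle, or bottom
             static_cast<int>(cx * 5.0 / mask.cols) }; // far left ... far right
}
```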
The process is repeated for all markers and then the individual position of a marker, or its relative position to another marker, can be used by the GestureDetection
class to work out whether the markers are positioned such that they form a gesture that is linked to a keyboard input. This is all handled by the GestureInput
class, which has one instance per gesture and contains a set of GestureRule
s that have to be met in order for the gesture to be detected. It also contains the last state of the gesture (detected or not) and information on what keyboard or mouse input needs to be sent to the system if the gesture is detected.
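A hypothetical sketch of how those pieces could fit together is shown below; the field names are assumptions, not the project's actual definitions.

```cpp
#include <vector>
#include <windows.h>

// Illustrative shape of the CPU-side gesture data described above.
struct GestureRule {
    int trackerA = 0, trackerB = 0;  // which markers the rule compares
    int ruleType = 0;                // e.g. "A above B", "A in grid square N"
};

struct GestureInput {
    std::vector<GestureRule> rules;  // every rule must hold for detection
    bool detectedLastFrame = false;  // previous state, used to press/release input
    WORD virtualKey = 0;             // keyboard/mouse input to send via SendInput
};
```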
The Gesture
class receives the GestureInput
and whether the gesture was detected this frame. It uses the Windows API SendInput
function to send a direct input to whatever window has focus. If this happens to be a game, the result is that a gesture controls the game just like a direct keyboard input would.