
Air-Ring Gesture Recognition: Improving Workflow with Shortcut Keys


Introduction

Non-contact control is a current trend, offering more flexible and convenient options than traditional buttons and touch screens. While computer vision (CV) is widely used, it has privacy and camera-angle limitations. Our research shows that a nine-axis sensor (LSM9DS1) can match CV's accuracy for gesture recognition. This sensor combines an accelerometer, gyroscope, and magnetometer, capturing distinct patterns for each gesture. We created an Air-Ring prototype by mounting the nine-axis sensor on a finger, defined eight easy gestures (U, D, L, R, O, V, Z, N), and trained a Convolutional Neural Network (CNN) with over 5000 samples. With PyAutoGUI, we achieved real-time gesture recognition for keyboard control, surpassing 95% accuracy and recognizing 30 gestures within 60 seconds.

System Overview

To achieve high flexibility, we created a simple frontend website that lets users configure their own shortcut keys. The complete system flow is depicted in the following figure.


Fig. 1. System Flowchart

First, users configure their preferences on the frontend website, mapping our defined gestures to specific shortcut keys and generating a configuration .json file. They then place the generated file in the directory where the Python program runs. The nine-axis sensor (IMU) is connected to an Arduino UNO, which reads its data through the sensor library, and PySerial establishes the connection between Python and the Arduino. The trained CNN model then produces recognition results, which are mapped to the configured shortcut keys and sent to PyAutoGUI for gesture-to-keyboard control.
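As a rough illustration of this flow, the Python sketch below loads a hypothetical configuration file, reads IMU samples over PySerial, and fires the mapped shortcut with PyAutoGUI. The file name, serial port, baud rate, line format, and config schema are assumptions for illustration, not the project's actual code.

```python
# Minimal sketch of the glue described above (all names and values are assumptions).
import json
import serial          # PySerial: reads the IMU samples forwarded by the Arduino UNO
import pyautogui       # sends the configured keyboard shortcuts

# Hypothetical config.json produced by the frontend website, e.g.
# {"U": ["command", "shift", "e"], "V": ["command", "shift", "s"], ...}
with open("config.json") as f:
    gesture_to_shortcut = json.load(f)

ser = serial.Serial("/dev/ttyUSB0", 115200)  # port and baud rate are assumptions

def read_sample(ser):
    """Parse one 'ax,ay,az' line sent by the Arduino into three floats."""
    line = ser.readline().decode(errors="ignore").strip()
    ax, ay, az = (float(v) for v in line.split(","))
    return ax, ay, az

def execute(gesture):
    """Map a recognized gesture label to its configured shortcut."""
    keys = gesture_to_shortcut.get(gesture)
    if keys:
        pyautogui.hotkey(*keys)
```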

Data Collection

We attached the nine-axis sensor to a finger using breathable medical tape and silicone rings to collect the three-axis acceleration generated when the finger moves. We sampled the acceleration at 60 Hz, collecting 150-point time-domain signals (about 2-3 seconds each). There are 8 different hand gestures: up, down, left, right, N, Z, V, and O. To simulate practical use, the system must classify all non-gesture signals, such as resting the hand, using the keyboard or touchpad, and grabbing objects, as noise, so we added a 9th category: Noise.

Initially, we experimented with six axes of the nine-axis sensor, but because the magnetometer samples at a different rate than the other two sensors and the gyroscope on the purchased sensor had failed, we ultimately used only the three-axis accelerometer for data collection and recognition.


Fig. 2. Wearable Device Illustration
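The sketch below illustrates how one 150-point, 3-axis acceleration sample could be recorded from the serial stream and stored as .npz. It reuses the hypothetical read_sample helper from the sketch above, and the file layout is an assumption rather than the repository's actual format.

```python
# Sketch of recording one training sample (shapes and file names are assumptions).
import numpy as np

def record_sample(ser, n_points=150):
    """Collect n_points of (ax, ay, az) from the ~60 Hz serial stream."""
    return np.array([read_sample(ser) for _ in range(n_points)])  # shape (150, 3)

# e.g. after recording many samples of the "U" gesture:
# np.savez("data/U.npz", samples=np.stack(u_samples))
```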

Furthermore, to reduce variations in training data produced by different users, we defined clear gestures for each category. Users can imagine a virtual 3x3 grid in front of them and move their fingers to the corresponding positions in sequence. The defined gestures and execution steps are shown in the figure (Fig. 3), and the collected data types are shown in Fig. 4.


Fig. 3. Gesture Execution Steps


Fig. 4. Data Sample
(L: Up gesture, R: Right gesture)

To avoid overfitting and increase the diversity of the training data, we perform data augmentation after collection. To each gesture training sample, we add randomly generated Gaussian noise with a certain weight. For the noise category, we use randomly generated Gaussian noise to represent a stationary state (Fig. 5).

When running the real-time recognition algorithm, the model continuously classifies the incoming signal, which can lead to incorrect judgments when only the first part of a gesture has been captured. Therefore, we randomly take 50 points from the collected noise data (Fig. 5, red part) and combine them with the first 30 points of a pre-processed gesture sample (Fig. 5, blue part), creating new noise data.
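The NumPy sketch below shows one way these augmentation steps could be implemented, under the assumption that raw samples are (150, 3) arrays, that the mixed noise samples are formed by concatenating 50 noise points followed by the first 30 points of a pre-processed gesture (yielding an 80-point sample), and with an illustrative noise weight.

```python
# Sketch of the augmentation described above (shapes and weights are assumptions).
import numpy as np

def augment_gesture(sample, noise_weight=0.05):
    """Add randomly weighted Gaussian noise to a gesture sample of shape (150, 3)."""
    return sample + noise_weight * np.random.randn(*sample.shape)

def make_stationary_noise(n_points=150):
    """Pure Gaussian noise representing a stationary (no-gesture) state."""
    return np.random.randn(n_points, 3)

def make_partial_gesture_noise(noise_data, gesture_sample):
    """Concatenate 50 random noise points with the first 30 points of a gesture,
    so windows that catch only the start of a gesture are labeled as noise."""
    start = np.random.randint(0, len(noise_data) - 50)
    return np.concatenate([noise_data[start:start + 50], gesture_sample[:30]], axis=0)
```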

We collected approximately 5700 training samples in total, with about 450 samples per gesture and about 2000 noise samples. See the data and imgs folders for the raw data (.npz) and visualized images.


Fig. 5. Data Augmentation
(L: Noise mixed with data, R: Gaussian distribution noise)

Preprocessing

During data collection, we set the signal length to 150 points to encompass hand gesture signals. However, to reduce dimensionality and find meaningful signal segments, we use a sliding window of 80 points to identify the highest-energy window, after subtracting the signal's mean.

E_i = \sum_{k = i}^{i + 79} p_k^2

where p_k is the mean-subtracted signal value and i is the starting index of the 80-point window.

As shown in Fig. 6 with the orange signal and green dots, this approach successfully identifies the segment with the most significant signal variation. This means that we have effectively reduced the signal length from 150 points to 80 points before inputting it into the model. This shorter signal can be used directly for model input or undergo one-dimensional wavelet transformation. Experimental results show similar performance for both approaches, and we select the unprocessed time-domain signal for model input.


Fig. 6. Highest-Energy Window Selection
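A minimal sketch of this window selection, assuming a (150, 3) input array and summing the squared, mean-subtracted values of all three axes as the energy:

```python
# Sketch of the highest-energy window selection (array shapes are assumptions).
import numpy as np

def best_window(signal, window=80):
    """Return the 80-point segment of a (150, 3) signal with the highest energy,
    after removing the per-axis mean."""
    centered = signal - signal.mean(axis=0)
    energy = (centered ** 2).sum(axis=1)                                # per-point energy over x, y, z
    window_energy = np.convolve(energy, np.ones(window), mode="valid")  # energy of each candidate window
    start = int(np.argmax(window_energy))
    return centered[start:start + window]                               # shape (80, 3)
```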

Model

The CNN model excels at extracting local features with its filters, which has made it successful in a wide range of tasks. In this experiment, the CNN model first applies a 1x16 filter to the acceleration along each axis, generating 96 feature maps. These feature maps are flattened, and several linear layers perform the classification. The model's output is evaluated with the cross-entropy loss function, and the Adam optimizer adjusts the parameters over ten training epochs. To prevent overfitting, we augment the training data, keep the number of model parameters small, and use a Dropout layer. The model architecture is shown in Fig. 7.


Fig. 7. Model Architecture
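The PyTorch sketch below is consistent with this description but is not the project's exact architecture; the hidden-layer width, dropout rate, and the choice of a single Conv1d over all three axes are assumptions.

```python
# Minimal CNN sketch for 3-axis, 80-point inputs (layer sizes are assumptions).
import torch
import torch.nn as nn

class GestureCNN(nn.Module):
    def __init__(self, n_classes=9):
        super().__init__()
        # one 1-D convolution with kernel size 16 over the (3, 80) input,
        # producing 96 feature maps
        self.conv = nn.Conv1d(in_channels=3, out_channels=96, kernel_size=16)
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(96 * (80 - 16 + 1), 64),   # 65 positions per feature map
            nn.ReLU(),
            nn.Dropout(0.5),                     # Dropout layer against overfitting
            nn.Linear(64, n_classes),
        )

    def forward(self, x):                        # x: (batch, 3, 80)
        return self.classifier(torch.relu(self.conv(x)))

model = GestureCNN()
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters())
# training loop over ten epochs omitted
```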

The CNN model achieves an accuracy of approximately 97% on test data. From the confusion matrix in Fig. 8, it is evident that the model exhibits high accuracy for each class. However, it occasionally confuses left and right, as well as Z and N gestures, possibly due to the similarities in the signals, which could be attributed to variations in finger force among different users.


Fig. 8. Confusion Matrix

Real-time Algorithm

We continuously store the most recent 150 data points and, after each new data point arrives, identify the contiguous 80-point segment with the highest energy within this 150-point window. These 80 points are then fed into the model to obtain a classification result. If the result is not categorized as noise, we execute the corresponding keyboard shortcut.

Real-time Data Collection:

Three constantly updating queues collect the latest 150 data points of x-, y-, and z-axis acceleration (each spanning approximately 2.5 seconds).


Fig. 9. Queue

Process:

  1. Find the contiguous segment of 80 data points with the highest energy from the 150-point queue (this process aligns with the operations performed during the training and testing data collection for the model, as mentioned above).

  2. If any of the three 80-point contiguous segments for the x-, y-, and z-axis data is among the last 5 contiguous segments in the queue, we treat it as a potentially incomplete hand gesture, do not proceed further, and classify it as noise. (Note: The 150-point queue offers 71 possible 80-point contiguous segments, since 150 - 80 + 1 = 71. The "last 5 contiguous segments" are the last 5 of these 71 possible positions. The choice of 5 is a trade-off between accuracy and latency.)

  3. If the energy of all three 80-point contiguous segments does not exceed the set threshold, we also do not proceed further and classify it as noise.

  4. Feed these 80 data points into the model to obtain a classification result (which can be a specific hand gesture or noise).

  5. If the result is not categorized as noise, execute the corresponding predefined keyboard shortcut settings and clear the queue to avoid redundant execution of the same shortcut settings.
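Putting steps 1-5 together, a simplified sketch of the loop might look as follows. For brevity it uses a single joint-energy window over the three axes instead of the per-axis windows described in step 2, and the energy threshold, label order, and helper functions (the window logic and execute) are assumptions carried over from the earlier sketches.

```python
# Simplified real-time loop (threshold, labels, and helpers are assumptions).
from collections import deque
import numpy as np
import torch

QUEUE_LEN, WINDOW, LAST_K = 150, 80, 5
ENERGY_THRESHOLD = 1.0            # illustrative value, not the project's setting
LABELS = ["U", "D", "L", "R", "O", "V", "Z", "N", "Noise"]  # assumed label order

queue = deque(maxlen=QUEUE_LEN)   # most recent 150 (ax, ay, az) samples

def step(new_point, model):
    queue.append(new_point)
    if len(queue) < QUEUE_LEN:
        return
    signal = np.array(queue)                                            # (150, 3)
    centered = signal - signal.mean(axis=0)
    energy = (centered ** 2).sum(axis=1)
    window_energy = np.convolve(energy, np.ones(WINDOW), mode="valid")  # 71 candidate windows
    start = int(np.argmax(window_energy))
    # Step 2: a best window ending near the newest samples may be an incomplete gesture
    if start >= len(window_energy) - LAST_K:
        return
    # Step 3: too little energy means no gesture occurred
    if window_energy[start] < ENERGY_THRESHOLD:
        return
    # Step 4: classify the 80-point segment
    x = torch.tensor(centered[start:start + WINDOW].T, dtype=torch.float32).unsqueeze(0)  # (1, 3, 80)
    pred = LABELS[int(model(x).argmax(dim=1))]
    # Step 5: run the configured shortcut and clear the queue to avoid repeated triggers
    if pred != "Noise":
        execute(pred)             # from the earlier PyAutoGUI sketch
        queue.clear()
```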

Evaluation

In our presentation of results, we use two distinct approaches: assessing the speed of hand gesture recognition and simulating real-world usage. These results are showcased in Part 2 and Part 3 of the video presentation.

Speed Testing

For speed evaluation, we used a typing website with customized settings to measure the time needed to correctly recognize 30 hand gestures, mapping the gestures Up, Down, Left, Right, O, V, Z, and N to the letters U, D, L, R, O, V, Z, and N. Please refer to Fig. 10 for the interface.


Fig. 10. monkeytype

On average, it takes around 60 seconds to correctly recognize 30 hand gestures, or roughly two seconds per recognition. This speed is largely determined by the data collection approach: the training data was sampled at around 60 Hz with 150 points (2-3 seconds) per gesture, so the real-time system must acquire data in the same way, and deviating from this significantly reduces accuracy. This makes the limitation difficult to address through algorithm enhancements alone.

In addition, we measured the bottleneck time of the software alone, excluding hardware data collection. Data preprocessing and basic logic checks are essentially negligible; the primary delay comes from model recognition and the execution of PyAutoGUI keyboard shortcuts, with the PyAutoGUI call accounting for most of the latency. After exploring alternative modules for sending keyboard shortcuts, we found PyAutoGUI to offer relatively fast execution; other modules provide more advanced functionality but tend to be slower and do not fit our requirements. As a result, software speed optimization remains an ongoing challenge.

Simulated Real-world Usage

To demonstrate real-world usage, we utilized Adobe Photoshop software. In the second part of the video, we showcased the use of five sets of less commonly employed keyboard shortcuts. In practical use, our system effectively filters out common workplace noise, preventing false positives and consistently achieving good results in terms of speed and accuracy.

Program Execution Speed

| Operation | Time |
| :-: | :-: |
| Data Processing | <0.001s |
| Model Recognition (CNN Model) | 0.003 - 0.005s |
| Keyboard Shortcut (PyAutoGUI) | 0.1s |

Shortcuts Mapping

| Gesture | Shortcut |
| :-: | :-: |
| Gesture Up | command + shift + E |
| Gesture V | command + shift + S |
| Gesture O | command + shift + U |
| Gesture N | command + O |
| Gesture Z | command + U |

Conclusion

This project used a nine-axis sensor (LSM9DS1) connected to an Arduino UNO development board to create Air-Ring, a prototype gesture-controlled keyboard device. By collecting the acceleration signals generated during movement, preprocessing them, and feeding them into a trained CNN model for gesture recognition, we achieved an accuracy of over 95% with the CNN model. In terms of speed, we achieved continuous recognition of 30 correct gestures within 60 seconds.

Through our developed frontend website, users can easily establish the correspondence between shortcuts and the model's recognition results. This allows users to enhance efficiency and productivity in any software application by using gestures. Currently, the main bottleneck lies in the method of data collection, and improvements in this aspect could further enhance both speed and accuracy.

With further improvements in speed, accuracy, and wearing comfort, this device has the potential to offer users a convenient means of operation across various applications, making it a more versatile computer controller.
