JasonDCox / ML-Mentorship-GovSchool


Write Methods & Materials Section #35

Closed gavinjalberghini closed 2 years ago

gavinjalberghini commented 2 years ago

Due Feb 14th

Acceptance Criteria:

brandonC1234 commented 2 years ago

Written Methods and Materials:

Methods and Materials

Leveraging Convolutional Neural Networks

One of the main tools used to improve image classification algorithms, especially with animals, is the Convolutional Neural Network (Süzen, 2020). A Convolutional Neural Network (CNN) reduces image size while helping identify patterns in the image. For example, a CNN filter may look for a triangle pattern similar to the top of a dog's ear, producing an output with high values concentrated near the dog's ear in the given picture. Filters can also be stacked in layers that each look for different patterns; this identifies more features, making the input data more useful and improving accuracy (Süzen, 2020). This project will utilize CNNs prebuilt and pre-trained in the YOLOv4-tiny and ssd_mobilenet_v2 models (run on Tensorflow).
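The pattern-matching step described above can be sketched with a plain convolution, the core operation a CNN layer performs. This is a toy illustration with hypothetical values, not the project's actual model: the 3x3 kernel responds strongly wherever the image contains a diagonal line of bright pixels, analogous to a trained filter responding to the shape of a dog's ear.

```python
# Toy sketch of a single CNN filter pass (hypothetical values).
# The kernel responds with a high value where the image matches its pattern.

def convolve2d(image, kernel):
    """Valid-mode 2D convolution (no padding, stride 1)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    out = [[0.0] * out_w for _ in range(out_h)]
    for i in range(out_h):
        for j in range(out_w):
            out[i][j] = sum(
                image[i + a][j + b] * kernel[a][b]
                for a in range(kh) for b in range(kw)
            )
    return out

# A tiny image containing a diagonal line of bright pixels.
image = [
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 1],
]
# A diagonal-detecting kernel: positive along the diagonal, negative elsewhere.
kernel = [
    [1, -1, -1],
    [-1, 1, -1],
    [-1, -1, 1],
]
response = convolve2d(image, kernel)
# response has high values (3) where the diagonal pattern sits,
# and negative values where it does not.
```

Stacking many such filters, each tuned to a different pattern, is what gives a CNN layer its feature-detecting power.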

While Convolutional Neural Networks greatly increase the performance of computer vision algorithms, their many layers can also increase processing requirements. This is why it is important to choose an optimal resolution and layer count for the intended task. Reducing the resolution of each layer can greatly improve speed because it reduces the total amount of data the neural network must analyze (Mattal, 2018). This may reduce the accuracy of the algorithm because some data is lost, but the effect is rarely large. The other primary approach to speeding up a CNN is reducing the number of layers, which, like reducing resolution, improves speed while potentially decreasing accuracy (Mattal, 2018).
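The effect of resolution on compute can be seen with back-of-the-envelope arithmetic. This sketch uses hypothetical layer sizes (a 416-pixel input is a common YOLO choice, but the exact numbers here are illustrative): halving the input resolution roughly quarters the multiply-accumulate operations in a convolutional layer.

```python
# Back-of-the-envelope sketch (hypothetical layer sizes): halving input
# resolution roughly quarters the multiply-accumulate (MAC) work per layer.

def conv_layer_macs(height, width, in_channels, out_channels, kernel=3):
    # One MAC per kernel weight, per input channel, per output pixel and
    # output channel (same-padded, stride 1).
    return height * width * in_channels * out_channels * kernel * kernel

full = conv_layer_macs(416, 416, 3, 16)   # full-resolution first layer
half = conv_layer_macs(208, 208, 3, 16)   # same layer at half resolution

print(full // half)  # → 4: a 4x reduction in compute for this layer
```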

YOLO

Most work with object detection and CNNs uses classifiers to perform detection: regions of interest in an image are selected, and a convolutional neural network is then run on each region. This generally results in a slow algorithm because predictions sometimes need to be run thousands of times on a single image. YOLO, or "You Only Look Once," instead frames detection as a regression problem: the algorithm predicts bounding boxes and class probabilities for the whole image in a single pass. Because of this, YOLO is extremely fast while maintaining comparable, if not superior, accuracy to other implementations. This speed is very important for our application, since we are attempting to create a camera that must make decisions in real time as a pet approaches the door. Additionally, implementing an algorithm on the NVIDIA Jetson Nano, with its limited processing power, requires a fast algorithm that can run on the small computer. The single-prediction approach also allows each prediction to be based on the global context of the image rather than isolated points of interest.

In order to use YOLO for pet detection, we require pre-trained weights obtained by running pictures and their corresponding annotations through a training command. The data must first be obtained in the right format. A collection of 2,576 photographs of pets annotated by species or breed, from the University of Oxford, was downloaded in the darknet TXT format to serve as a baseline dataset. Additional photos were added to designate the specific pet that is to be recognized by the door. These photos were manually labeled using the LabelImg software, which allows us to draw a bounding box around an object and label it however we wish; this was used to mark exactly where the pet's face is in each photo.
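The darknet TXT format mentioned above stores one line per box: a class index followed by the box center, width, and height, all normalized to [0, 1] by the image dimensions. A small sketch of the conversion from pixel coordinates (the box values and file names here are hypothetical):

```python
# Sketch of producing a darknet TXT label line from a pixel bounding box.
# Format: "<class> <x_center> <y_center> <width> <height>", normalized to [0, 1].
# The example box and image size below are hypothetical.

def to_darknet_line(class_id, box, img_w, img_h):
    """box is (x_min, y_min, x_max, y_max) in pixels."""
    x_min, y_min, x_max, y_max = box
    x_c = (x_min + x_max) / 2 / img_w
    y_c = (y_min + y_max) / 2 / img_h
    w = (x_max - x_min) / img_w
    h = (y_max - y_min) / img_h
    return f"{class_id} {x_c:.6f} {y_c:.6f} {w:.6f} {h:.6f}"

# A 100x50-pixel face box centered in a 640x480 photo, class 0 (our pet).
line = to_darknet_line(0, (270, 215, 370, 265), 640, 480)
```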

After uploading these photos and their labels to the device, a config file was downloaded and updated with information such as the image width and height, batch size, step count, and number of classes. Next, additional files were created that the program uses to determine which of the uploaded files belong to the training and testing portions of the algorithm. This is done through a Python preprocessing script that randomly copies the name of each file into a text document labeled either "train.txt" or "test.txt". When the training program runs, it repeatedly evaluates the files named in the training file and updates the weights accordingly. After training is complete, the files named in the test file are used to evaluate the performance of the newly created weights on fresh data. Additionally, files and directories were created to direct the training program to all of the information it needs, including the correct files to use, which classes are being looked for, and the correct classes within each of the photos.
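The split step described above can be sketched as a short script. This is a minimal illustrative version, not the project's exact script; the directory layout and 90/10 split fraction are assumptions.

```python
# Minimal sketch of the train/test split step: randomly assign each image
# path to train.txt or test.txt. Paths and split fraction are hypothetical.
import random

def split_dataset(image_paths, test_fraction=0.1, seed=42):
    paths = list(image_paths)
    random.Random(seed).shuffle(paths)        # reproducible shuffle
    n_test = int(len(paths) * test_fraction)
    return paths[n_test:], paths[:n_test]     # (train, test)

images = [f"data/obj/pet_{i:04d}.jpg" for i in range(100)]
train, test = split_dataset(images, test_fraction=0.1)

with open("train.txt", "w") as f:
    f.write("\n".join(train))
with open("test.txt", "w") as f:
    f.write("\n".join(test))
```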

With all of the data uploaded and relayed to the YOLOv4-tiny command, the pre-built training algorithm can begin to run. This program iterates over the files in the training set, adjusting the weights until it reaches the number of batches specified in the configuration file. It returns files containing the weights of the neural network at every thousandth iteration, as well as the weights it believes to be the best. However, these weights can sometimes become "over-trained": too good at detecting the features of the training data, and no longer applicable to other data that will be used in the future. To address this, the weights at each thousandth iteration were tested for mean average precision on the testing data and graphed on a line chart. If the line begins to fall after a certain point, this is evidence of over-training, and the peak weights will be used for the final algorithm instead.
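The checkpoint-selection logic amounts to picking the iteration where test-set mAP peaks. A minimal sketch with hypothetical mAP values:

```python
# Sketch of the weight-selection step: given mAP on held-out test data at
# each thousandth iteration (values hypothetical), pick the checkpoint
# where mAP peaks before it starts falling (a sign of over-training).

def best_checkpoint(map_by_iteration):
    """map_by_iteration: {iteration: mAP measured on the test set}."""
    return max(map_by_iteration, key=map_by_iteration.get)

map_scores = {1000: 0.52, 2000: 0.71, 3000: 0.84, 4000: 0.81, 5000: 0.78}
peak = best_checkpoint(map_scores)
# mAP falls after iteration 3000, so the 3000-iteration weights are kept.
```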

The metrics of mean average precision, average intersection over union, F1 score, and framerate were calculated on a uniform dataset including one unique pet, allowing a comparison between the alternative methods and, finally, selection of the most effective algorithm for the use case of opening a pet door.
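Two of the metrics named above are simple enough to define directly. Intersection over union (IoU) measures how well a predicted box overlaps a ground-truth box, and F1 combines precision and recall. The boxes and counts in this sketch are hypothetical:

```python
# Sketch of two evaluation metrics used above. Boxes are
# (x_min, y_min, x_max, y_max); all example values are hypothetical.

def iou(a, b):
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall from detection counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

overlap = iou((0, 0, 10, 10), (5, 5, 15, 15))   # 25 / 175
score = f1_score(tp=80, fp=10, fn=20)            # 16 / 19
```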

Tensorflow

One would be correct to assume that object detection algorithms sit on the more complex side of machine learning, but with Google's Tensorflow API and Python module, creating a neural network for object detection can be done in a few lines of code. Google also provides pre-trained, state-of-the-art models of varying speed and accuracy in the Tensorflow model garden. These pretrained models make it much easier to create and train complex object detection models, but they have the downside of being less customizable and optimizable than a model built from scratch. This project will employ the ssd_mobilenet_v2 model on Tensorflow along with its optimization modules. SSD stands for single-shot detector: the model takes one pass over the image to detect objects, making it optimal for speed and for algorithms on mobile devices, and despite taking only one look at the image, ssd_mobilenet_v2 still performs with rather high accuracy.
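The Tensorflow Object Detection API configures such a model through a `pipeline.config` file in protobuf text format. An illustrative fragment is shown below; all values (class count, resolution, steps, checkpoint path) are hypothetical and would be tailored to the pet-door dataset:

```
model {
  ssd {
    num_classes: 1                 # one class: the specific pet
    image_resizer {
      fixed_shape_resizer { height: 300 width: 300 }
    }
  }
}
train_config {
  batch_size: 8
  num_steps: 5000
  fine_tune_checkpoint: "ssd_mobilenet_v2/checkpoint/ckpt-0"
}
```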

While Tensorflow makes machine learning development much easier, the models are still very demanding to run and consume a lot of memory. To help with this, Tensorflow integrates with optimization tools such as NVIDIA's TensorRT, which can greatly reduce the computational load of a model while maintaining almost all of its accuracy. Tensorflow offers an easy-to-use interface with a multitude of models to choose from, along with seemingly endless support resources, which is why it will be employed for this project.

TensorRT

Because Tensorflow models consume a very large amount of memory and resources, optimization modules are required to simplify them enough to run on smaller devices. One of these is TensorRT. TensorRT simplifies a model somewhat, but focuses primarily on making it run as efficiently as possible on NVIDIA GPUs, which is why it is so important for the Jetson.

TensorRT applies multiple optimization techniques to the model. It starts by reducing the precision of the numbers within the model, for example converting 32-bit floating-point (FP32) numbers to 16-bit (FP16). This makes the model cheaper to run while, in theory, sacrificing only a small percentage of performance. It then applies several further techniques, including layer and tensor fusion, kernel auto-tuning, and dynamic tensor memory, to make the model use GPU memory more efficiently and run faster on the GPU.
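The FP32-to-FP16 idea can be demonstrated with the standard library alone (this is a toy illustration of precision reduction, not TensorRT itself; the weight value is hypothetical). Storing a value in 16 bits halves its memory footprint while keeping it close to the original:

```python
# Toy sketch of precision reduction: round-trip a value through 16-bit
# floating-point storage ('e' is struct's binary16 format code).
# The weight value is hypothetical.
import struct

def to_fp16_and_back(x):
    """Store a float in 16 bits, then read it back."""
    return struct.unpack("e", struct.pack("e", x))[0]

weight = 0.123456789
fp16_weight = to_fp16_and_back(weight)
error = abs(weight - fp16_weight)
# The value survives with only a tiny error, in half the memory.
```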

TFLite

Similar to TensorRT, TFLite is a Tensorflow model optimization module; however, it optimizes the model to be as simple as possible to run on a single CPU core, which is why it is much more common on systems such as the Raspberry Pi.

With the pure goal of reducing the size and load of the model, TFLite employs three primary optimization techniques. It first reduces the precision of the numbers within the model (e.g., going from FP32 to FP16). It then prunes the model by removing minor parameters, resulting in more efficient model compression. Lastly, to further simplify each layer, it clusters the layer's weights and replaces each weight with the center of its cluster, reducing the number of unique values the layer must store and thus the model's complexity.
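The clustering step can be sketched in miniature. A real TFLite workflow uses Tensorflow's model optimization tooling; this toy version (with hypothetical weights) simply snaps each weight to the nearest of a few evenly spaced centers, showing how clustering collapses many distinct weights into a handful of shared values:

```python
# Toy sketch of weight clustering (hypothetical weights): replace each
# weight with its nearest cluster center so the layer stores only a few
# unique values. Real pipelines learn the centers; here they are evenly
# spaced across the weight range for simplicity.

def cluster_weights(weights, k=4):
    lo, hi = min(weights), max(weights)
    step = (hi - lo) / (k - 1)
    centers = [lo + i * step for i in range(k)]
    return [min(centers, key=lambda c: abs(c - w)) for w in weights]

weights = [0.11, 0.13, 0.48, 0.52, 0.90, 0.88, 0.12, 0.51]
clustered = cluster_weights(weights, k=4)
unique_values = sorted(set(clustered))
# Eight distinct weights collapse to at most four shared values,
# which compress far better than the originals.
```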