

SAVI - Practical Assignment 2

Sistemas Avançados de Visualização Industrial (SAVI) - Group 3 - Universidade de Aveiro - 2023/24

Table of Contents

- Introduction
- Datasets Used
- Libraries Used
- Installation
- Code Explanation
- Results
- Authors

Introduction

In this assignment, a point-cloud-based model was created and trained to classify the objects present in different scenes. The program pre-processes each scene to isolate every object and extract its properties, feeds them to the model, and announces the predicted object and its characteristics through a text-to-speech library. Furthermore, the model was also applied to a real-time system using an RGB-D camera.


Datasets Used

To train the aforementioned classifier, the Washington RGB-D Object Dataset was used.

To develop this project, a dataset splitter was used to divide the dataset files into training, validation, and testing sets. To prevent the model predictions from becoming biased, the objects used for testing were selected manually, since all files within a given dataset folder correspond to the same physical object. This division can be found in the Dataset Splitter used.
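The exact splitting logic lives in the linked Dataset Splitter; the snippet below is only a minimal sketch of the idea, assuming one folder per object instance and using placeholder folder names, file extensions and split ratios:

```python3
import random
from pathlib import Path

# Hypothetical dataset root; the real paths come from the linked Dataset Splitter
dataset_root = Path('rgbd-dataset')

# Folders reserved for testing are picked by hand so that every test object
# instance is completely unseen during training (each folder holds one object)
test_folders = {'bowl_1', 'cap_1', 'cereal_box_1', 'coffee_mug_1', 'soda_can_1'}

train_files, val_files, test_files = [], [], []
for folder in dataset_root.iterdir():
    if not folder.is_dir():
        continue
    files = sorted(folder.glob('*.pcd'))
    if folder.name in test_folders:
        test_files += files
    else:
        random.shuffle(files)
        split = int(0.7 * len(files))  # e.g. 70% train / 30% validation
        train_files += files[:split]
        val_files += files[split:]
```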


Libraries Used

To run the program and scripts in this repository, the following libraries need to be installed beforehand:


Installation

To ensure the program runs as intended, the steps presented below should be followed.

  1. Clone the repository:

     ```bash
     git clone https://github.com/Goncalo287/savi_t2/
     ```
  2. Change into the project directory:

     ```bash
     cd savi_t2
     ```
  3. Run the program:

     ```bash
     ./main.py
     ```

Code Explanation

Training the Model

To train the model with point cloud information, a [PointNet](http://stanford.edu/~rqi/pointnet/) architecture was used. It consumes an entire point cloud, learns a spatial encoding of each point, aggregates the learned encodings into global features and feeds them into a classifier. One advantage of this architecture is that it learns a global representation of the input, making the results invariant to the ordering of the points and largely robust to the orientation of the point cloud. The network consists of several shared MLPs (1D convolutions), from which critical points are extracted using a max pooling function. These critical points (outputs) are fed into a classifier that predicts each object class. Additional detailed information about this architecture can be found in ["An Intuitive Introduction to Point Net"](https://medium.com/@itberrios6/introduction-to-point-net-d23f43aa87d2).

To optimize the classifier parameters, a PointNetLoss function was implemented, using the [Negative Log Likelihood Loss (NLLLoss)](https://pytorch.org/docs/stable/generated/torch.nn.NLLLoss.html) criterion to refine the model parameters during the training phase and improve validation results. To prevent overfitting, the model was only saved when the validation error reached a new minimum during training.

```python3
def pointnetloss(outputs, labels, m3x3, m64x64, alpha=0.0001):
    criterion = torch.nn.NLLLoss()
```
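The repository snippet above is truncated. A fuller sketch of a loss of this kind, following the tutorial linked above, combines the NLL criterion with a regularization term that pushes the learned 3x3 and 64x64 feature transforms towards orthogonality; the exact weighting and variable names used in this project are assumptions here:

```python3
import torch

def pointnetloss(outputs, labels, m3x3, m64x64, alpha=0.0001):
    criterion = torch.nn.NLLLoss()
    bs = outputs.size(0)
    # Identity matrices used as the orthogonality target for both T-Nets
    id3x3 = torch.eye(3, device=outputs.device).repeat(bs, 1, 1)
    id64x64 = torch.eye(64, device=outputs.device).repeat(bs, 1, 1)
    diff3x3 = id3x3 - torch.bmm(m3x3, m3x3.transpose(1, 2))
    diff64x64 = id64x64 - torch.bmm(m64x64, m64x64.transpose(1, 2))
    # Classification loss plus a small penalty on non-orthogonal transforms
    return criterion(outputs, labels) + alpha * (
        torch.norm(diff3x3) + torch.norm(diff64x64)) / float(bs)
```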
Scene Preprocessing

To feed the classifier mentioned earlier, it is necessary to isolate the objects present in each scene. For this purpose, a script based on [Open3D](https://www.open3d.org/docs/release/) was developed to achieve the desired outcome for all scenes in an automated manner. Initially, the script detects the table, which consists solely of horizontal points. Subsequently, all points above the table, representing the objects, are retrieved. Finally, these points are grouped into clusters, where each cluster represents one object.

```python3
cluster_idxs = list(all_objects.cluster_dbscan(eps=0.031, min_points=70, print_progress=True))
obj_idxs = list(set(cluster_idxs))
obj_idxs.remove(-1)  # discard the noise label assigned by DBSCAN
```

Additionally, properties of the objects are extracted, including color and height. These properties, along with the number and type of objects, are reported to the user through a text-to-speech script. Simultaneously, using [threading](https://docs.python.org/3/library/threading.html), a new window displays the objects and their respective data.
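The full preprocessing script handles every scene automatically; the fragment below is only a simplified sketch of the table/object separation step, assuming a scene point cloud loaded from a placeholder file and hand-tuned RANSAC thresholds (the real script additionally keeps only the points above the detected table):

```python3
import open3d as o3d

# Hypothetical scene file; the real scenes come from the Washington RGB-D data
scene = o3d.io.read_point_cloud('scene.pcd')

# Fit the dominant plane (the table top) with RANSAC
plane_model, inliers = scene.segment_plane(distance_threshold=0.01,
                                           ransac_n=3,
                                           num_iterations=1000)

# Everything that is not part of the table plane is kept as candidate objects
all_objects = scene.select_by_index(inliers, invert=True)

# Cluster the remaining points; each cluster index corresponds to one object
cluster_idxs = list(all_objects.cluster_dbscan(eps=0.031, min_points=70))
objects = [all_objects.select_by_index(
               [i for i, c in enumerate(cluster_idxs) if c == obj_idx])
           for obj_idx in set(cluster_idxs) if obj_idx != -1]
```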
Real-Time System

This part of the program uses the color and depth images from a connected RGB-D camera (Astra Pro NL). The depth image is obtained using `openni2` and displayed next to the color image in an `opencv` window. Here, the user can point the camera at the desired location and see the resulting images. When satisfied with the current images, the user can press Enter to confirm and exit the capture loop. A point cloud is then generated from the captured images using `open3d` and used as a scene in which objects are detected and classified.
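The conversion from the captured frames to a point cloud is not shown in detail above; below is a minimal sketch using `open3d`, assuming the color and depth frames are already available as NumPy arrays (uint8 and uint16, respectively) and that the intrinsics are placeholders for the actual Astra Pro calibration:

```python3
import numpy as np
import open3d as o3d

def images_to_pointcloud(color_np, depth_np):
    """Build an Open3D point cloud from aligned color and depth frames."""
    color = o3d.geometry.Image(np.ascontiguousarray(color_np))
    depth = o3d.geometry.Image(np.ascontiguousarray(depth_np))
    rgbd = o3d.geometry.RGBDImage.create_from_color_and_depth(
        color, depth, depth_scale=1000.0, convert_rgb_to_intensity=False)
    # Placeholder intrinsics (width, height, fx, fy, cx, cy); use the real calibration
    intrinsics = o3d.camera.PinholeCameraIntrinsic(640, 480, 570.3, 570.3, 320.0, 240.0)
    return o3d.geometry.PointCloud.create_from_rgbd_image(rgbd, intrinsics)
```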

Results

Training Parameters and Resulting Graph

As can be observed from the graph in Figure 1, there is convergence in loss for both training and validation. The models were saved at the points of minimum validation loss to prevent overfitting, as mentioned earlier.

Before starting the training, the following parameters were considered based on several research articles:

| Parameters | Value |
| :---: | :---: |
| Epochs | 15 |
| Training Files | 9000 |
| Validation Files | 3600 |
| Training Batch Size | 32 |
| Validation Batch Size | 64 |

The best model resulting from the training was from epoch 13, with a validation accuracy of 98%. However, after some testing, the model from epoch 8 proved to be the best at classifying objects in the scenes.


Figure 1 - Training and validation loss over 15 epochs.
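The checkpointing rule described above (saving the model only when the validation loss reaches a new minimum) can be sketched as follows; the helper functions and file name are illustrative, not the repository's actual ones:

```python3
import torch

def train(model, train_loader, val_loader, optimizer, epochs,
          train_one_epoch, evaluate):
    """Train and keep only the weights with the lowest validation loss so far."""
    best_val_loss = float('inf')
    for epoch in range(epochs):
        train_one_epoch(model, train_loader, optimizer)   # hypothetical helper
        val_loss = evaluate(model, val_loader)             # hypothetical helper
        if val_loss < best_val_loss:
            best_val_loss = val_loss
            torch.save(model.state_dict(), f'best_model_epoch{epoch}.pth')
```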

Global and Class Metrics

To evaluate the performance of the generated model, a test dataset with 4136 files was created and fed to the model. To assess the model's quality, [performance metrics](https://towardsdatascience.com/a-look-at-precision-recall-and-f1-score-36b5fd0dd3ec) were calculated, yielding the following values for the model from epoch 8:

| Metrics | Value |
| :---: | :---: |
| Macro-Averaging Precision | 94.6% |
| Macro-Averaging Recall | 95.5% |
| F1 Score | 94.9% |
| Class "bowl" Precision | 95.6% |
| Class "cap" Precision | 79.8% |
| Class "cereal box" Precision | 100.0% |
| Class "coffee mug" Precision | 100.0% |
| Class "soda can" Precision | 97.5% |

The global precision was calculated using 'macro' averaging, but the user can choose between ['macro' and 'micro' averaging](https://www.educative.io/answers/what-is-the-difference-between-micro-and-macro-averaging) in the main menu. Furthermore, a [normalized confusion matrix](Results/Confusion_Matrix_Normalized_Epoch8.png) was created to help the user estimate the quality of the model at a glance.
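Metrics of this kind can be reproduced with standard tooling; the snippet below is a sketch using scikit-learn (the project's own metrics code may differ), showing how the averaging mode selected in the main menu changes the global scores:

```python3
from sklearn.metrics import precision_score, recall_score, f1_score

def report_metrics(y_true, y_pred, average='macro'):
    """Global metrics for the test set; 'average' can be 'macro' or 'micro'."""
    return {
        'precision': precision_score(y_true, y_pred, average=average),
        'recall': recall_score(y_true, y_pred, average=average),
        'f1': f1_score(y_true, y_pred, average=average),
    }

# Example with dummy labels (0 = bowl, 1 = cap, 2 = cereal box, ...)
print(report_metrics([0, 1, 2, 2, 1], [0, 1, 2, 1, 1], average='macro'))
print(report_metrics([0, 1, 2, 2, 1], [0, 1, 2, 1, 1], average='micro'))
```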
Scene Objects Classification

After training the model and preprocessing the scene, each object can be passed through the model to obtain its predicted label. Finally, the results are shown in a results window so that the user can review all the information.


Figure 2 - Objects identified in the scene, predicted labels and respective properties.
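The per-object inference step is summarised above; the sketch below shows one way to classify a single segmented object, assuming the trained PointNet returns per-class log-probabilities together with its T-Net matrices, and that the sampling and normalisation choices only approximate the training pipeline:

```python3
import numpy as np
import torch

def classify_object(model, object_pcd, class_names, num_points=1024):
    """Predict a label for one segmented object point cloud."""
    points = np.asarray(object_pcd.points)
    # Randomly sample a fixed number of points and normalise to the unit sphere
    idx = np.random.choice(len(points), num_points, replace=len(points) < num_points)
    pts = points[idx] - points[idx].mean(axis=0)
    pts /= np.max(np.linalg.norm(pts, axis=1))
    tensor = torch.from_numpy(pts).float().unsqueeze(0).transpose(1, 2)  # (1, 3, N)
    model.eval()
    with torch.no_grad():
        outputs, _, _ = model(tensor)  # assumes the model also returns T-Net matrices
    return class_names[outputs.argmax(dim=1).item()]
```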


Authors

These are the contributors who made this project possible: