SAVI - Trabalho Prático 2
Sistemas Avançados de Visualização Industrial (SAVI) - Grupo 3 - Universidade de Aveiro - 2023/24
Table of Contents
Introduction
In this assignment, a point-cloud-based model was created and trained to identify the objects displayed in different scenes. The program pre-processes each scene to isolate every object and extract its properties, feeds them to the model, and narrates the predicted object and its characteristics through a text-to-speech library. Furthermore, the model was also applied to a real-time system using an RGB-D camera.
Datasets Used
To train the aforementioned classifier, the Washington RGB-D Object Dataset was used. Specifically, the following were used:
To develop this project, a dataset splitter was used to divide the dataset files into training, validation, and testing sets. To prevent the model predictions from becoming biased, the objects used for testing were selected manually, since each dataset folder contains views of the same object instance. This division can be found in the Dataset Splitter used.
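A minimal sketch of such a split (with hypothetical file names and a made-up validation fraction; the actual Dataset Splitter may differ) could look like this:

```python3
import random

def split_dataset(files, test_files, val_fraction=0.2, seed=42):
    """Split file paths into train/val/test, with test files chosen manually."""
    test_set = set(test_files)
    test = [f for f in files if f in test_set]          # manually selected objects
    remaining = [f for f in files if f not in test_set]
    random.Random(seed).shuffle(remaining)              # reproducible shuffle
    n_val = int(len(remaining) * val_fraction)
    return remaining[n_val:], remaining[:n_val], test   # train, val, test
```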
Libraries Used
To run the program and scripts presented in this repository, some libraries need to be installed beforehand. These are the following:
Installation
To ensure the program runs as intended, the steps presented below should be followed.
- Clone the repository:
git clone https://github.com/Goncalo287/savi_t2/
- Change into the project directory:
cd savi_t2
- Run the program:
./main.py
Code Explanation
Training the model
To train the model with point cloud data, a [PointNet](http://stanford.edu/~rqi/pointnet/) architecture was utilized. It consumes an entire point cloud, learns a spatial encoding of each point, aggregates the learned encodings into global features and feeds them into a classifier. One advantage of this architecture is that it learns a global representation of the input, making the results independent of the orientation of the point cloud. The network contains several shared MLPs (1D convolutions) from which critical points are extracted using a max pooling function. These critical points (outputs) are fed into a classifier that predicts each object class. More detailed information about this architecture can be found in ["An Intuitive Introduction to Point Net"](https://medium.com/@itberrios6/introduction-to-point-net-d23f43aa87d2).
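The core idea (shared 1D convolutions followed by a symmetric max pooling over points) can be sketched as follows. This is an illustrative simplification, not the full PointNet used in the project (it omits the input and feature transform networks):

```python3
import torch

class MiniPointNet(torch.nn.Module):
    """Shared MLP (1D convs) + max pool over points + classifier head."""
    def __init__(self, num_classes=5):
        super().__init__()
        self.shared_mlp = torch.nn.Sequential(
            torch.nn.Conv1d(3, 64, 1), torch.nn.ReLU(),
            torch.nn.Conv1d(64, 1024, 1), torch.nn.ReLU())
        self.classifier = torch.nn.Linear(1024, num_classes)

    def forward(self, x):                      # x: (batch, 3, num_points)
        feats = self.shared_mlp(x)             # per-point features (batch, 1024, n)
        global_feat = feats.max(dim=2).values  # symmetric: invariant to point order
        return torch.nn.functional.log_softmax(self.classifier(global_feat), dim=1)
```

Because the max pooling is a symmetric function, permuting the input points does not change the output, which is why the classifier does not depend on point ordering.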
To optimize the classifier parameters, a PointNetLoss function was implemented. In this function, the [Negative Log Likelihood Loss (NLLLoss)](https://pytorch.org/docs/stable/generated/torch.nn.NLLLoss.html) criterion was used to refine the model parameters during the training phase and improve validation results. To prevent overfitting, the model was only saved when the validation loss reached a new minimum during training.
```python3
def pointnetloss(outputs, labels, m3x3, m64x64, alpha=0.0001):
    criterion = torch.nn.NLLLoss()
```
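A complete version of this loss, following the standard PointNet formulation (NLL plus an orthogonality regularization on the learned 3x3 and 64x64 feature transforms), might look like this sketch:

```python3
import torch

def pointnetloss(outputs, labels, m3x3, m64x64, alpha=0.0001):
    criterion = torch.nn.NLLLoss()
    bs = outputs.size(0)
    # Regularize the feature transforms towards orthogonality: ||I - A @ A^T||
    id3x3 = torch.eye(3, device=outputs.device).repeat(bs, 1, 1)
    id64x64 = torch.eye(64, device=outputs.device).repeat(bs, 1, 1)
    diff3x3 = id3x3 - torch.bmm(m3x3, m3x3.transpose(1, 2))
    diff64x64 = id64x64 - torch.bmm(m64x64, m64x64.transpose(1, 2))
    reg = (torch.norm(diff3x3) + torch.norm(diff64x64)) / float(bs)
    return criterion(outputs, labels) + alpha * reg
```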
Scene Preprocessing
To feed the classifier mentioned earlier, it is necessary to isolate the objects present in each scene. For this purpose, a script based on [Open3D](https://www.open3d.org/docs/release/) was developed to achieve the desired outcome for all scenes in an automated manner. Initially, the script detects the table, which consists solely of horizontal points. Subsequently, all points above the table, representing the objects, are retrieved. Finally, the points are grouped into clusters, where each cluster represents an object.
```python3
cluster_idxs = list(all_objects.cluster_dbscan(eps=0.031, min_points=70, print_progress=True))
obj_idxs = list(set(cluster_idxs))
obj_idxs.remove(-1)  # discard the noise label (-1) assigned by DBSCAN
```
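The intermediate step of keeping only the points above the detected table can be sketched in plain NumPy; here `plane` is assumed to be the `(a, b, c, d)` model of the table plane with its normal pointing up:

```python3
import numpy as np

def points_above_plane(points, plane, min_dist=0.01):
    """Keep points whose signed distance to the plane ax+by+cz+d=0 exceeds min_dist."""
    a, b, c, d = plane
    normal = np.array([a, b, c])
    dist = (points @ normal + d) / np.linalg.norm(normal)  # signed distance per point
    return points[dist > min_dist]
```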
Additionally, properties of the objects are extracted, including color and height. These properties, along with the number and type of objects, are provided to the user through a text-to-speech script. Simultaneously, using [threading](https://docs.python.org/3/library/threading.html), a new window appears displaying the objects and their respective data.
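The color and height properties can be computed directly from each cluster's points and colors; a minimal sketch (a hypothetical helper, assuming z is the vertical axis and `table_z` the table height):

```python3
import numpy as np

def object_properties(points, colors, table_z=0.0):
    """Hypothetical helper: average RGB color and height above the table."""
    mean_color = colors.mean(axis=0)       # average (r, g, b) in [0, 1]
    height = points[:, 2].max() - table_z  # top of the object relative to the table
    return mean_color, height
```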
Real Time System
This part of the program uses the color and depth images from a connected RGBD camera (Astra Pro NL). The depth image is obtained using `openni2` and displayed next to the color image in an `opencv` window. Here, the user can point the camera at the desired location and see the resulting images.
When satisfied with the current images, the user can press Enter to confirm and exit the capture loop. A point cloud is then generated from the captured images using `open3d` and used as a scene in which to detect and classify objects.
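The back-projection that `open3d` performs when building the point cloud from a depth image can be sketched in NumPy, assuming pinhole camera intrinsics `fx, fy, cx, cy` and depth values in meters:

```python3
import numpy as np

def depth_to_points(depth, fx, fy, cx, cy):
    """Back-project a depth image (meters) to an (N, 3) array of camera-frame points."""
    v, u = np.indices(depth.shape)    # pixel row (v) and column (u) grids
    z = depth.ravel()
    x = (u.ravel() - cx) * z / fx     # pinhole model: x = (u - cx) * z / fx
    y = (v.ravel() - cy) * z / fy
    pts = np.stack([x, y, z], axis=1)
    return pts[z > 0]                 # discard invalid (zero-depth) pixels
```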
Results
Training Parameters and Resulting Graph
As can be observed from the graph in Figure 1, there is convergence in loss for both training and validation. The models were saved at the points of minimum validation loss to prevent overfitting, as mentioned earlier.
Before starting the training, the following parameters were considered based on several research articles:
| Parameters | Value |
| :---: | :---: |
| Epochs | 15 |
| Training Files | 9000 |
| Validation Files | 3600 |
| Training Batch Size | 32 |
| Validation Batch Size | 64 |
The best model resulting from the training was from epoch 13, with a validation accuracy of 98%. However, after some testing, it was found that the model from epoch 8 proved to be the best for classifying objects in the scenes.
Figure 1 - Training and Validation Loss Graph during 15 epochs.
Global and Class Metrics
To evaluate the performance of the model generated, a test dataset was created with 4136 files, to be fed to the model. To assess the model's quality, [performance metrics](https://towardsdatascience.com/a-look-at-precision-recall-and-f1-score-36b5fd0dd3ec) were calculated, yielding the following values for the model from epoch 8:
| Metrics | Value |
| :---: | :---: |
| Macro-Averaging Precision | 94.6% |
| Macro-Averaging Recall | 95.5% |
| F1 Score | 94.9% |
| Class "bowl" Precision | 95.6% |
| Class "cap" Precision | 79.8% |
| Class "cereal box" Precision | 100.0% |
| Class "coffee mug" Precision | 100.0% |
| Class "soda can" Precision | 97.5% |
The global precision was calculated using 'macro' averaging, but the user can choose between ['macro' and 'micro' averaging](https://www.educative.io/answers/what-is-the-difference-between-micro-and-macro-averaging) in the main menu. Furthermore, a [normalized confusion matrix](Results/Confusion_Matrix_Normalized_Epoch8.png) was created to help the user estimate the quality of the model in a faster manner.
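The difference between the two averaging modes can be illustrated with a small sketch over per-class true/false positive counts (made-up numbers, not the project's results): macro averaging weights every class equally, while micro averaging pools the counts and therefore favors frequent classes.

```python3
def macro_micro_precision(tp, fp):
    """tp, fp: dicts of per-class true positive and false positive counts."""
    per_class = {c: tp[c] / (tp[c] + fp[c]) for c in tp}
    macro = sum(per_class.values()) / len(per_class)                  # unweighted class mean
    micro = sum(tp.values()) / (sum(tp.values()) + sum(fp.values()))  # pooled counts
    return macro, micro
```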
Scene Objects Classification
After training the model and preprocessing the scene, each object can be passed through the model to output the predicted label. Finally, the result can be shown in a results window, so that the user can review all the information.
Figure 2 - Objects identified in the scene, predicted labels and respective properties.
Authors
These are the contributors who made this project possible: