drprojects / superpoint_transformer

Official PyTorch implementation of Superpoint Transformer introduced in [ICCV'23] "Efficient 3D Semantic Segmentation with Superpoint Transformer" and SuperCluster introduced in [3DV'24 Oral] "Scalable 3D Panoptic Segmentation As Superpoint Graph Clustering"
MIT License

Error During Data Pre-processing on Custom MLS Dataset #51

Closed xbais closed 8 months ago

xbais commented 8 months ago

Hello there @drprojects, @rjanvier, @loicland, @CharlesGaydon! It's very nice to see a well-documented, state-of-the-art architecture that is user-friendly to set up and run. Thanks for your work on the Superpoint Transformer.

We (@pyarelalchauhan, @xbais) are trying to train the architecture on a custom dataset collected in India. We have prepared the dataset as binary PLY files similar to those in the DALES Objects dataset (please see the header of one of our files attached below): image

We have generated the relevant configuration files and other Python files for our dataset, taking inspiration from the similar files provided for the DALES and S3DIS datasets in your repository. The changes we have made for our dataset are in these directories:

  1. configs/datamodule: added our custom YAML file
  2. configs/experiment: added relevant YAML files for our dataset
  3. data/: added custom_data/raw/train and custom_data/raw/test
  4. src/datamodules: added the relevant Python file for our dataset
  5. src/datasets: added the relevant custom-data.py and custom-data_config.py files

We have read #32 (related to RANSAC) and #36 (in which you discuss the voxel, knn, knn_r, pcp_regularization, pcp_spatial_weight and pcp_cutoff parameters), but we are still facing issues. It would be great if you could help us out here!

:point_right: Regarding Errors and Warnings

We are getting the following errors and warnings, which we are unable to resolve at the moment:

  1. Warning in scikit-learn regression: image
  2. NAG-related issue: cannot compute radius-based horizontal graph: image
  3. ValueError: min_samples may not be larger than number of samples: n_samples = 2: image (Following your advice in #32, we have already removed "elevation" from partition_hf and point_hf, but still could not get the training to start.)
  4. torch.cat(): expected a non-empty list of Tensors: image

:point_right: Regarding Understanding the Configuration

Could you also explain the significance of the pcp_regularization, pcp_spatial_weight and pcp_cutoff parameters in the /configs/datamodule/custom_data.yaml file?

We are currently using the following configuration values: image

We have tried tweaking these, but cannot get beyond the processing stage for our dataset; tweaking these parameters gives one or more of the above-mentioned errors and warnings at different stages of processing. Kindly help.


PS: We have already ⭐'d your repo 😉

xbais commented 8 months ago

Also, it seems that you used 20 as the factor to normalize the elevation for DALES and KITTI-360, and 4 for S3DIS (as reported in your research paper). Can you please share how these were calculated, so that we can compute the appropriate value for our own dataset? image

Can we compute the z-range (difference between the lowest and highest z values in the point clouds) of our dataset and use it as the normalizing factor?

drprojects commented 8 months ago

Hi @xbais @pyarelalchauhan ! Thanks for your interest in the project and for this clear and detailed issue. I can tell you searched through existing issues before filing one, I appreciate it :wink:

👉 Regarding Errors and Warnings

It seems to me that all these errors may be pointing to the same thing: one of your clouds is too small. Make sure that you do not have dubious point clouds with only 1 or 2 points. Are you using xy_tiling or pc_tiling to tile your clouds as a preprocessing step? If so, setting these values inadequately may produce spurious tilings.

Here is how tiling works. Tiling is optional: if you do not need it, keep xy_tiling=None and pc_tiling=None. You can use either XY tiling or PC tiling, but not both. XY tiling applies a regular grid along the XY axes, regardless of the cloud's orientation, shape or density; the value of xy_tiling indicates the number of tiles in each direction, so if a single int is passed, each cloud will be divided into xy_tiling**2 tiles. PC tiling recursively splits the data with respect to its principal component in the XY plane; each step splits the data in 2 with respect to its geometry, and the value of pc_tiling indicates the number of split steps used, hence 2**pc_tiling tiles will be created.
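To make the tile counts concrete, here is a small illustrative Python sketch (not code from this repository, just the arithmetic described above):

# Illustrative only: how many tiles each tiling scheme produces.
# xy_tiling applies a regular grid along X and Y;
# pc_tiling recursively splits along the principal component in the XY plane.
def n_tiles(xy_tiling=None, pc_tiling=None):
    assert xy_tiling is None or pc_tiling is None, "use at most one tiling scheme"
    if xy_tiling is not None:
        # a single int means the same number of tiles along X and Y
        return xy_tiling ** 2
    if pc_tiling is not None:
        # each split step halves the cloud, so pc_tiling steps yield 2**pc_tiling tiles
        return 2 ** pc_tiling
    return 1  # tiling disabled

print(n_tiles(xy_tiling=3))  # 9 tiles
print(n_tiles(pc_tiling=3))  # 8 tiles
print(n_tiles())             # 1 tile (no tiling)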

custom dataset collected in India

You said you removed "elevation" from partition_hf and point_hf. Are you sure you do not have any ground or floor in your dataset?

PS: I can hardly read your screenshots. Next time, please favor sharing the full traceback like so:

your python traceback goes here

👉 Regarding Understanding the Configuration

See my reply in #50 for interpreting these parameters. Let me know if this is not clear enough.

Before tweaking the partition parameters, I would recommend fixing the above-mentioned errors, which seem related to spurious 1-point point clouds.

Regarding the parameters for the ground plane search with RANSAC

Also, it seems that you used 20 as the factor to normalize the elevation for DALES and KITTI-360, and 4 for S3DIS (as reported in your research paper). Can you please share how these were calculated, so that we can compute the appropriate value for our own dataset?

As mentioned in #32, GroundElevation will use RANSAC to approximate the ground or floor as a plane. This is often needed because different cloud acquisitions have different Z ranges, due to differences in altitude. We do not want the model to learn to reason on absolute Z values, but on the elevation with respect to the local ground/floor. Hence my question above: are you sure you do not need the elevation in your dataset?

That being said, if you have a look at the documentation for GroundElevation, you will see that ground_threshold is used to restrict the ground search to points within [z_min, z_min + ground_threshold]. This is a heuristic to accelerate the RANSAC algorithm and to avoid erroneously fitting the ground plane on large above-ground structures. ground_scale, on the other hand, is used to scale the computed elevation (i.e. the Z-distance to the fitted ground/floor plane). We want our model's input features to live within similar ranges (e.g. [0, 1] or [-1, 1]), so we scale the elevation by a rough approximation of the maximum elevation in a given setup. This is why we set ground_scale to 4 for indoor scenes and 20 for outdoor scenes. You may want to adapt this if you are dealing with, say, an urban environment with 100-meter-high skyscrapers.
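For intuition only, here is a rough Python sketch of the idea (a minimal sketch with made-up default values, not the repository's actual GroundElevation implementation): fit a plane with RANSAC on the lowest points, then scale each point's Z-distance to that plane.

import numpy as np
from sklearn.linear_model import RANSACRegressor

def elevation(points, ground_threshold=5.0, ground_scale=20.0):
    # points: (N, 3) array of XYZ coordinates
    z = points[:, 2]
    # Restrict the plane search to points near the lowest Z, to speed up RANSAC
    # and avoid fitting the "ground" on rooftops or tree canopies
    candidates = z < z.min() + ground_threshold
    # Fit z = a*x + b*y + c with RANSAC on the candidate ground points
    ransac = RANSACRegressor().fit(points[candidates, :2], z[candidates])
    # Elevation = Z-distance to the fitted plane, scaled to roughly [0, 1]
    ground_z = ransac.predict(points[:, :2])
    return (z - ground_z) / ground_scale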

xbais commented 8 months ago

That was really helpful @drprojects! Setting xy_tiling=null really solved the errors during processing, and now we are able to complete the two-step processing. But we are facing the following issues:

:point_right: Unable To Change Training Device (GPU)

I have a server that has 3 GPU devices (with 40 GB, 80 GB and 80 GB of memory). By default, processing and training both use the first GPU. But this led to a segmentation fault, so I had to change the device to cuda:2 in /configs/datamodule/my_data.yaml. This prevented the segmentation fault during processing. But at the onset of training, an OOM error pops up immediately, and the device being used is device 0 (the same error also pops up if we use all 3 devices, because I think the smallest GPU goes out of memory and terminates the entire training). It appears that distributed processing automatically takes the first n GPUs, where n is specified in the gpu.yaml file.

:heavy_check_mark: We are able to train the architecture on our dataset using all 3 GPU devices by setting devices: 3 in /configs/trainer/gpu.yaml, but... :x: ...we cannot find a way to specify which specific GPU device(s) to use for training (for example cuda:2 only, or cuda:1 and cuda:2). This matters because specific GPUs on the server are sometimes being used by other students in our research lab.

drprojects commented 8 months ago

Selecting a single GPU

To select which GPU to use for a process, either set the CUDA_VISIBLE_DEVICES environment variable

CUDA_VISIBLE_DEVICES=YOUR_GPU_NUMBER

or do it at the beginning of your python script

import torch
torch.cuda.set_device(0)
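
Alternatively (a generic PyTorch sketch, not specific to SPT), you can restrict the visible devices from inside your Python script, provided this happens before CUDA is initialized:

import os
# Must be set before torch initializes CUDA: only GPU 2 will be visible,
# and it will appear as cuda:0 inside this process.
os.environ["CUDA_VISIBLE_DEVICES"] = "2"

import torch
print(torch.cuda.device_count())  # 1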

Multi-GPU

SPT has only been tested on a single GPU. We do not guarantee multi-GPU preprocessing nor training. Besides, a 40 GB GPU (e.g. NVIDIA V100) is more than enough for preprocessing and training all datasets in our paper. So if you run into CUDA OOM errors, you might want to check our related tips & tricks.

If you encounter issues with multi-GPU preprocessing or training and start investigating those, we would gladly welcome a PR :wink:

pyarelalchauhan commented 8 months ago

Great! Exporting the CUDA_VISIBLE_DEVICES worked. Thanks!

We will surely open a PR if we work on multi-GPU support; we hope to build upon this architecture and will share our progress with you accordingly. We are currently having some issues with testing; once we sort them out, we will let you know so that this issue can be closed. :wink:

We are both thankful to you for the prompt help.

pyarelalchauhan commented 8 months ago

Issue Regarding Validation Loss

We were just analyzing the results of our last training run with 3 GPUs (multi-GPU) on our dataset. :heavy_check_mark: The training loss looks good and is converging...

image

:x: ...but we found that the validation loss is very high and not decreasing. This appears to be due to over-fitting... image

Here is our graph for validation mIoU: image

Could you please suggest a way to reduce the validation loss? :grimacing: Could it be solved by increasing the number of superpoints (i.e. by increasing pcp_spatial_weight), as suggested in #36?

drprojects commented 8 months ago

Hi, indeed your validation loss is comparatively high, but relatively stable. Beyond a decreasing validation loss, what you truly want is for your validation mIoU to increase, and this seems to be the case.

Whether the final validation performance of 36.7 mIoU is "good" will depend on your specific dataset. This is not something I can assess for you; you will need to investigate it yourself. How? You should start by doing a lot of visualizations:

You can find some visualization tools provided in notebooks/ to help you get started. For more advanced visualization options, see the show() function.

The rest is up to you, good luck ! :muscle:

xbais commented 8 months ago

Thanks a lot for helping us out!!

zeejja commented 1 month ago

That was really helpful @drprojects! Setting xy_tiling=null really solved the errors during processing, and now we are able to complete the two-step processing. [...]

Hello, I have the same problem. Where did you set 'xy_tiling=null', i.e., in which file?

zeejja commented 1 month ago


@drprojects Thank you for your efforts. Could you please tell me in which file I should put 'xy_tiling=null' to disable tiling, just to speed up the work?