drprojects / superpoint_transformer

Official PyTorch implementation of Superpoint Transformer introduced in [ICCV'23] "Efficient 3D Semantic Segmentation with Superpoint Transformer" and SuperCluster introduced in [3DV'24 Oral] "Scalable 3D Panoptic Segmentation As Superpoint Graph Clustering"
MIT License

Full resolution position #78

Closed Yarroudh closed 4 months ago

Yarroudh commented 5 months ago

Hello @drprojects, is there any way to recover the full-resolution positions, so I can store the results as a PLY file?

drprojects commented 5 months ago

Hi @Yarroudh, as of now we only provide the full-resolution outputs and do not keep track of the full-resolution positions in the pipeline, for disk-memory reasons. Indeed, we assume the user can still read the raw file herself if the positions or other full-resolution input attributes are needed, as saving these along with the preprocessed data would only duplicate them on disk. PRs are welcome if you think this feature would be useful 😉
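For instance, here is a minimal sketch of that approach, reading the positions back from the raw PLY with plyfile and writing them next to the predictions. The file paths and the full_res_pred tensor are assumptions for illustration, not part of the repo:

import numpy as np
import torch
from plyfile import PlyData, PlyElement


def write_full_res_ply(raw_ply_path: str, out_ply_path: str, full_res_pred: torch.Tensor):
    """Read positions from the raw PLY and save them alongside per-point predictions.

    `full_res_pred` is assumed to hold one predicted label per raw point, in raw-file order.
    """
    raw = PlyData.read(raw_ply_path)
    x = np.asarray(raw["vertex"]["x"])
    y = np.asarray(raw["vertex"]["y"])
    z = np.asarray(raw["vertex"]["z"])

    labels = full_res_pred.detach().cpu().numpy().astype(np.int32)
    assert labels.shape[0] == x.shape[0], "predictions must match the raw point count"

    # Build a structured array holding positions + predicted labels and write it out.
    vertices = np.empty(
        labels.shape[0],
        dtype=[("x", "f4"), ("y", "f4"), ("z", "f4"), ("pred", "i4")])
    vertices["x"], vertices["y"], vertices["z"], vertices["pred"] = x, y, z, labels
    PlyData([PlyElement.describe(vertices, "vertex")]).write(out_ply_path)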

Yarroudh commented 5 months ago

I get it now. I was wondering whether the full-resolution results are in the same order as the vertices of the raw file. I will try this. Thanks.

Yarroudh commented 5 months ago

However, I'm getting this error when I try to get the full resolution:

assert super_index_raw_to_level0 is not None or sub_level0_to_raw is not None, \
AssertionError: Must provide either `super_index_raw_to_level0` or `sub_level0_to_raw`

When I checked nag[0].sub, its value was None. Any suggestions regarding this error? @drprojects

JJrodny commented 5 months ago

Hey @Yarroudh: @gvoysey and I figured this one out. In the yaml file you created in configs/datamodules/ (e.g. configs/datamodules/custom_dataset.yaml), add sub as a parameter of the pre-processing transform:

pre_transform:
    - transform: SaveNodeIndex
      params:
        key: 'sub'

Yarroudh commented 5 months ago

Hello @JJrodny, thanks for replying to my issue. I already have this parameter in my data configuration file, but I still get that error.

Yarroudh commented 5 months ago

I found a solution for this error. I added load_full_res_idx: True to my datamodule yaml file, and now it works.
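For reference, a minimal sketch of where this flag can sit in the datamodule yaml. The file path and placement are assumptions; only the load_full_res_idx line is the actual fix:

# e.g. configs/datamodules/custom_dataset.yaml (hypothetical file)
load_full_res_idx: True   # keep the raw-point indices needed to recover full-resolution outputs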

drprojects commented 5 months ago

However, I'm getting this error when I try to get the full resolution:

assert super_index_raw_to_level0 is not None or sub_level0_to_raw is not None, \
AssertionError: Must provide either `super_index_raw_to_level0` or `sub_level0_to_raw`

When I checked nag[0].sub, its value was None. Any suggestions regarding this error? @drprojects

This is intended behavior, as detailed in the demo notebook, which shows that datamodule.load_full_res_idx=True needs to be set in the config for loading full-resolution indices.

Yarroudh commented 5 months ago

Thanks for your response @drprojects. However, the full-resolution prediction gives me a tensor of size torch.Size([978509]). The sampled data contains 590701 points, and the original data exactly 1663621. I have been trying to figure out why the size of the full-resolution results does not match the size of my raw data.

Data(sub=Cluster(num_clusters=590701, num_points=978509, device=cuda:0), super_index=[590701], y=[590701, 9], pos=[590701, 3], elevation=[590701, 1], intensity=[590701], linearity=[590701, 1], planarity=[590701, 1], pos_offset=[3], scattering=[590701, 1], verticality=[590701, 1], x=[590701, 6], semantic_pred=[590701])

The number of points is reduced to 978509 ?

drprojects commented 5 months ago

Has your dataset been tiled ? i.e. datamodule.xy_tiling ≠ None or datamodule.pc_tiling ≠ None

Yarroudh commented 5 months ago

I do have xy_tiling: 1. Could this be the source of the problem ? I have set this value to 1 because otherwise I get this error:

Traceback (most recent call last):
  File "src/predict.py", line 177, in main
    predict(cfg)
  File "/home/anass/superpoint_transformer/src/utils/utils.py", line 48, in wrap
    raise ex
  File "/home/anass/superpoint_transformer/src/utils/utils.py", line 45, in wrap
    metric_dict, object_dict = task_func(cfg=cfg)
  File "src/predict.py", line 58, in predict
    datamodule.prepare_data()
  File "/home/anass/superpoint_transformer/src/datamodules/base.py", line 144, in prepare_data
    self.dataset_class(
  File "/home/anass/superpoint_transformer/src/datasets/base.py", line 223, in __init__
    super().__init__(root, transform, pre_transform, pre_filter)
  File "/home/anass/miniconda3/envs/spt/lib/python3.8/site-packages/torch_geometric/data/in_memory_dataset.py", line 57, in __init__
    super().__init__(root, transform, pre_transform, pre_filter, log)
  File "/home/anass/miniconda3/envs/spt/lib/python3.8/site-packages/torch_geometric/data/dataset.py", line 97, in __init__
    self._process()
  File "/home/anass/superpoint_transformer/src/datasets/base.py", line 647, in _process
    self.process()
  File "/home/anass/superpoint_transformer/src/datasets/base.py", line 682, in process
    self._process_single_cloud(p)
  File "/home/anass/superpoint_transformer/src/datasets/base.py", line 710, in _process_single_cloud
    nag = self.pre_transform(data)
  File "/home/anass/miniconda3/envs/spt/lib/python3.8/site-packages/torch_geometric/transforms/compose.py", line 24, in __call__
    data = transform(data)
  File "/home/anass/superpoint_transformer/src/transforms/transforms.py", line 23, in __call__
    return self._process(x)
  File "/home/anass/superpoint_transformer/src/transforms/graph.py", line 656, in _process
    nag = _horizontal_graph_by_radius(
  File "/home/anass/superpoint_transformer/src/transforms/graph.py", line 760, in _horizontal_graph_by_radius
    nag = _horizontal_graph_by_radius_for_single_level(
  File "/home/anass/superpoint_transformer/src/transforms/graph.py", line 805, in _horizontal_graph_by_radius_for_single_level
    raise ValueError(
ValueError: Input NAG only has 1 node at level=3. Cannot compute radius-based horizontal graph.

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.

drprojects commented 5 months ago

Hmm, it seems strange that setting xy_tiling: 1 versus xy_tiling: null would address the error you encounter. Isn't it rather that you had xy_tiling > 1 before ?

In any case, having xy_tiling: 1 should basically not tile at all. So I don't see why your raw data and nag[0].sub.num_points differ. Can you try with explicitly setting datamodule.xy_tiling=None or datamodule.pc_tiling=None ?

Yarroudh commented 5 months ago

Setting cfg.datamodule.xy_tiling = None does not resolve the issue; I still get nag[0] as follows:

Data(sub=Cluster(num_clusters=590701, num_points=978509, device=cuda:0), super_index=[590701], y=[590701, 9], pos=[590701, 3], elevation=[590701, 1], intensity=[590701], linearity=[590701, 1], planarity=[590701, 1], pos_offset=[3], scattering=[590701, 1], verticality=[590701, 1], x=[590701, 6], semantic_pred=[590701])

Yarroudh commented 5 months ago

[UPDATE] I've noticed that the voxel size affects nag[0].num_points. I used a voxel size of 0.02 to enhance my training results and detect small objects. However, this results in a mismatch between the number of points in the raw data and the full-resolution predictions. I set the voxel size to 0.1, 0.2 and 0.3, and noticed that the number of subs increases gradually each time I increase the voxel size. Finally, with datamodule.voxel = 0.3, I could recover all the points:

Data(sub=Cluster(num_clusters=57017, num_points=1663452, device=cuda:0), super_index=[57017], y=[57017, 9], pos=[57017, 3], elevation=[57017, 1], intensity=[57017], linearity=[57017, 1], planarity=[57017, 1], pos_offset=[3], scattering=[57017, 1], verticality=[57017, 1], x=[57017, 6], semantic_pred=[57017])

This is a strange behavior.

The full-resolution predictions are not in the same order as the raw data vertices, so it's not possible to match them.

drprojects commented 5 months ago

This sounds quite strange, there is no reason for datamodule.voxel=0.3 to work and not the other resolutions... Changing datamodule.voxel should trigger a whole new preprocessing of your entire dataset. Is that the case ?

Have you made any modifications to the code ? An MRE + all code modifications would be needed. Although, to be honest, I probably won't have time to investigate the matter for another 2 weeks.

Yarroudh commented 5 months ago

Yes, changing the voxel size started a new preprocessing of the dataset, but I managed to preprocess only the test data. However, as I mentioned, I could not match the vertices with the right predictions.

I did not modify the code; all the files I added are for my custom dataset.

Yarroudh commented 5 months ago

[UPDATE] As I could not match the full-resolution predictions with the raw data, due to the different point order, I ended up using a KNN search to find, for each point in the raw data, the nearest neighbor in the sampled data, then assign its prediction value to it. Not the most accurate solution, as it does not respect the grid sampling already done, but it could be useful.
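A rough sketch of this KNN fallback, assuming scipy is available; the function and array names are placeholders for illustration, not variables from the repo:

import numpy as np
from scipy.spatial import cKDTree


def knn_transfer_predictions(raw_pos, sampled_pos, sampled_pred):
    """Assign to each raw point the prediction of its nearest sampled point.

    raw_pos:      (N_raw, 3) float array, positions read from the raw file
    sampled_pos:  (N_sampled, 3) float array, positions of the sampled points
    sampled_pred: (N_sampled,) int array, per-sampled-point predicted labels
    """
    tree = cKDTree(sampled_pos)            # build a KD-tree on the sampled positions
    _, nn_idx = tree.query(raw_pos, k=1)   # nearest sampled point for each raw point
    return sampled_pred[nn_idx]            # transfer its label to the raw point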

drprojects commented 5 months ago

As I could not match the full-resolution predictions with the raw data, due to the different point order, I ended up using a KNN search to find, for each point in the raw data, the nearest neighbor in the sampled data, then assign its prediction value to it. Not the most accurate solution, as it does not respect the grid sampling already done, but it could be useful.

Yes, that's the workaround I would like to avoid because it slows down inference time. Keeping track of the indexing should avoid this costly operation.

As mentioned above, I will be out of office for the next 2 weeks. I may not have time to look into this until then, but I think I found the source of the error: there is a SampleSubNodes in the on_device_val_transform and on_device_test_transform of all datasets that breaks the indexing. These are used for subsampling voxels at inference time; this saves a bit of compute and memory but may not be essential for performance. I need to check whether this affects the performance communicated in our papers before removing it.

On your end, you can just try removing those SampleSubNodes from your on_device_val_transform and on_device_test_transform (keep it in on_device_train_transform though !) and see if it solves the indexing issue.
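As a rough illustration, in a dataset yaml this would amount to something like the following; the transform names other than SampleSubNodes are placeholders, not the actual config contents:

# Sketch of a dataset yaml, showing only the relevant transform lists
on_device_test_transform:
    # - transform: SampleSubNodes    # remove (or comment out) this entry at val/test time...
    - transform: OtherTransform      # ...and keep the remaining transforms as they were
on_device_train_transform:
    - transform: SampleSubNodes      # but keep it for training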

meehirmhatrepy commented 4 months ago

For the KITTI-360 dataset, in the point cloud file, element vertex = 3503785, but len(raw_pano_y) = 3499706.

In these types of cases, how do I match the indices?

drprojects commented 4 months ago

@meehirmhatrepy I have not had time to work on fixing the present issue yet.

On your end, you can just try removing those SampleSubNodes from your on_device_val_transform and on_device_test_transform (keep it in on_device_train_transform though !) and see if it solves the indexing issue.

@Yarroudh have you tried the above-suggested fix to see if it solved the issue ?

meehirmhatrepy commented 4 months ago

@meehirmhatrepy I have not had time to work on fixing the present issue yet.

On your end, you can just try removing those SampleSubNodes from your on_device_val_transform and on_device_test_transform (keep it in on_device_train_transform though !) and see if it solves the indexing issue.

@Yarroudh have you tried the above-suggested fix to see if it solved the issue ?

I tried removing SampleSubNodes from on_device_val_transform in kitti360.yaml, but I am still not getting full-resolution output for all points.

weilailoveL commented 4 months ago

Hello @drprojects, when I used the command python src/eval.py experiment=semantic/dales ckpt_path=logs/train/runs/2024-04-09_20-43-21/checkpoints/epoch_199.ckpt to evaluate my training results on the DALES dataset, I encountered an error:

  File "/home/redamancy/anaconda3/envs/spt/lib/python3.8/site-packages/torch_geometric/nn/norm/graph_norm.py", line 64, in forward
    return self.weight * out / std + self.bias
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.70 GiB. GPU 0 has a total capacity of 15.74 GiB of which 1.80 GiB is free. Including non-PyTorch memory, this process has 13.85 GiB memory in use. Of the allocated memory 13.40 GiB is allocated by PyTorch, and 230.80 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
Testing DataLoader 0: 36%|█████████████████████████████████████████

How can I solve this problem?

drprojects commented 4 months ago

@weilailoveL your comment is unrelated; please stick to the topic, this is not the first time you have been told so here. Besides, please open a new issue only after making sure the answer is not already in the README, documentation, or past issues.

weilailoveL commented 4 months ago

@drprojects Thank you for your reply. I apologize for this. As a newcomer to deep learning, I was so eager to know how to solve this problem that I overlooked it. Once again, I apologize for any inconvenience caused.

Yarroudh commented 4 months ago

@meehirmhatrepy I have not had time to work on fixing the present issue yet.

On your end, you can just try removing those SampleSubNodes from your on_device_val_transform and on_device_test_transform (keep it in on_device_train_transform though !) and see if it solves the indexing issue.

@Yarroudh have you tried the above-suggested fix to see if it solved the issue ?

No, I have not worked on that as I went on holiday. Once I'm back at the office, I will try the suggested solution.

drprojects commented 4 months ago

@Yarroudh I cannot reproduce the error you are encountering. In my case:

nag[0].sub.num_points.item() == raw_semseg_y.numel()
>>> True

Have you made any modifications to the code / configs ? Are you using a custom dataset or one of the provided ones ? Could you share an MRE ?
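For such an MRE, a quick point-count sanity check along these lines can help narrow things down. This is only a sketch: the raw-file path is hypothetical, and nag is assumed to be the preprocessed NAG loaded for the same cloud:

from plyfile import PlyData


def check_point_counts(raw_ply_path, nag):
    """Compare the raw file's vertex count with the raw-point count tracked at level 0."""
    n_raw = PlyData.read(raw_ply_path)["vertex"].count
    n_sub = int(nag[0].sub.num_points)
    print(f"raw: {n_raw}  |  nag[0].sub: {n_sub}  |  match: {n_raw == n_sub}")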

MaximeROUBY commented 4 months ago

Hi, I had the same issue with my custom dataset and removing the SampleSubNodes worked for me. Thanks for this answer.

As I could not match the full-resolution predictions with the raw data, due to the different point order, I ended up using a KNN search to find, for each point in the raw data, the nearest neighbor in the sampled data, then assign its prediction value to it. Not the most accurate solution, as it does not respect the grid sampling already done, but it could be useful.

Yes, that's the workaround I would like to avoid because it slows down inference time. Keeping track of the indexing should avoid this costly operation.

As mentioned above, I will be out of office for the next 2 weeks. I may not have time to look into this until then, but I think I found the source of the error: there is a SampleSubNodes in the on_device_val_transform and on_device_test_transform of all datasets that breaks the indexing. These are used for subsampling voxels at inference time; this saves a bit of compute and memory but may not be essential for performance. I need to check whether this affects the performance communicated in our papers before removing it.

On your end, you can just try removing those SampleSubNodes from your on_device_val_transform and on_device_test_transform (keep it in on_device_train_transform though !) and see if it solves the indexing issue.

drprojects commented 4 months ago

Hi @MaximeROUBY, thanks for confirming this ! :pray:

I came to the same conclusion on my end. So I just updated the repo so the datamodule configs do not call SampleSubNodes in the validation and test transforms anymore. For anyone wondering, this is a minor change that will not affect semantic/panoptic segmentation performance, but will (only marginally) increase the memory/compute cost of inference, for the sake of maintaining full-resolution indexing.

I consider this issue solved and am closing it.

Wind010321 commented 2 months ago

Hi, I want to check whether, in the latest version, the only way to keep the original number of points during preprocessing is:

  1. Changing the value of "voxel" in the corresponding datamodule yaml file.

Are there any other possible methods in the latest version of the code? Thank you a lot!

drprojects commented 2 months ago

@Wind010321 I do not understand your question, can you please clarify a bit ?

Wind010321 commented 2 months ago

@Wind010321 I do not understand your question, can you please clarify a bit ?

For example: in the raw point cloud file, the number of points is 10000, but nag[0] may contain fewer than 10000 points. In other words, the default preprocessing method may reduce the number of points returned. I found in other people's questions that there is a solution to keep the number of points by adjusting the value of "datamodule.voxel".

drprojects commented 2 months ago

Yes, there is a voxelization step in the preprocessing pipeline. See GridSampling3D in the datamodule config. For full-resolution inference, see the README:

Full-resolution predictions

By design, our models only need to produce predictions for the superpoints of the $P_1$ partition level during training. All our losses and metrics are formulated as superpoint-wise objectives. This conveniently saves compute and memory at training and evaluation time.

At inference time, however, we often need the predictions on the voxels of the $P_0$ partition level or on the full-resolution input point cloud. To this end, we provide helper functions to recover voxel-wise and full-resolution predictions.

See our demo notebook for more details on these.
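Conceptually, once the raw-to-voxel indexing is kept (e.g. with load_full_res_idx), propagating voxel-wise predictions back to the raw points boils down to a gather. Here is a sketch with hypothetical tensor names; this is not the repo's actual helper API, which the demo notebook documents:

import torch


def upsample_voxel_predictions(voxel_pred: torch.Tensor, raw_to_voxel: torch.Tensor) -> torch.Tensor:
    """Broadcast per-voxel predictions to full resolution.

    voxel_pred:   (N_voxels,) predicted label for each level-0 voxel
    raw_to_voxel: (N_raw,) long tensor, index of the voxel each raw point was
                  assigned to during GridSampling3D (the indexing kept when
                  full-resolution indices are loaded)
    """
    return voxel_pred[raw_to_voxel]  # one prediction per raw point, in raw-file order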

Wind010321 commented 2 months ago

I see, thank you very much!