HuguesTHOMAS / KPConv-PyTorch

Kernel Point Convolution implemented in PyTorch
MIT License

KPConv parameters for large outdoor scene with low point density #90

Closed: meidachen closed this issue 3 years ago

meidachen commented 3 years ago

First of all, thanks a lot for this amazing open-sourced code. I have played around with it a little using the provided datasets, and now I really want to train a model on my own data. My datasets were created from aerial images using photogrammetry. I have 24 of them, annotated with only three classes (ground, buildings, and vegetation). The difference from the existing datasets is that mine have a low point density (0.3-meter point spacing) and each one covers a larger area of interest (around 1 square km per area). Could you please help me define the KPConv parameters that you think may be reasonable to produce good results? If I understand correctly, I may need to modify all of the following parameters:

in_radius = 6.0
num_kernel_points = 15
conv_radius = 5.0
deform_radius = 6.0
KP_extent = 1.2

Thank you in advance for your help!! And thanks again for this amazing work!

HuguesTHOMAS commented 3 years ago

Hi @meidachen,

When changing the dataset, there are only two parameters that really count: in_radius and first_subsampling_dl. The first parameter you want to set is first_subsampling_dl. For good results, set it as low as possible, but there is no point in setting it lower than the minimum point spacing of your dataset, so in your case set it to 0.3 meters. Then the rule of thumb is to make in_radius approximately 50 times bigger than first_subsampling_dl, which in your case gives 15 meters. You can play with in_radius: a lower value means smaller inputs and thus faster computation and/or a bigger batch size, while a bigger value gives the network more contextual information in the input spheres. The ideal value depends on the nature of your dataset.
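To make the rule of thumb concrete, here is a minimal sketch with the numbers from your dataset (the variable names mirror the config attributes, but this is just an illustration, not repo code):

point_spacing = 0.3                       # minimum point spacing of the clouds, in meters
first_subsampling_dl = point_spacing      # no benefit in going below the point spacing
in_radius = 50 * first_subsampling_dl     # rule of thumb: ~50x the subsampling grid, i.e. 15 m here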

I advise you to keep the default values for the following parameters, which define the core architecture independently of the data:

num_kernel_points = 15
conv_radius = 2.5
deform_radius = 6.0
KP_extent = 1.2

Other parameters that you may want to modify are:

batch_num = 6 # If your GPU can handle it, make the batch size bigger for better convergence.

lr_decays = {i: 0.1 ** (1 / 150) for i in range(1, max_epoch)} # 150 is the number of epochs over which the learning rate decays by a factor of 0.1; reduce it for faster convergence, but be careful: if it is too low, your network will stop converging too early.
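To make the schedule explicit, here is a small sketch of what this dictionary does (decay_epochs and the example max_epoch are local names for illustration only):

max_epoch = 300        # total number of training epochs (example value)
decay_epochs = 150     # epochs over which the learning rate is divided by 10
lr_decays = {i: 0.1 ** (1 / decay_epochs) for i in range(1, max_epoch)}

# The learning rate is multiplied by 0.1 ** (1 / decay_epochs) at every epoch, so after
# decay_epochs epochs the cumulative factor is 0.1 ** (decay_epochs / decay_epochs) = 0.1.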

Finally, you need to handle the input features:

in_features_dim = 1 + the number of features available for your data points (for example, colors = 3, intensity = 1).

Then go to:

https://github.com/HuguesTHOMAS/KPConv-PyTorch/blob/7fefb6a8d38fd304775199777ad01d9f1546e2ff/datasets/S3DIS.py#L402-L411

and implement the features you want there.
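For illustration, here is a hedged sketch of what such a feature block could look like for the cases discussed above (constant one, RGB colors, and an optional height channel). The function name, variable names, and structure in S3DIS.py differ, so treat this as a template rather than repo code:

import numpy as np

def build_input_features(points, colors, in_features_dim):
    # Sketch only: assemble per-point input features to match in_features_dim.
    # points: (N, 3) float32 array, colors: (N, 3) float32 array in [0, 1].
    ones = np.ones((points.shape[0], 1), dtype=np.float32)       # constant feature, always present
    if in_features_dim == 1:
        features = ones                                          # geometry only
    elif in_features_dim == 4:
        features = np.hstack((ones, colors))                     # 1 + RGB
    elif in_features_dim == 5:
        height = points[:, 2:]                                   # crude height proxy (z coordinate)
        features = np.hstack((ones, colors, height))             # 1 + RGB + height
    else:
        raise ValueError('Unsupported in_features_dim: {:d}'.format(in_features_dim))
    return features.astype(np.float32)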

meidachen commented 3 years ago

Really appreciate your suggestions and detailed explanation! I trained a model with the following parameters and was able to get a great result (mIoU over 85% for all three classes; again, amazing!!). I did this in a very brute-force way for training and testing: I basically prepared all my point clouds in the .ply format, put them into the Data\S3DIS\original_ply folder, and used your S3DIS code to prepare/train/test. I'm wondering whether you think this is okay, or whether there are any issues that could affect the results I currently have? And do you think there is any way I could improve the performance further?

in_radius = 15.0
num_kernel_points = 15
first_subsampling_dl = 0.3
conv_radius = 2.5
deform_radius = 6.0
KP_extent = 1.2

max_epoch = 300
learning_rate = 1e-2
momentum = 0.98
lr_decays = {i: 0.1 ** (1 / 100) for i in range(1, max_epoch)}
grad_clip_norm = 100.0

batch_num = 6
epoch_steps = 200

I would really like to include your KPConv in our workflow (generating game environments/simulations from UAV images), and I have a few questions which I hope you could help me with.

  1. I saw that your original implementation performed better on Semantic3D without the deformable layers. Should I follow that and use 'resnetb_strided' instead of 'resnetb_deformable' and 'resnetb_deformable_strided' in your PyTorch implementation?

  2. In your original paper, under the "Pipeline for real scene segmentation" section, you mention: "At testing, we pick spheres regularly in the point clouds but ensure each point is tested multiple times by different sphere locations." But when I tried to test the model, I didn't find how to "pick spheres regularly" (I guess the spheres are the "potentials" here, and they are still randomly selected in the validation set). I also saw that config.validation_size is a hardcoded number; if I use a small number it may not cover the entire area and will run forever on a large dataset, and if I use a large number it may also run for a very long time on a small dataset. Could you please help me understand how to make this work as described in the paper? I also saw this in test_models.py:

    elif config.dataset == 'S3DIS':
        test_dataset = S3DISDataset(config, set='validation', use_potentials=True)
        test_sampler = S3DISSampler(test_dataset)
        collate_fn = S3DISCollate

  Is this (set='validation') related to the problem I'm having?

  3. If I don't want to use color as a feature and just want to use the point locations, should I put 1 for every point, or should I do something else?

  4. Could you please point me to which part of the code I should change when I want to segment an unlabeled dataset? I'm a bit confused here, because I originally thought the data would be prepared during testing, but it seems like all the data is prepared during training, and at test time it just goes through the validation set that was already prepared. I could be very wrong here, so please help.

Thanks again for your help! And looking forward to hearing back from you.

HuguesTHOMAS commented 3 years ago

Hi @meidachen,

Thanks for your kind words. I think what you did is actually quite good. For your type of data (large point clouds covering wide areas), the S3DIS pipeline is the most suitable. When you start getting good results like yours, the margin of improvement becomes smaller. I have noticed that you can gain 1 or 2 mIoU points by tuning the parameters to their best values, but at this point, getting 85% or 87% won't change much for you. The improvement could be more significant if you add a new feature to the input. For example, with your type of data, adding the height above ground could be a game changer (if you don't do it already).

Now concerning your questions:

  1. I would advise keeping the rigid convolution. It is much simpler, faster, and the results are not always worse. See the answer I gave previously.

  2. The input pipeline using potentials is explained in my PhD thesis, section IV.2.8.c (p. 127-128). The validation is only there to keep track of performance during convergence. You should adapt the validation size to the size of your dataset, but do not make it too big; you don't want to lose too much time (10 times smaller than epoch_steps is generally a good option). You can always use the script test_models.py afterward to test on the whole dataset. In test_models.py, we use set='validation' because there is no test set for S3DIS.

  3. You should always put 1 for every point, even when you use colors. The rule is to keep the constant feature equal to one and then add the other features on top of it. In your case I would advise using height above ground; colors could also be helpful (especially for vegetation), so that would make 5 features in total (or 2 if you don't use colors). Compare both and see which works best in your case.

  4. As I said, S3DIS does not have an unlabeled test set, so this is not implemented for this dataset. Here you can see where the point clouds are chosen depending on the set: https://github.com/HuguesTHOMAS/KPConv-PyTorch/blob/7fefb6a8d38fd304775199777ad01d9f1546e2ff/datasets/S3DIS.py#L112-L149 Change the code from lines 139 to 149 to choose your unlabeled files when set == 'test'.

Then, in test_models.py, change set='validation' to set='test': https://github.com/HuguesTHOMAS/KPConv-PyTorch/blob/7fefb6a8d38fd304775199777ad01d9f1546e2ff/test_models.py#L169-L172

Finally, you may have to check that everything goes well in the test function: https://github.com/HuguesTHOMAS/KPConv-PyTorch/blob/7fefb6a8d38fd304775199777ad01d9f1546e2ff/utils/tester.py#L176
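Putting those steps together, here is a rough sketch of the kind of change involved (the surrounding code in S3DIS.py and test_models.py differs in detail, so treat the names and file list below as assumptions rather than a drop-in patch):

# In datasets/S3DIS.py, where the cloud files are chosen per split (around lines 139-149),
# make the 'test' split point at your own unlabeled .ply files (hypothetical names):
if self.set == 'test':
    self.cloud_names = ['my_unlabeled_area_1', 'my_unlabeled_area_2']   # no labels needed for this set

# Then, in test_models.py, load the test split instead of the validation split:
test_dataset = S3DISDataset(config, set='test', use_potentials=True)
test_sampler = S3DISSampler(test_dataset)
collate_fn = S3DISCollate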

meidachen commented 3 years ago

Thank you for your reply! I'm wondering, when you say "height above ground", do you mean segmenting the ground first and then computing the actual height above it, or do you mean using the following feature, which is already implemented in your code?

input_features = np.hstack((input_colors, input_points[:, 2:] + center_point[:, 2:])).astype(np.float32)

I also tried to experiment with the S3DIS data and wanted to reproduce your result of >74 mIoU, but was not successful. Here are the parameters I used; I tried to use the same ones as yours, but my GPU doesn't have enough memory, so I reduced the batch size and didn't use the deformable layers. I was getting an mIoU around 67. Could you please help me see what may be the reason for this lower performance?

#########################
# Architecture definition
#########################

# Define layers
architecture = ['simple',
                'resnetb',
                'resnetb_strided',
                'resnetb',
                'resnetb',
                'resnetb_strided',
                'resnetb',
                'resnetb',
                'resnetb_strided',
                'resnetb',
                'resnetb',
                'resnetb_strided',
                'resnetb',
                'resnetb',
                'nearest_upsample',
                'unary',
                'nearest_upsample',
                'unary',
                'nearest_upsample',
                'unary',
                'nearest_upsample',
                'unary']

###################
# KPConv parameters
###################

# Radius of the input sphere
in_radius = 1.5

# Number of kernel points
num_kernel_points = 15

# Size of the first subsampling grid in meter
first_subsampling_dl = 0.03

# Radius of convolution in "number grid cell". (2.5 is the standard value)
conv_radius = 2.5

# Radius of deformable convolution in "number grid cell". Larger so that deformed kernel can spread out
deform_radius = 6.0

# Radius of the area of influence of each kernel point in "number grid cell". (1.0 is the standard value)
KP_extent = 1.2

# Behavior of convolutions in ('constant', 'linear', 'gaussian')
KP_influence = 'linear'

# Aggregation function of KPConv in ('closest', 'sum')
aggregation_mode = 'sum'

# Choice of input features
first_features_dim = 128
in_features_dim = 5

# Can the network learn modulations
modulated = False

# Batch normalization parameters
use_batch_norm = True
batch_norm_momentum = 0.02

# Deformable offset loss
# 'point2point' fitting geometry by penalizing distance from deform point to input points
# 'point2plane' fitting geometry by penalizing distance from deform point to input point triplet (not implemented)
deform_fitting_mode = 'point2point'
deform_fitting_power = 1.0              # Multiplier for the fitting/repulsive loss
deform_lr_factor = 0.1                  # Multiplier for learning rate applied to the deformations
repulse_extent = 1.2                    # Distance of repulsion for deformed kernel points

#####################
# Training parameters
#####################

# Maximal number of epochs
max_epoch = 500

# Learning rate management
learning_rate = 1e-2
momentum = 0.98
lr_decays = {i: 0.1 ** (1 / 150) for i in range(1, max_epoch)}
grad_clip_norm = 100.0

# Number of batch
batch_num = 4

# Number of steps per epochs
epoch_steps = 500

# Number of validation examples per epoch
validation_size = 50

# Number of epoch between each checkpoint
checkpoint_gap = 50

# Augmentations
augment_scale_anisotropic = True
augment_symmetries = [True, False, False]
augment_rotation = 'vertical'
augment_scale_min = 0.8
augment_scale_max = 1.2
augment_noise = 0.001
augment_color = 0.8

# The way we balance segmentation loss
#   > 'none': Each point in the whole batch has the same contribution.
#   > 'class': Each class has the same contribution (points are weighted according to class balance)
#   > 'batch': Each cloud in the batch has the same contribution (points are weighted according to cloud sizes)
segloss_balance = 'none'

# Do we need to save convergence
saving = True
saving_path = None

I also noticed that when I applied the model trained on S3DIS to the interior data I collected, it didn't work well. I think the main issue is that my data has multiple floors, and not every floor has its ground at z = 0. Any horizontal plane on the second floor and above gets mislabeled as ceiling (i.e., not only does the floor become the ceiling, but things like table surfaces are also labeled as ceiling). I guess my question is: should the data always be aligned with the training data at the same elevation? (For interiors this is fine, but for an outdoor area it may be hard to align the data, especially if the terrain has slopes.) I'm wondering whether this is because I used this feature to train on S3DIS: input_features = np.hstack((input_colors, input_points[:, 2:] + center_point[:, 2:])).astype(np.float32), or is there anything in the network that was trained to look at the locations of the points (ps. I didn't find anything like this in your paper)?

Thank you again for your help!

HuguesTHOMAS commented 3 years ago

I'm wondering, when you say "height above ground", do you mean segmenting the ground first and then computing the actual height above it, or do you mean using the following feature, which is already implemented in your code?

Whatever you feel is best for your dataset. Sometimes assuming the ground is at the lowest height value is enough; sometimes you need a more robust ground extraction. You can first try with the z coordinate of your points (as I did) to see if that helps.
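If you later want something a bit more robust than the raw z coordinate, a simple (hypothetical) approximation is to subtract a local ground estimate, for example the lowest z in a coarse 2D grid cell:

import numpy as np

def height_above_ground(points, cell_size=5.0):
    # Rough height-above-ground estimate: subtract, for every point, the lowest z found
    # in its 2D grid cell. A real pipeline would use a proper ground filter, but this is
    # often enough for aerial data. points is an (N, 3) array.
    cells = np.floor(points[:, :2] / cell_size).astype(np.int64)   # 2D cell index of each point
    _, inverse = np.unique(cells, axis=0, return_inverse=True)     # map points to unique cells
    ground_z = np.full(inverse.max() + 1, np.inf)
    np.minimum.at(ground_z, inverse, points[:, 2])                 # lowest z per cell
    return (points[:, 2] - ground_z[inverse]).astype(np.float32)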

I also tried to experiment with the S3DIS data and wanted to reproduce your result of >74 mIoU, but was not successful. Here are the parameters I used; I tried to use the same ones as yours, but my GPU doesn't have enough memory, so I reduced the batch size and didn't use the deformable layers. I was getting an mIoU around 67. Could you please help me see what may be the reason for this lower performance?

Have another look at the paper: the 74% mIoU is for Semantic3D. The best score I got on S3DIS was actually 67% mIoU, with deformable convolutions, so your performance is pretty high.

I think the main issue is that my data has multiple floors, and not every floor has its ground at z = 0.

Indeed, that causes trouble for exactly the reason you explained. As S3DIS is always aligned at a single level, using the height feature made sense, but in your case you should just remove it. Simply change in_features_dim = 4 in your configuration and that should be a lot better. As you said, this height feature should only be used in specific cases where there are no slopes or multiple levels.
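In terms of the feature line you quoted, that amounts to stacking only the constant one and the colors and dropping the z term (a sketch with assumed variable names, as before, not the exact repo code):

in_features_dim = 4   # 1 (constant) + 3 (RGB), no absolute height

ones = np.ones((input_colors.shape[0], 1), dtype=np.float32)
input_features = np.hstack((ones, input_colors)).astype(np.float32)   # no z channel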

Is there anything in the network that was trained to look at the locations of the points (ps. I didn't find anything like this in your paper)?

No, the network only looks at the features you give it. So if you don't give any location feature, the network will only focus on the geometric shapes and the color patterns (if you have colors).

Best, Hugues