HuguesTHOMAS / KPConv-PyTorch

Kernel Point Convolution implemented in PyTorch
MIT License

Floating point exception (core dumped) #89

Open YuXing3d opened 3 years ago

YuXing3d commented 3 years ago

Hi @HuguesTHOMAS,

I used your KPConv (PyTorch version) on a customized dataset. The error "Floating point exception (core dumped)" always occurred during iteration over the validation dataloader. As in the figure below, the error occurred after the 287th iteration, while the maximum iteration number is actually 500. The number 287 is not fixed; sometimes it is a different number. Please just ignore "pass 1/2/3"; they are printed for debugging.

I think this bug could be caused by a problem in your dataloader code, because I have ruled out all of the problems in my own code. Have you ever encountered it? Do you know what could cause this bug?


Look forward to your reply. Thank you in advance!

Best regards, Yuxing

YuXing3d commented 3 years ago

This is my current code for def cloud_segmentation_validation(self, net, val_loader, config, debug=True). Even when the code is simplified as below, the bug still exists, which is why I suspect something in the dataloader. I hope this helps you understand what I mean.

    # Choose validation smoothing parameter (0 for no smoothing, 0.99 for big smoothing)
    val_smooth = 0.95
    softmax = torch.nn.Softmax(1)

    # Number of classes including ignored labels
    nc_tot = val_loader.dataset.num_classes

    # Number of classes predicted by the model
    nc_model = config.num_classes - len(self.ignored_labels)

    #####################
    # Network predictions
    #####################
    # Start validation loop
    print("validation size:" + str(len(val_loader)))
    for i, batch in enumerate(val_loader):
        print("Times: " + str(i))
        print("pass 1")

        if 'cuda' in self.device.type:
            batch.to(self.device)

        # Forward pass
        net.eval()
        outputs = net(batch, config)

        print("pass 2")

        print("pass 3")

    return
HuguesTHOMAS commented 3 years ago

You are right, the exception surely comes from the dataloader. If the bug occurs at a random epoch, it is possibly caused by an aberrant input (for example, a point cloud with zero points). I would advise printing debugging messages in the __getitem__ function of the dataset you are using. For example, if it were SemanticKitti: https://github.com/HuguesTHOMAS/KPConv-PyTorch/blob/7fefb6a8d38fd304775199777ad01d9f1546e2ff/datasets/SemanticKitti.py#L200

You can focus your debugging messages around the C++ wrappers called during this function:

https://github.com/HuguesTHOMAS/KPConv-PyTorch/blob/7fefb6a8d38fd304775199777ad01d9f1546e2ff/datasets/SemanticKitti.py#L346-L349

and

https://github.com/HuguesTHOMAS/KPConv-PyTorch/blob/7fefb6a8d38fd304775199777ad01d9f1546e2ff/datasets/SemanticKitti.py#L459-L462
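A debugging message around such a wrapper call could look like the sketch below. The helper name and the `subsample_fn` argument are illustrative stand-ins for the repo's actual compiled subsampling wrapper:

```python
import numpy as np

def checked_subsample(points, sampleDl, subsample_fn):
    """Log the input size and guard against empty clouds before calling a
    compiled subsampling wrapper; a floating point exception raised inside
    C++ code often comes from an empty input. `subsample_fn` stands in for
    the real grid-subsampling wrapper (hypothetical helper)."""
    print(f"[debug] subsampling {points.shape[0]} points at dl={sampleDl}")
    if points.shape[0] == 0:
        raise ValueError("empty point cloud passed to subsampling wrapper")
    return subsample_fn(points, sampleDl)
```

Raising a Python exception here turns a silent core dump into a readable stack trace that points at the offending sample.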

Good luck with your issue, don't hesitate to post again here if you have more debugging information or if you resolved the bug.

Best, Hugues

PPPPeterpan commented 3 years ago

Hi @YuXing3d, did you resolve the bug? I got the same error during the training process.

hadilou commented 3 years ago

Hi @HuguesTHOMAS. I had the same error on a custom dataset of urban scenes, where I use a subsampling of 30 cm and input spheres of 12 m. This setting gave me the best results with the TF implementation. I traced the error back to the regular picking function: the input spheres were empty at some point during training. Then I realized you have been using coarser trees (with a subsampling of one tenth of the input sphere) with the potentials. This seems to be the problem, because when I use the normal input trees to update the potentials instead of the coarser ones, the problem disappeared. What was the idea behind the use of the coarser trees, which weren't in the TF implementation?
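A guard against the empty spheres described above could look like this sketch (a hypothetical helper, not the repo's picking code):

```python
import numpy as np

def pick_sphere(points, center, radius, rng=None, max_retries=10):
    """Return the points inside an input sphere, retrying with a new
    center whenever the sphere comes back empty (an empty sphere is what
    crashed training here). Illustrative sketch only."""
    rng = rng if rng is not None else np.random.default_rng(0)
    for _ in range(max_retries):
        mask = np.linalg.norm(points - center, axis=1) < radius
        if mask.any():
            return points[mask]
        # Re-center on an actual data point so the next try cannot be empty
        center = points[rng.integers(len(points))]
    raise RuntimeError("could not find a non-empty input sphere")
```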

HuguesTHOMAS commented 3 years ago

Actually, the coarser trees were in the TF implementation too. The idea is that the potentials are there to ensure a regular picking of input spheres across the dataset. Because the spheres are very big, we do not need to keep track of potentials on every single data point; having just a subset of them is sufficient. Therefore, I use coarser trees for a faster implementation. Did you notice a significant difference in training time with and without coarser trees?
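The idea of updating potentials on a subset can be sketched as below. This is only an illustration of the principle under stated assumptions (a quadratic falloff weight and a precomputed coarse subsampling), not the repository's exact code:

```python
import numpy as np

def update_potentials_coarse(coarse_points, potentials, center, radius):
    """Update picking potentials on a coarse subsampling of the cloud
    instead of on every point: with large input spheres, tracking a
    subset is enough to keep the picking regular, and the update touches
    far fewer points. Illustrative sketch of the idea."""
    d2 = np.sum((coarse_points - center) ** 2, axis=1)
    inside = d2 < radius ** 2
    # Points near the sphere center get the largest increment
    potentials[inside] += np.square(1.0 - d2[inside] / radius ** 2)
    return potentials
```

The next sphere center is then chosen at the coarse point with the lowest potential, which keeps the picking spread evenly over the dataset.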

hadilou commented 3 years ago

Hi, thank you for your answer.

Actually, the coarser trees were in the TF implementation too.

Can you link to the TF code where coarser trees are used (for S3DIS)? I am not able to find it.

Did you notice a significant difference between training time with and without coarser trees?

I can't compare because of the error mentioned in my previous comment, but it makes sense that it would make the implementation faster.

I have a question regarding the break condition of the inference loop here:

https://github.com/HuguesTHOMAS/KPConv-PyTorch/blob/73e444d486cd6cb56122c3dd410e51c734064cfe/utils/tester.py#L460-L463

How can I relate num_votes to the minimum potential? I realized the testing lasts very long in my case. How do I choose a good value for num_votes?

HuguesTHOMAS commented 3 years ago

Can you link to the TF code where coarser trees are used (for S3DIS)? I am not able to find it.

Actually, you are right, I cannot find them. It might be that I coded them in my development repo but did not upload them to the public repo.

How can I relate num_votes to the minimum potential? How do I choose a good value for num_votes?

It is just a dummy condition that ensures we stop only when every part of the test cloud has been seen enough times, so that the predicted class probabilities are averaged properly. In practice, I usually let the test code run for a while until I see that the results remain stable. If you don't have time, you can stop when last_min reaches a value of 1, to be sure that every point gets seen at least once.
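The stopping rule described above can be sketched as follows (the function name is illustrative; potentials is assumed to be one array of visit counts per test cloud):

```python
import numpy as np

def reached_votes(potentials, num_votes):
    """Stop the test loop once the least-visited test point has been
    covered num_votes times: last_min is the minimum potential over all
    test clouds. Sketch of the break condition discussed above."""
    last_min = min(np.min(p) for p in potentials)
    return bool(last_min > num_votes)
```

With num_votes=1, this stops as soon as every point has been seen at least once, which is the cheap option suggested above.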

hadilou commented 3 years ago

Hi @HuguesTHOMAS, thanks for your previous insights. I noticed a performance drop between the TF and PyTorch implementations. Do you have any insight into where to look, or what the possible reasons might be? Thanks!

HuguesTHOMAS commented 3 years ago

Well, actually, it depends. I noticed performance drops on some datasets too, but on S3DIS, I noticed a performance increase... So I don't know what to think of the TF vs PyTorch implementations.