Dimension Error

AlanKoschel commented 2 years ago

Dear @HuguesTHOMAS, first of all, thank you very much for your implementation of KPConv. I am using the network to train on colored point clouds, 3D reconstructed from drone images. Training, validation and testing work very well, but as soon as I set batch_num=1 I encounter two errors:

First one:

Traceback (most recent call last):
  File "train_SVGEO.py", line 324, in <module>
    trainer.train(net, training_loader, test_loader, config)
  File "/user/KPConv-PyTorch/Experiments/KPConv-PyTorch/utils/trainer.py", line 200, in train
    outputs = net(batch, config)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/user/KPConv-PyTorch/Experiments/KPConv-PyTorch/models/architectures.py", line 345, in forward
    x = block_op(x, batch)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/user/KPConv-PyTorch/Experiments/KPConv-PyTorch/models/blocks.py", line 636, in forward
    x = self.leaky_relu(self.batch_norm_conv(x))
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/user/KPConv-PyTorch/Experiments/KPConv-PyTorch/models/blocks.py", line 457, in forward
    x = self.batch_norm(x)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/batchnorm.py", line 178, in forward
    self.eps,
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/functional.py", line 2279, in batch_norm
    _verify_batch_size(input.size())
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/functional.py", line 2247, in _verify_batch_size
    raise ValueError("Expected more than 1 value per channel when training, got input size {}".format(size))
ValueError: Expected more than 1 value per channel when training, got input size torch.Size([1, 256, 1])

Second one during validation:

Traceback (most recent call last):
  File "train_SVGEO.py", line 324, in <module>
    trainer.train(net, training_loader, test_loader, config)
  File "/user/KPConv-PyTorch/Experiments/KPConv-PyTorch/utils/trainer.py", line 283, in train
    self.validation(net, val_loader, config)
  File "/user/KPConv-PyTorch/Experiments/KPConv-PyTorch/utils/trainer.py", line 299, in validation
    self.cloud_segmentation_validation(net, val_loader, config)
  File "/user/KPConv-PyTorch/Experiments/KPConv-PyTorch/utils/trainer.py", line 487, in cloud_segmentation_validation
    outputs = net(batch, config)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/user/KPConv-PyTorch/Experiments/KPConv-PyTorch/models/architectures.py", line 345, in forward
    x = block_op(x, batch)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/user/KPConv-PyTorch/Experiments/KPConv-PyTorch/models/blocks.py", line 639, in forward
    x = self.unary2(x)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/user/KPConv-PyTorch/Experiments/KPConv-PyTorch/models/blocks.py", line 494, in forward
    x = self.batch_norm(x)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/user/KPConv-PyTorch/Experiments/KPConv-PyTorch/models/blocks.py", line 455, in forward
    x = x.unsqueeze(2)
IndexError: Dimension out of range (expected to be in range of [-2, 1], but got 2)

I am curious about what is happening there. In another issue you mentioned that you recommend training only with batch_num>=3, so the only reason I train with one batch is that I want to investigate its learning behaviour. When training another network with 1 batch per iteration, I found that the network learns nothing. So I wanted to see whether KPConv behaves the same way and whether the batch size is the cause.

Thanks in advance!

Edit: Both errors occur randomly, in different epochs and iterations each time.

HuguesTHOMAS commented 2 years ago

Hi @AlanKoschel,

I think I know what is going on here. In the batch norm function, I use a squeeze function to get rid of unnecessary dimensions. This means that if the input point cloud batch contains only one point, the point dimension is squeezed away too, which is the bug.
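A minimal sketch of how both tracebacks can arise from one single-point batch. This is a reconstruction, not the repository's code: the `unsqueeze(2)` and `batch_norm` calls appear in the tracebacks and the `squeeze` is described above, but the `transpose` steps and the helper name `bn_with_squeeze` are assumptions made for illustration.

```python
import torch
import torch.nn as nn


def bn_with_squeeze(bn, x):
    """Assumed shape handling around batch norm; x is [N, C] (points, channels)."""
    x = x.unsqueeze(2)      # [N, C, 1]
    x = x.transpose(0, 2)   # [1, C, N] (assumed step, not visible in the traceback)
    x = bn(x)
    x = x.transpose(0, 2)   # [N, C, 1]
    return x.squeeze()      # [N, C] when N > 1, but only [C] when N == 1


bn = nn.BatchNorm1d(256)
one_point = torch.randn(1, 256)

# Training mode: batch norm sees a single value per channel and refuses.
bn.train()
try:
    bn_with_squeeze(bn, one_point)
    train_error = None
except ValueError as err:
    train_error = str(err)

# Eval mode: batch norm runs, but squeeze() also drops the point dimension.
bn.eval()
out = bn_with_squeeze(bn, one_point)
out_shape = tuple(out.shape)

# A later block then receives a 1-D tensor, and unsqueeze(2) is out of range.
try:
    bn_with_squeeze(bn, out)
    eval_error = None
except IndexError as err:
    eval_error = str(err)

print(train_error)
print(out_shape)
print(eval_error)
```

Under these assumptions, the first failure corresponds to the `ValueError` raised during training and the second to the `IndexError` seen during validation, where batch norm itself no longer complains.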

Could you print the shape of your batch.points tensors (for each layer)? If it happens to be [1, 3] at any layer, then you have your culprit.
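One way to run this check, as a sketch: the helper name `find_single_point_layers` is hypothetical, and it only assumes that `batch.points` is a sequence with one tensor per layer, each exposing a `.shape`. The stand-in objects below replace real tensors so the snippet runs on its own.

```python
from types import SimpleNamespace


def find_single_point_layers(points_per_layer):
    """Print each layer's point shape and return the layers with at most one point."""
    suspects = []
    for layer, pts in enumerate(points_per_layer):
        shape = tuple(pts.shape)
        print(f"layer {layer}: points shape {shape}")
        if shape[0] <= 1:
            suspects.append((layer, shape))
    return suspects


# Stand-ins for per-layer point tensors (illustrative sizes only).
fake_points = [
    SimpleNamespace(shape=(1200, 3)),
    SimpleNamespace(shape=(310, 3)),
    SimpleNamespace(shape=(1, 3)),
]
suspects = find_single_point_layers(fake_points)
```

In the training loop, the same call could be placed just before `outputs = net(batch, config)`, passing `batch.points` instead of the stand-ins.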

There could be a way to fix this squeeze function so that no error is thrown (using reshape instead). But I don't think it should be corrected, as batch normalization is not supposed to be used on a single element. In your case, I suggest not using batch norm for your experiment, which, if I understood correctly, is for debugging purposes anyway. See the parameter:

https://github.com/HuguesTHOMAS/KPConv-PyTorch/blob/73e444d486cd6cb56122c3dd410e51c734064cfe/train_S3DIS.py#L151-L152
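For illustration, a sketch of what the reshape-based variant mentioned above might look like, together with a switch that replaces batch norm by a learned bias. The class name `ReshapeBatchNorm`, the `use_bn` flag, and the bias fallback are assumptions for this example and are not taken from the repository.

```python
import torch
import torch.nn as nn


class ReshapeBatchNorm(nn.Module):
    """Hypothetical per-point batch norm that keeps the [N, C] shape for any N."""

    def __init__(self, in_dim, use_bn=True, bn_momentum=0.02):
        super().__init__()
        self.use_bn = use_bn
        if use_bn:
            self.batch_norm = nn.BatchNorm1d(in_dim, momentum=bn_momentum)
        else:
            # Assumed fallback: a learned bias instead of normalization.
            self.bias = nn.Parameter(torch.zeros(in_dim))

    def forward(self, x):
        if not self.use_bn:
            return x + self.bias
        n, c = x.shape
        x = x.t().reshape(1, c, n)   # [1, C, N], statistics taken over the N points
        x = self.batch_norm(x)
        return x.reshape(c, n).t()   # back to [N, C], even when N == 1


# Eval mode with one point: the output keeps its two dimensions.
bn = ReshapeBatchNorm(8).eval()
eval_shape = tuple(bn(torch.randn(1, 8)).shape)

# Batch norm disabled: a single point also works in training mode.
no_bn = ReshapeBatchNorm(8, use_bn=False).train()
train_shape = tuple(no_bn(torch.randn(1, 8)).shape)
```

Note that with batch norm enabled, a single point in training mode would still raise the `ValueError`, since PyTorch rejects one value per channel regardless of how the tensor is reshaped. That is consistent with the suggestion to disable batch norm for a batch_num=1 experiment.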

AlanKoschel commented 2 years ago

Hi @HuguesTHOMAS , thanks for your detailed explanation, I will check that soon!

HuguesTHOMAS / KPConv-PyTorch

Dimension Error #129