Reproducibility - GitHub Issues

tchaton commented 4 years ago

Dear @erikwijmans ,

We are currently trying to reproduce all the SOTA models in PyTorch within a nicely wrapped framework.

Here is the link of the repo: https://github.com/nicolas-chaulet/deeppointcloud-benchmarks

We are integrating your implementation into the framework. However, we are running into trouble with the gradients, and it is also slightly slower than yours.

I was wondering how autograd is handled within torch extensions. My intuition is the following.

If you have a function_name, you need to add a {function_name}_grad implementation which computes the gradients, and torch will find it automatically. Is that correct? We have put the kernels in the following repo to make them easier to access and self-contained: https://github.com/nicolas-chaulet/torch-points

Do you have any intuition about what could be causing the trouble with the gradients?

  File "/home/thomas/HELIX/research/deeppointcloud-benchmarks/models/base_model.py", line 73, in optimize_parameters
    self.backward()              # calculate gradients
  File "/home/thomas/HELIX/research/deeppointcloud-benchmarks/models/pointnet2_customkernel/nn.py", line 61, in backward
    self.loss_seg.backward()
  File "/home/thomas/.cache/pypoetry/virtualenvs/superpoint-graph-job-py3.6/lib/python3.6/site-packages/torch/tensor.py", line 107, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/home/thomas/.cache/pypoetry/virtualenvs/superpoint-graph-job-py3.6/lib/python3.6/site-packages/torch/autograd/__init__.py", line 93, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: cuda runtime error (59) : device-side assert triggered at /pytorch/aten/src/THC/generic/THCTensorScatterGather.cu:194

   self.loss_seg = F.nll_loss(self.output, self.labels)
  File "/home/thomas/.cache/pypoetry/virtualenvs/superpoint-graph-job-py3.6/lib/python3.6/site-packages/torch/tensor.py", line 107, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/home/thomas/.cache/pypoetry/virtualenvs/superpoint-graph-job-py3.6/lib/python3.6/site-packages/torch/autograd/__init__.py", line 93, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: CUDA error: device-side assert triggered
/pytorch/aten/src/THC/THCTensorScatterGather.cu:130: void THCudaTensor_scatterKernel(TensorInfo<Real, IndexType>, TensorInfo<Real, IndexType>, TensorInfo<long, IndexType>, int, IndexType) [with IndexType = unsigned int, Real = float, Dims = 2]: block: [0,0,0], thread: [0,0,0] Assertion `indexValue >= 0 && indexValue < tensor.sizes[dim]` failed.

erikwijmans commented 4 years ago

> I was wondering how autograd is handled within torch extensions.

Registering the extensions with autograd is done via a torch.autograd.Function, like here: https://github.com/nicolas-chaulet/torch-points/blob/master/torch_points/torchpoints.py#L40. You can read through this tutorial: https://pytorch.org/tutorials/advanced/cpp_extension.html
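
To make the pattern concrete, here is a minimal sketch of wrapping an op in a torch.autograd.Function. The op itself (`GatherPoints`, written in plain PyTorch so it runs on CPU) is a hypothetical stand-in for a compiled kernel and is not the code in torch-points. The point it illustrates: as far as I know, autograd does not look up a `{function_name}_grad` kernel by name; the `backward` method has to call it explicitly.

```python
import torch
from torch.autograd import Function


class GatherPoints(Function):
    """Hypothetical stand-in op: out[b, c, j] = features[b, c, idx[b, j]]."""

    @staticmethod
    def forward(ctx, features, idx):
        # features: (B, C, N) float tensor
        # idx:      (B, M) int64 tensor with values in [0, N)
        ctx.save_for_backward(idx)
        ctx.num_points = features.size(2)
        # A real extension would call its compiled forward kernel here.
        index = idx.unsqueeze(1).expand(-1, features.size(1), -1)
        return features.gather(2, index)

    @staticmethod
    def backward(ctx, grad_out):
        (idx,) = ctx.saved_tensors
        B, C, _ = grad_out.shape
        # A real extension would call its hand-written *_grad kernel here.
        grad_features = grad_out.new_zeros(B, C, ctx.num_points)
        index = idx.unsqueeze(1).expand(-1, C, -1)
        grad_features.scatter_add_(2, index, grad_out)
        # One return value per forward input; idx is not differentiable.
        return grad_features, None


gather_points = GatherPoints.apply

# Usage: compare the hand-written backward against numerical gradients.
features = torch.randn(2, 3, 5, dtype=torch.double, requires_grad=True)
idx = torch.tensor([[0, 4, 4], [1, 1, 2]])
gradcheck_ok = torch.autograd.gradcheck(gather_points, (features, idx))
```

Running `gradcheck` on double-precision inputs is a cheap way to catch a backward that disagrees with the forward before training on real data.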

Backward pass debugging

First thing to do is enable CUDA_LAUNCH_BLOCKING via export CUDA_LAUNCH_BLOCKING=1. That'll help localize the failure.

The device-side assert seems to be throwing a runtime error, so https://pytorch.org/docs/stable/autograd.html#anomaly-detection may work and would help you to localize the error further.
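
A small sketch of these debugging steps from inside Python. The assertion in the trace (`indexValue >= 0 && indexValue < tensor.sizes[dim]`) is an index-range check, so one thing worth ruling out is a label outside the number of classes; that this is the cause here is only an assumption. `check_class_indices` is a hypothetical helper, not part of PyTorch.

```python
import os

# Has to be set before the first CUDA call (ideally before importing torch),
# which is why it is usually exported in the shell instead.
os.environ.setdefault("CUDA_LAUNCH_BLOCKING", "1")


def check_class_indices(labels, num_classes):
    """Raise a readable error if any label is outside [0, num_classes)."""
    bad = sorted({int(v) for v in labels if not 0 <= int(v) < num_classes})
    if bad:
        raise ValueError(f"labels {bad} are outside [0, {num_classes})")


# Hypothetical usage with PyTorch tensors (names taken from the traceback):
#   check_class_indices(self.labels.flatten().tolist(), self.output.size(1))
#   with torch.autograd.detect_anomaly():
#       self.loss_seg.backward()
```

Doing the range check on the CPU gives a normal Python exception with the offending values, instead of a device-side assert.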

Btw, using F.nll_loss (and I assume there is a log_softmax elsewhere) for classification is not recommended; using https://pytorch.org/docs/stable/nn.functional.html?highlight=cross_entropy#torch.nn.functional.cross_entropy provides better numerical stability.
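
A quick sketch comparing the two forms on toy data (shapes and the 13 classes are arbitrary). As far as the PyTorch docs describe it, cross_entropy combines log_softmax and nll_loss, so the two agree when log_softmax is used; the stability problem appears if the model computes log(softmax(x)) instead.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(4, 13)            # (batch, num_classes)
labels = torch.tensor([0, 5, 12, 3])   # class indices in [0, 13)

# Two-step form: the model ends in log_softmax and the loss is nll_loss.
two_step = F.nll_loss(F.log_softmax(logits, dim=1), labels)

# Fused form: the model outputs raw logits.
fused = F.cross_entropy(logits, labels)

# The risky variant is log(softmax(x)): softmax underflows to 0 for the
# smaller logit, and log(0) is -inf.
big = torch.tensor([[1000.0, 0.0]])
naive = torch.log(F.softmax(big, dim=1))
stable = F.log_softmax(big, dim=1)
```

Switching to cross_entropy also removes the need to remember the log_softmax layer at the end of the model.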

tchaton commented 4 years ago

Hey @erikwijmans,

Thanks for answering me :)

Here is the error I am getting.

CUDA kernel failed : an illegal memory access was encountered
void three_interpolate_kernel_wrapper(int, int, int, int, const float*, const int*, const float*, float*) at L:110 in cuda/src/interpolate_gpu.cu

After looking deeper, I figured out where the bug was coming from. In my implementation, I had an innermost layer that performed a global_max_pooling. When three_interpolate_kernel_wrapper then interpolated back from 1 -> N points, the kernel accessed illegal memory. After removing this layer it seems to work, but it is twice as slow and doesn't seem to train as well. I need to dig more into it :)
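
For reference, a pure-Python sketch of inverse-distance interpolation from the nearest known points, with a guard for the 1 -> N case. It assumes the illegal access came from asking for three neighbours when only one known point exists, which I have not confirmed. `interpolate_features` is a hypothetical helper, not the CUDA kernel.

```python
def interpolate_features(unknown_xyz, known_xyz, known_feats, k=3, eps=1e-8):
    """Inverse-distance interpolation of features from the k nearest known points.

    unknown_xyz: list of (x, y, z) points to fill in      (N points)
    known_xyz:   list of (x, y, z) points with features   (M points)
    known_feats: list of M feature lists, all the same length
    """
    if not known_xyz:
        raise ValueError("need at least one known point")
    k = min(k, len(known_xyz))  # guard: never index past the M known points
    out = []
    for p in unknown_xyz:
        # Squared distance to every known point, paired with its index.
        dists = sorted(
            (sum((a - b) ** 2 for a, b in zip(p, q)), i)
            for i, q in enumerate(known_xyz)
        )[:k]
        weights = [1.0 / (d + eps) for d, _ in dists]
        total = sum(weights)
        feat = [0.0] * len(known_feats[0])
        for w, (_, i) in zip(weights, dists):
            for c, value in enumerate(known_feats[i]):
                feat[c] += (w / total) * value
        out.append(feat)
    return out


# 1 -> N case: every unknown point simply copies the single global feature.
copied = interpolate_features([(0, 0, 0), (1, 1, 1)], [(5, 5, 5)], [[2.0, 4.0]])
```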

Best, Thomas Chaton.

tchaton commented 4 years ago

Dear @erikwijmans,

It is pretty fascinating. I have exactly the same model as yours with the PyTorch Geometric dataloader, yet it takes twice as long to train on S3DIS.

Do you have any idea where this might come from?

Here is the code of the models: https://github.com/nicolas-chaulet/deeppointcloud-benchmarks/blob/pn2/models/pointnet2_customkernel/nn.py and here is the dataset https://github.com/nicolas-chaulet/deeppointcloud-benchmarks/blob/pn2/datasets/s3dis_dataset.py

My guess is that pin_memory provides a strong boost, but the Data / Batch object from PyTorch Geometric doesn't handle it properly. I added the function internally, and the iteration time went down from 0.8 to 0.65, whereas your code is around 0.35.

But it could be something else entirely.

Best, Thomas Chaton

erikwijmans commented 4 years ago

torch.backends.cudnn.enabled = False may be doing it.

tchaton commented 4 years ago

Hey @erikwijmans, I think you might be right. I removed it and got a speed-up.

Which is pretty interesting, as I don't have cuDNN installed.

But yours is still faster. I have watched the GPU usage, and I am between 60-85% while yours is always at 99-100%. I think it is coming from pin_memory, which doesn't work well with PyG data objects. I am going to try without it to check.

Also, I have tried the https://github.com/sshaoshuai/Pointnet2.PyTorch code by Shaoshuai Shi, which claims to have faster kernels. I started to benchmark them: https://github.com/tchaton/torch-points/blob/master/test/test_speed.py I have only benchmarked one so far, and it was slightly slower.
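
For this kind of comparison, a minimal timing helper might look like the sketch below. `time_iterations` is a hypothetical name and not what test_speed.py does. The `sync` hook matters on GPU because CUDA kernels launch asynchronously, so without a synchronize the timer can stop before the work is finished.

```python
import time


def time_iterations(step, n_iters=20, warmup=3, sync=None):
    """Return the mean wall-clock seconds per call of `step`.

    sync: optional callable that blocks until queued GPU work is done,
          e.g. torch.cuda.synchronize.
    """
    for _ in range(warmup):  # warm-up calls are not timed
        step()
    if sync is not None:
        sync()
    start = time.perf_counter()
    for _ in range(n_iters):
        step()
    if sync is not None:
        sync()
    return (time.perf_counter() - start) / n_iters


# Hypothetical usage:
#   seconds = time_iterations(lambda: kernel(xyz), sync=torch.cuda.synchronize)
```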

Collecting environment information...
PyTorch version: 1.1.0
Is debug build: No
CUDA used to build PyTorch: 9.0.176

OS: Ubuntu 18.04.2 LTS
GCC version: (Ubuntu 8.3.0-6ubuntu1~18.04.1) 8.3.0
CMake version: version 3.10.2

Python version: 3.6
Is CUDA available: Yes
CUDA runtime version: 9.1.85
GPU models and configuration: GPU 0: GeForce RTX 2060
Nvidia driver version: 418.67
cuDNN version: Could not collect

Versions of relevant libraries:
[pip3] etw-pytorch-utils==1.1.1
[pip3] numpy==1.18.0
[pip3] numpy-indexed==0.3.5
[pip3] torch==1.1.0
[pip3] torch-cluster==1.4.5
[pip3] torch-geometric==1.3.2
[pip3] torch-nearest-neighbors==0.0.0
[pip3] torch-points==0.1.2
[pip3] torch-scatter==1.4.0
[pip3] torch-sparse==0.4.3
[pip3] torchfile==0.1.0
[pip3] torchnet==0.0.4
[pip3] torchvision==0.2.0
[conda] Could not collect

erikwijmans commented 4 years ago

> I think it is coming from pin_memory, which doesn't work well with PyG data objects. I am going to try without it to check.

Yeah, pin memory can make a big difference.
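
For custom batch objects, the DataLoader only pins what it knows how to pin: tensors, standard containers of tensors, or any object that defines its own `pin_memory()` method. A minimal sketch of the latter, where `PointBatch` is a hypothetical stand-in for a PyG-style Batch and the toy dataset is made up:

```python
import torch
from torch.utils.data import DataLoader


class PointBatch:
    """Hypothetical container standing in for a PyG-style Batch object."""

    def __init__(self, samples):
        pos, labels = zip(*samples)
        self.pos = torch.stack(pos)    # (B, N, 3)
        self.y = torch.stack(labels)   # (B, N)

    def pin_memory(self):
        # Called by DataLoader on custom batch types when pin_memory=True.
        self.pos = self.pos.pin_memory()
        self.y = self.y.pin_memory()
        return self


def collate(samples):
    return PointBatch(samples)


# Toy dataset: 8 clouds of 16 points with per-point labels in [0, 13).
dataset = [(torch.randn(16, 3), torch.randint(0, 13, (16,))) for _ in range(8)]
loader = DataLoader(
    dataset,
    batch_size=4,
    collate_fn=collate,
    pin_memory=torch.cuda.is_available(),  # pinning needs a CUDA-enabled setup
)
```

With pinned batches, `batch.pos.cuda(non_blocking=True)` can overlap the host-to-device copy with compute, which is where most of the benefit comes from.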

> Also, I have tried the https://github.com/sshaoshuai/Pointnet2.PyTorch code by Shaoshuai Shi, which claims to have faster kernels.

I saw that repo at one point. The only difference I saw is that they use a thread per unit of work that needs to be done (where the unit of work is what would be done in the innermost for loop if you think about implementing these ops as a nested loop). In my experience, that is almost always slower unless the unit of work is huge (which it isn't for any of the PointNet++ ops), as there is non-negligible overhead in spawning and managing threads. With that said, tuning CUDA kernels is a dark art and is highly system and workload dependent, so this method may indeed be faster for them, or the balance I tried to strike with these kernels is wrong for them.

tchaton commented 4 years ago

Hey @erikwijmans,

Here are the results I get after 100 epochs: https://github.com/nicolas-chaulet/deeppointcloud-benchmarks/blob/master/benchmark/s3dis_fold5/Pointnet2_original.md

BEST: 
* loss_seg: 0.051259834319353104
* acc: 85.26667395303411
* miou: 45.583527852040845
* macc: 73.11160574504926
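
For readers comparing numbers, here is a sketch of how acc, macc and miou are commonly computed from a confusion matrix. It uses the usual definitions and is an assumption about the benchmark, not a copy of its code; in particular, how absent classes are handled may differ.

```python
def segmentation_metrics(confusion):
    """Compute overall accuracy, mean class accuracy and mean IoU (in percent).

    confusion[t][p] counts points whose true class is t and predicted class is p.
    Classes that never occur (as ground truth or prediction) are skipped.
    """
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[c][c] for c in range(n))
    class_acc, class_iou = [], []
    for c in range(n):
        tp = confusion[c][c]
        gt = sum(confusion[c])                          # true points of class c
        pred = sum(confusion[t][c] for t in range(n))   # points predicted as c
        if gt:
            class_acc.append(tp / gt)
        if gt + pred - tp:
            class_iou.append(tp / (gt + pred - tp))
    return {
        "acc": 100.0 * correct / total,
        "macc": 100.0 * sum(class_acc) / len(class_acc),
        "miou": 100.0 * sum(class_iou) / len(class_iou),
    }


# Toy two-class example.
metrics = segmentation_metrics([[3, 1], [1, 5]])
```

A large gap between acc and miou, as in the results above, usually means a few frequent classes dominate the overall accuracy.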

albertotono commented 3 years ago

Dear @tchaton, does your implementation take into consideration that the architecture of the segmentation part differs from the original one in TensorFlow? See the reference here: https://github.com/erikwijmans/Pointnet2_PyTorch/issues/66

erikwijmans / Pointnet2_PyTorch

Reproducibility #85