Open tchaton opened 4 years ago
I was wondering how autograd is handled within torch extensions.
Registering the extensions with autograd is done via a torch.autograd.Function, like here: https://github.com/nicolas-chaulet/torch-points/blob/master/torch_points/torchpoints.py#L40. You can read through this tutorial: https://pytorch.org/tutorials/advanced/cpp_extension.html
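As a minimal sketch of that pattern (not the torch-points code itself): the toy op below, a hypothetical `GatherRows`, pairs a forward with a hand-written backward inside one `torch.autograd.Function`. In a real extension the two method bodies would call the compiled forward and backward kernels; plain tensor indexing stands in for them here. There is no automatic lookup of a `{function_name}_grad` symbol; the pairing is whatever `backward` calls.

```python
import torch
from torch.autograd import Function


class GatherRows(Function):
    """Toy op (hypothetical): out[i] = features[idx[i]]."""

    @staticmethod
    def forward(ctx, features, idx):
        # A real extension would call its compiled forward kernel here.
        ctx.save_for_backward(idx)
        ctx.num_rows = features.shape[0]
        return features[idx]

    @staticmethod
    def backward(ctx, grad_out):
        # A real extension would call its compiled backward kernel here.
        (idx,) = ctx.saved_tensors
        grad_shape = (ctx.num_rows,) + tuple(grad_out.shape[1:])
        grad_features = grad_out.new_zeros(grad_shape)
        # Rows gathered more than once accumulate their gradients.
        grad_features.index_add_(0, idx, grad_out)
        # One return value per forward input; idx is not differentiable.
        return grad_features, None


# Compare the hand-written backward against numerical gradients.
features = torch.randn(5, 3, dtype=torch.double, requires_grad=True)
idx = torch.tensor([0, 2, 2, 4])
gradcheck_ok = torch.autograd.gradcheck(GatherRows.apply, (features, idx))
```

Running `torch.autograd.gradcheck` on each custom op in double precision is a cheap way to find out whether gradient trouble comes from the backward implementation or from somewhere else in the model.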
Backward pass debugging
The first thing to do is enable CUDA_LAUNCH_BLOCKING via export CUDA_LAUNCH_BLOCKING=1. That'll help localize the failure.
The device-side assert seems to be throwing a runtime error, so https://pytorch.org/docs/stable/autograd.html#anomaly-detection may work and would help you to localize the error further.
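If exporting the variable in the shell is inconvenient, the same setting can be sketched from Python. This assumes the variable is set before CUDA is initialized in the process, so the safest place is the very top of the entry script, before torch is imported.

```python
import os

# Assumption: this runs before any CUDA work in the process (ideally before
# `import torch`), otherwise the setting may not take effect.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

# With blocking launches, a failing kernel is reported at the Python line
# that launched it. Anomaly detection can then be switched on with
# torch.autograd.set_detect_anomaly(True) to trace a failing backward op.
launch_blocking = os.environ.get("CUDA_LAUNCH_BLOCKING")
```

Both settings slow training down noticeably, so they are best kept to debugging runs.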
Btw, using F.nll_loss (and I assume there is a log_softmax elsewhere) for classification is not recommended; using https://pytorch.org/docs/stable/nn.functional.html?highlight=cross_entropy#torch.nn.functional.cross_entropy provides better numerical stability.
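The stability argument can be sketched in plain Python, without calling PyTorch. The two hypothetical helpers below compute the same negative log-likelihood: one takes a softmax and then a log as separate steps, the other uses the fused log-sum-exp form. Whether the difference matters in practice depends on how the log-probabilities fed to F.nll_loss were produced.

```python
import math


def naive_nll(logits, target):
    """-log(softmax(logits)[target]), computed as two separate steps."""
    exps = [math.exp(z) for z in logits]
    return -math.log(exps[target] / sum(exps))


def stable_nll(logits, target):
    """Same quantity via log-sum-exp, shifting by the largest logit first."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]


small = [1.0, 2.0, 3.0]
large = [1000.0, 0.0, -1000.0]

# On moderate logits the two agree.
agree = abs(naive_nll(small, 2) - stable_nll(small, 2)) < 1e-12

# On large logits the naive form overflows; the fused form does not.
try:
    naive_nll(large, 0)
    naive_overflowed = False
except OverflowError:
    naive_overflowed = True
stable_large = stable_nll(large, 0)
```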
Hey @erikwijmans,
Thanks for answering me :)
Here is the error I am getting.
CUDA kernel failed : an illegal memory access was encountered
void three_interpolate_kernel_wrapper(int, int, int, int, const float*, const int*, const float*, float*) at L:110 in cuda/src/interpolate_gpu.cu
After looking deeper, I figured out where the bug was coming from. In my implementation, I had an innermost layer which was performing a global_max_pooling. When three_interpolate_kernel_wrapper then interpolated back from 1 -> N points, the kernel was accessing illegal memory. Removing this layer seemed to fix it, but the model is twice as slow and doesn't seem to train as well. I need to dig more into it :)
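One way to narrow this kind of failure down is a bounds-checked CPU reference. The sketch below is an assumption on two counts: it takes the op to compute out[c][n] = sum_k weight[n][k] * features[c][idx[n][k]] for a single batch element, and it guesses that out-of-range neighbour indices are the culprit, which this thread does not confirm. The function name `three_interpolate_ref` is hypothetical.

```python
def three_interpolate_ref(features, idx, weight):
    """Bounds-checked reference for one batch element.

    features: C lists of M floats (M source points)
    idx, weight: N lists of 3 entries (3 neighbours per target point)
    returns: C lists of N floats
    """
    num_src = len(features[0])
    for n, neighbours in enumerate(idx):
        for i in neighbours:
            if not 0 <= i < num_src:
                raise IndexError(
                    f"target {n}: index {i} outside [0, {num_src})"
                )
    return [
        [
            sum(w * channel[i] for i, w in zip(idx[n], weight[n]))
            for n in range(len(idx))
        ]
        for channel in features
    ]


# One source point (as after a global pooling), two target points.
pooled = [[2.0], [4.0]]
weights = [[0.5, 0.25, 0.25], [0.5, 0.25, 0.25]]
ok = three_interpolate_ref(pooled, [[0, 0, 0], [0, 0, 0]], weights)

# Indices that assume three distinct source points are rejected up front.
try:
    three_interpolate_ref(pooled, [[0, 1, 2], [0, 1, 2]], weights)
    rejected = False
except IndexError:
    rejected = True
```

Running the same inputs through a check like this before the CUDA call turns a silent illegal access into a readable Python exception.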
Best, Thomas Chaton.
Dear @erikwijmans,
It is pretty fascinating. I have exactly the same model as yours with the PyTorch Geometric dataloader, yet it takes twice as long to train on s3dis.
Do you have any idea where it might come from?
Here is the code of the models: https://github.com/nicolas-chaulet/deeppointcloud-benchmarks/blob/pn2/models/pointnet2_customkernel/nn.py and here is the dataset https://github.com/nicolas-chaulet/deeppointcloud-benchmarks/blob/pn2/datasets/s3dis_dataset.py
My opinion is that pin_memory provides a strong boost, but the Data / Batch object from PyTorch Geometric doesn't handle it properly. I added the function internally, and it goes down from 0.8 to 0.65 per iteration, while your code is around 0.35.
But it could be something else entirely.
Best, Thomas Chaton
torch.backends.cudnn.enabled = False may be doing it.
Hey @erikwijmans, I think you might be right. I removed it and got a speed-up, which is pretty interesting as I don't have cuDNN installed.
But yours is still faster. I have observed the GPU usage, and I am between 60-85% while yours is always at 99-100%. I think it is coming from pin_memory, which doesn't work well with PyG data objects. I am going to try without it to check.
Also, I have tried the https://github.com/sshaoshuai/Pointnet2.PyTorch code by Shaoshuai Shi, which claims to have faster kernels.
I started to benchmark them: https://github.com/tchaton/torch-points/blob/master/test/test_speed.py
So far I have only benchmarked one, and it was slightly slower.
Collecting environment information...
PyTorch version: 1.1.0
Is debug build: No
CUDA used to build PyTorch: 9.0.176
OS: Ubuntu 18.04.2 LTS
GCC version: (Ubuntu 8.3.0-6ubuntu1~18.04.1) 8.3.0
CMake version: version 3.10.2
Python version: 3.6
Is CUDA available: Yes
CUDA runtime version: 9.1.85
GPU models and configuration: GPU 0: GeForce RTX 2060
Nvidia driver version: 418.67
cuDNN version: Could not collect
Versions of relevant libraries:
[pip3] etw-pytorch-utils==1.1.1
[pip3] numpy==1.18.0
[pip3] numpy-indexed==0.3.5
[pip3] torch==1.1.0
[pip3] torch-cluster==1.4.5
[pip3] torch-geometric==1.3.2
[pip3] torch-nearest-neighbors==0.0.0
[pip3] torch-points==0.1.2
[pip3] torch-scatter==1.4.0
[pip3] torch-sparse==0.4.3
[pip3] torchfile==0.1.0
[pip3] torchnet==0.0.4
[pip3] torchvision==0.2.0
[conda] Could not collect
I think it is coming from pin_memory, which doesn't work well with PyG data objects. I am going to try without it to check.
Yeah, pin memory can make a big difference.
Also, I have tried https://github.com/sshaoshuai/Pointnet2.PyTorch code by Shaoshuai Shi which is claiming to have faster kernels.
I saw that repo at one point. The only difference I saw is that they use a thread per unit of work (where the unit of work is what would be done in the innermost for loop if you think about implementing these ops as a nested loop). In my experience, that is almost always slower unless the unit of work is huge (which it isn't for any of the PointNet++ ops), as there is non-negligible overhead in spawning and managing threads. With that said, tuning CUDA kernels is a dark art and is highly system and workload dependent, so this method may indeed be faster for them, or the balance I tried to strike with these kernels is wrong for them.
Hey @erikwijmans,
Here are the results I get after 100 epochs: https://github.com/nicolas-chaulet/deeppointcloud-benchmarks/blob/master/benchmark/s3dis_fold5/Pointnet2_original.md
BEST:
* loss_seg: 0.051259834319353104
* acc: 85.26667395303411
* miou: 45.583527852040845
* macc: 73.11160574504926
Dear @tchaton, does your implementation take into consideration that the segmentation architecture differs from the original one in TensorFlow? See the reference here: https://github.com/erikwijmans/Pointnet2_PyTorch/issues/66
Dear @erikwijmans ,
We are currently trying to reproduce all the SOTA models in PyTorch within a nicely wrapped framework.
Here is the link to the repo: https://github.com/nicolas-chaulet/deeppointcloud-benchmarks
We are integrating your implementation within the framework. However, we are having trouble with the gradients, and it is also slightly slower than yours.
I was wondering how autograd is handled within torch extensions. My intuition is the following.
If you have a function_name, you need to add a {function_name}_grad implementation that computes the gradients, and torch will find it automatically. Is that correct? We have put the kernels in the following repo to make them self-contained and easier to access: https://github.com/nicolas-chaulet/torch-points
Do you have an intuition about what could be causing the trouble with the gradients?