NVIDIA / MinkowskiEngine

Minkowski Engine is an auto-diff neural network library for high-dimensional sparse tensors
https://nvidia.github.io/MinkowskiEngine

CUDA error: misaligned address #488

Open ZhouMengjie opened 2 years ago

ZhouMengjie commented 2 years ago

Hi, I am using MinkowskiEngine 0.5.4 to build my network. When I use a larger batch size, e.g. 32, 64 or 128, a "CUDA error: misaligned address" occurs. The full traceback is shown below:

  File "training/train.py", line 56, in <module>
    do_train(dataloaders, train_sampler, params, debug=args.debug)
  File "/mnt/lustre/zhoumengjie/Image-to-2-5DMap/training/trainer_backup.py", line 230, in do_train
    loss.backward()
  File "/mnt/lustre/zhoumengjie/.conda/envs/zmj-mink/lib/python3.8/site-packages/torch/tensor.py", line 221, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/mnt/lustre/zhoumengjie/.conda/envs/zmj-mink/lib/python3.8/site-packages/torch/autograd/__init__.py", line 130, in backward
    Variable._execution_engine.run_backward(
  File "/mnt/lustre/zhoumengjie/.conda/envs/zmj-mink/lib/python3.8/site-packages/torch/autograd/function.py", line 89, in apply
    return self._forward_cls.backward(self, *args)  # type: ignore
  File "/mnt/lustre/zhoumengjie/.conda/envs/zmj-mink/lib/python3.8/site-packages/MinkowskiEngine-0.5.4-py3.8-linux-x86_64.egg/MinkowskiEngine/MinkowskiBroadcast.py", line 87, in backward
    grad_in_feat, grad_in_feat_glob = bw_fn(
RuntimeError: misaligned address at /mnt/lustre/zhoumengjie/Image-to-2-5DMap/MinkowskiEngine/src/broadcast_kernel.cu:402
terminate called after throwing an instance of 'c10::Error'
  what():  CUDA error: misaligned address

It looks like the error happens during the backward pass (in MinkowskiBroadcast). With batch size 32, the error occurs at one specific batch of a specific epoch. I checked the data and found that this batch contains far more points (600,000+) than the others, which have around 300,000+ points each. After downsampling the point cloud further, training works with batch size 32. However, larger batch sizes still hit the same limit. Is it possible to make MinkowskiEngine process a larger batch of data without downsampling? I'm looking forward to a more effective method.
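For reference, one workaround I am considering (my own assumption, not a confirmed fix for the underlying kernel error) is to cap the total number of points per batch instead of fixing the number of samples. The sketch below is plain Python and does not touch MinkowskiEngine; the helper name `batches_under_point_budget` and the 400,000-point budget are hypothetical, with the budget picked only because batches of about 300,000 points ran while the 600,000-point batch failed.

```python
from typing import List, Sequence


def batches_under_point_budget(
    point_counts: Sequence[int],
    max_points: int,
    max_batch_size: int,
) -> List[List[int]]:
    """Greedily group sample indices so each batch stays within a point budget.

    point_counts[i] is the number of points in sample i. A single sample that
    exceeds max_points on its own still gets a batch to itself.
    """
    batches: List[List[int]] = []
    current: List[int] = []
    current_points = 0
    for idx, n in enumerate(point_counts):
        # Close the current batch if adding this sample would break either limit.
        too_many_points = current_points + n > max_points
        too_many_samples = len(current) >= max_batch_size
        if current and (too_many_points or too_many_samples):
            batches.append(current)
            current, current_points = [], 0
        current.append(idx)
        current_points += n
    if current:
        batches.append(current)
    return batches


if __name__ == "__main__":
    # Hypothetical per-sample point counts, for illustration only.
    counts = [120_000, 90_000, 200_000, 150_000, 60_000]
    for batch in batches_under_point_budget(counts, max_points=400_000, max_batch_size=32):
        print(batch, sum(counts[i] for i in batch))
```

The resulting index lists could be passed to `torch.utils.data.DataLoader` through its `batch_sampler` argument, so the number of samples per batch varies while the point count stays bounded. This only avoids the oversized batches; it does not explain why the broadcast kernel fails on them.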