dvlab-research / LargeKernel3D

LargeKernel3D: Scaling up Kernels in 3D Sparse CNNs (CVPR 2023)
https://arxiv.org/abs/2206.10555
Apache License 2.0
197 stars 8 forks

RuntimeError: /tmp/pip-build-env-oq41cytq/overlay/lib/python3.8/site-packages/cumm/include/tensorview/tensor.h(770) stride_valid assert faild. non-contiguous stride can't handled. #11

Open lda1049187465 opened 1 year ago

lda1049187465 commented 1 year ago

Hello, I encountered the following error while using the Det3D code for training. How can I solve it?

2023-06-16 15:43:45,501 - INFO - Start running, host: root@autodl-container-28d9119efa-79377d73, work_dir: /root/FocalsConv-master/CenterPoint/work_dirs/CONFIG
2023-06-16 15:43:45,501 - INFO - workflow: [('train', 1)], max: 20 epochs
2023-06-16 15:43:51,726 - INFO - finding looplift candidates
2023-06-16 15:43:51,755 - INFO - finding looplift candidates
2023-06-16 15:43:51,948 - INFO - finding looplift candidates
2023-06-16 15:43:52,046 - INFO - finding looplift candidates
2023-06-16 15:43:52,075 - INFO - finding looplift candidates
2023-06-16 15:43:52,119 - INFO - finding looplift candidates
[Exception|indice_conv|subm]feat=torch.Size([135499, 16]),w=torch.Size([7, 7, 7, 16, 16]),pair=torch.Size([2, 343, 135499]),pairnum=tensor([
    4799, 9601, 11036, 4431, 3988, 3099, 2711, 3610, 6719, 13681,
    8034, 4572, 4015, 2955, 2820, 4329, 11488, 13846, 5758, 4371,
    3113, 2780, 3744, 7171, 16534, 9582, 4290, 3497, 2950, 3748,
    4348, 10300, 13840, 5414, 3772, 2833, 3077, 3417, 4984, 13013,
    9209, 4316, 2435, 2449, 2808, 3222, 7937, 12206, 5862, 5213,
    9682, 10098, 4573, 4173, 3384, 2996, 4151, 7365, 13003, 7792,
    5035, 4334, 3385, 3423, 5119, 12062, 13400, 6321, 4835, 3552,
    3306, 4584, 8450, 16577, 9596, 4787, 3708, 3381, 4466, 5524,
    11318, 13242, 5704, 3935, 3263, 3694, 4214, 6031, 12766, 8758,
    4469, 2813, 2798, 3321, 3851, 8373, 11029, 5845, 5048, 11281,
    11257, 5583, 5197, 4622, 4496, 4608, 8173, 15342, 9037, 6700,
    6068, 5341, 4580, 6133, 14271, 15437, 8596, 7319, 5876, 4787,
    6313, 11034, 20252, 11914, 7667, 6330, 5312, 6715, 8446, 15682,
    15897, 8205, 7022, 5608, 6370, 7243, 9905, 17199, 11063, 7679,
    5535, 5803, 6578, 7338, 13694, 15088, 9110, 42222, 50556, 46215,
    36413, 35958, 36707, 36908, 38101, 47028, 56527, 43580, 41221, 41523,
    39315, 37325, 43368, 62136, 65413, 52124, 45853, 40064, 38684, 45057,
    64356,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], device='cuda:0', dtype=torch.int32),act=135499,algo=ConvAlgo.Native
SPCONV_DEBUG_SAVE_PATH not found, you can specify SPCONV_DEBUG_SAVE_PATH as debug data save path to save debug data which can be attached in a issue.
Traceback (most recent call last):
  File "./tools/train.py", line 146, in <module>
    main()
  File "./tools/train.py", line 134, in main
    train_detector(
  File "/root/FocalsConv-master/CenterPoint/det3d/torchie/apis/train.py", line 331, in train_detector
    trainer.run(data_loaders, cfg.workflow, cfg.total_epochs, local_rank=cfg.local_rank)
  File "/root/FocalsConv-master/CenterPoint/det3d/torchie/trainer/trainer.py", line 553, in run
    epoch_runner(data_loaders[i], self.epoch, **kwargs)
  File "/root/FocalsConv-master/CenterPoint/det3d/torchie/trainer/trainer.py", line 419, in train
    outputs = self.batch_processor_inline(
  File "/root/FocalsConv-master/CenterPoint/det3d/torchie/trainer/trainer.py", line 378, in batch_processor_inline
    losses = model(example, return_loss=True)
  File "/root/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/FocalsConv-master/CenterPoint/det3d/models/detectors/voxelnet_focal.py", line 85, in forward
    x, _, loss_box_of_pts = self.extract_feat(data, batch_dict)
  File "/root/FocalsConv-master/CenterPoint/det3d/models/detectors/voxelnet_focal.py", line 38, in extract_feat
    x, voxel_feature, loss_box_of_pts = self.backbone(
  File "/root/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/FocalsConv-master/CenterPoint/det3d/models/backbones/scn_largekernel.py", line 350, in forward
    x_conv1, _loss = self.conv1(x, batch_dict)
  File "/root/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/FocalsConv-master/CenterPoint/det3d/models/backbones/scn_largekernel.py", line 44, in forward
    input = module(input)
  File "/root/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/FocalsConv-master/CenterPoint/det3d/models/backbones/scn_largekernel.py", line 215, in forward
    out = self.conv1(x)
  File "/root/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/FocalsConv-master/CenterPoint/det3d/models/backbones/scn_largekernel.py", line 127, in forward
    x_conv_block = self.block(x_conv)
  File "/root/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/miniconda3/lib/python3.8/site-packages/spconv/pytorch/conv.py", line 741, in forward
    return self._conv_forward(self.training, input, self.weight, self.bias, add_input,
  File "/root/miniconda3/lib/python3.8/site-packages/spconv/pytorch/conv.py", line 312, in _conv_forward
    out_features = Fsp.indice_subm_conv(
  File "/root/miniconda3/lib/python3.8/site-packages/torch/cuda/amp/autocast_mode.py", line 94, in decorate_fwd
    return fwd(*args, **kwargs)
  File "/root/miniconda3/lib/python3.8/site-packages/spconv/pytorch/functional.py", line 327, in forward
    raise e
  File "/root/miniconda3/lib/python3.8/site-packages/spconv/pytorch/functional.py", line 308, in forward
    return ops.indice_conv(features,
  File "/root/miniconda3/lib/python3.8/site-packages/spconv/pytorch/ops.py", line 860, in indice_conv
    ConvGemmOps.indice_conv(alloc, ext_mm, GEMM_CPP, ALL_WEIGHT_IS_KRSC,
RuntimeError: /tmp/pip-build-env-oq41cytq/overlay/lib/python3.8/site-packages/cumm/include/tensorview/tensor.h(770) stride_valid assert faild. non-contiguous stride can't handled.

yukang2017 commented 1 year ago

Sorry for the late reply. Could you share the versions of torch and spconv you are using?
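
For anyone hitting the same error, here is a minimal sketch of one way to collect that version information using only the Python standard library. The helper name `installed_versions` is made up for illustration, and the package list is an assumption: `spconv-cu113` is taken from this thread, while `cumm-cu113` is a guess at the matching cumm wheel, so adjust the names to your CUDA build.

```python
from importlib import metadata
from typing import Dict, Iterable, Optional


def installed_versions(packages: Iterable[str]) -> Dict[str, Optional[str]]:
    """Return {package name: installed version, or None if it is not installed}."""
    versions: Dict[str, Optional[str]] = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    # Assumed package names; change the -cu113 suffix to match your install.
    wanted = ["torch", "torchvision", "spconv-cu113", "cumm-cu113"]
    for name, version in installed_versions(wanted).items():
        print(f"{name}: {version or 'not installed'}")
```

Pasting the printed lines into the issue gives maintainers the exact wheel names as well as the version numbers.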

lda1049187465 commented 1 year ago

torch 1.10.0+cu113
torchvision 0.11.1+cu113
spconv-cu113 2.3.6

lda1049187465 commented 1 year ago

I can use your FocalsConv code normally, but using LargeKernel3D causes the above problem.

yukang2017 commented 1 year ago

Could you please try spconv 2.1.x?

getabear commented 1 year ago

I resolved the issue by modifying the code.

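def _convert_weight(self, weight):
    # self.block.weight has layout [out, 7, 7, 7, in].
    # Move the input-channel axis next to the output-channel axis and flatten
    # the kernel volume to [out, in, k*k*k]; clone() gives each tensor its own storage.
    weight_reshape = self.block.weight.permute(0, 4, 1, 2, 3).reshape(
        self.out_channels, self.in_channels, -1).clone()
    weight_return = self.block.weight.permute(0, 4, 1, 2, 3).reshape(
        self.out_channels, self.in_channels, -1).clone()
    # Replace the weights in each index group with that group's mean.
    for _indice in self._indice_list:
        _mean_weight = torch.mean(weight_reshape[:, :, _indice], dim=-1, keepdim=True)
        weight_return[:, :, _indice] = _mean_weight
    # Restore [out, k, k, k, in]. permute() only returns a strided view, so
    # contiguous() copies the result into dense memory before it is used.
    return weight_return.reshape(
        self.out_channels, self.in_channels,
        self.kernel_size, self.kernel_size, self.kernel_size,
    ).permute(0, 2, 3, 4, 1).contiguous()

Fragilesky commented 1 year ago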

Thanks very much, it works

Xavier-wa commented 6 days ago

It works, but I don't understand why it works.
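
A likely explanation, going only by the error message and the posted fix: the assertion complains about a non-contiguous stride, and the fix ends with `.contiguous()`. In PyTorch, `permute()` does not move any data; it returns a view whose strides are reordered, so the result is generally not laid out densely in memory. The sketch below models that stride bookkeeping in plain Python rather than with PyTorch itself. The helper names and the 16-channel sizes are illustrative assumptions, and the contiguity check is simplified (PyTorch treats size-1 dimensions more leniently).

```python
from typing import Sequence, Tuple


def contiguous_strides(shape: Sequence[int]) -> Tuple[int, ...]:
    """Row-major (C-order) strides, in elements, for a dense tensor of `shape`."""
    strides = []
    step = 1
    for size in reversed(shape):
        strides.append(step)
        step *= size
    return tuple(reversed(strides))


def permute(shape, strides, dims):
    """Reorder axes the way a permute view does: only metadata changes."""
    return tuple(shape[d] for d in dims), tuple(strides[d] for d in dims)


def is_contiguous(shape, strides) -> bool:
    """Simplified check: strides must equal the dense row-major strides."""
    return tuple(strides) == contiguous_strides(shape)


# Hypothetical sizes: 16 output channels, 16 input channels, 7x7x7 kernel.
shape = (16, 16, 7, 7, 7)             # [out, in, k, k, k]
strides = contiguous_strides(shape)   # dense layout, as after reshape/clone

# View as [out, k, k, k, in] without copying, like .permute(0, 2, 3, 4, 1).
view_shape, view_strides = permute(shape, strides, (0, 2, 3, 4, 1))
print(view_shape, view_strides, is_contiguous(view_shape, view_strides))

# A contiguous() call copies into a fresh dense buffer, so the strides
# become the row-major strides of the permuted shape.
dense_strides = contiguous_strides(view_shape)
print(view_shape, dense_strides, is_contiguous(view_shape, dense_strides))
```

On a real tensor the same information is available from `Tensor.stride()` and `Tensor.is_contiguous()`, which is a quick way to check whether the weight returned by `_convert_weight` is dense before it reaches spconv.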