Errors in RoI Point Pool

quotation2520 commented 2 years ago

Hi, I've been hitting the following errors in sfd_head.py, around lines 583-586:

        points_features, points_neighbor, points_batch, points_roi, points_coords_src =\
              self.roicrop3d_gpu(batch_dict, self.model_cfg.ROI_POINT_CROP.POOL_EXTRA_WIDTH)
        points_features_expand = self.cpconvs_layer(points_features, points_neighbor)[1:]

Sometimes the error is raised inside the roicrop3d_gpu() function:

File "pcdet/models/roi_heads/sfd_head.py", line 550, in roicrop3d_gpu
    points_cur = image[total_pts_features[:, 7].long() + dx, total_pts_features[:, 6].long() + dy]
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

Other times, it is raised in the forward pass of the CPConvs module:

File "/pcdet/models/roi_heads/sfd_head_utils.py", line 44, in forward
    point_empty = (points_neighbor == 0).nonzero()
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
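
Because CUDA reports device-side asserts asynchronously, both traces above may point at a line after the one that actually failed. One way to narrow it down, besides setting CUDA_LAUNCH_BLOCKING=1, is to repeat the suspicious indexing on the CPU, where an out-of-range index raises an ordinary IndexError. The sketch below uses small made-up tensors as stand-ins for `image` and the index columns; `try_gather` is a hypothetical helper, not part of the SFD codebase.

```python
import torch


def try_gather(image, rows, cols):
    """Index image[rows, cols] on the CPU.

    Returns (values, None) on success or (None, message) if an index is
    out of range. On the CPU this surfaces as a readable IndexError
    instead of a device-side assert.
    """
    try:
        return image.cpu()[rows.cpu().long(), cols.cpu().long()], None
    except IndexError as exc:
        return None, str(exc)


# Hypothetical stand-ins for the real tensors in roicrop3d_gpu().
image = torch.zeros(4, 5)           # (H, W) = (4, 5)
good_rows = torch.tensor([0, 3])
good_cols = torch.tensor([0, 4])
bad_rows = torch.tensor([0, 4])     # 4 is out of range for H = 4
bad_cols = torch.tensor([0, 1])

good_values, good_error = try_gather(image, good_rows, good_cols)
bad_values, bad_error = try_gather(image, bad_rows, bad_cols)
print(bad_error)
```

If the CPU run fails, the message names the offending index and dimension, which is usually enough to tell whether the row or the column index is the problem.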

I've been porting the codebase to spconv 2.0 and have succeeded in building the model; now I just want to train it. Has anyone experienced the same problem?

Thank you.

quotation2520 commented 2 years ago

The error turned out to be caused by out-of-range indices, and I fixed it by adding the following code at line 543:

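        # Keep only points whose pixel indices (column 7 -> image dim 0, column 6 -> image dim 1) stay inside the image.
        filter_idx = (
            (2 <= total_pts_features[:, 7]) & (total_pts_features[:, 7] < image.shape[0] - 2)
            & (0 <= total_pts_features[:, 6]) & (total_pts_features[:, 6] < image.shape[1] - 2)
        )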
        total_pts_features = total_pts_features[filter_idx]
        total_pts_batch_index = total_pts_batch_index[filter_idx]
        total_pts_features_xyz_src = total_pts_features_xyz_src[filter_idx]

Training has now run for more than 1000 iterations without the error, so I hope this settles the issue.
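
For anyone who wants to try the idea outside the model, the filter above can be sketched as a standalone function. This is a minimal sketch under assumptions: `filter_in_bounds` and its `margin` parameter are hypothetical names, the margin is applied symmetrically on all four sides (the snippet above uses a lower bound of 0 for column 6), and the real range of the `dx`/`dy` offsets is not shown in this thread.

```python
import torch


def filter_in_bounds(pixel_rows, pixel_cols, image_hw, margin, *tensors):
    """Build a mask of points whose pixel indices stay inside the image
    even after offsets of up to +/- margin, and apply it to every tensor
    passed in so they all stay aligned.
    """
    height, width = image_hw
    keep = (
        (pixel_rows >= margin) & (pixel_rows < height - margin)
        & (pixel_cols >= margin) & (pixel_cols < width - margin)
    )
    return keep, tuple(t[keep] for t in tensors)


# Made-up data: 4 points with 8 features each; column 7 holds the image
# row and column 6 the image column, as in the indexing expression above.
features = torch.zeros(4, 8)
features[:, 7] = torch.tensor([2.0, 1.0, 5.0, 7.0])    # rows
features[:, 6] = torch.tensor([3.0, 3.0, -1.0, 4.0])   # columns
batch_index = torch.tensor([0, 0, 1, 1])

keep, (features_kept, batch_index_kept) = filter_in_bounds(
    features[:, 7], features[:, 6], (10, 12), 2, features, batch_index
)
print(keep.tolist())
```

Passing every per-point tensor through the same mask matters here: if one of them is left unfiltered, the point features, batch indices, and source coordinates no longer line up.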

HuangLLL123 commented 10 months ago

Hello, after applying your fix I ran into a new error:

ValueError: Expected more than 1 value per channel when training, got input size torch.Size([1, 12, 1])

Have you ever seen this error, and do you know how to solve it?
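
PyTorch raises this message when a batch-normalization layer in training mode receives only one value per channel, so it cannot compute batch statistics. The sketch below reproduces it with the reported input size; which layer in SFD triggers it, and whether the filter leaves only a single point, are assumptions that would need checking. `has_enough_values` is a hypothetical guard, not part of the codebase.

```python
import math

import torch
import torch.nn as nn


def has_enough_values(x: torch.Tensor) -> bool:
    """True if BatchNorm in training mode would see more than one value
    per channel, i.e. N * prod(spatial dims) > 1 for shape (N, C, *spatial).
    """
    values_per_channel = x.shape[0] * math.prod(x.shape[2:])
    return values_per_channel > 1


bn = nn.BatchNorm1d(12)            # 12 channels, matching the reported size
single = torch.randn(1, 12, 1)     # the failing shape: one value per channel
pair = torch.randn(2, 12, 1)       # two values per channel

bn.train()
try:
    bn(single)
    train_error = None
except ValueError as exc:
    train_error = str(exc)

pair_out = bn(pair)                # training mode works with two samples

bn.eval()
eval_out = bn(single)              # eval mode uses running statistics
print(train_error)
```

Possible workarounds, depending on where the single-element input comes from, include skipping the layer's input when `has_enough_values` returns False, or setting `drop_last=True` on the training DataLoader if the last batch is the one that shrinks to a single sample.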

LittlePey / SFD

Errors in RoI Point Pool #23