Haiyang-W / DSVT

[CVPR2023] Official Implementation of "DSVT: Dynamic Sparse Voxel Transformer with Rotated Sets"
https://arxiv.org/abs/2301.06051
Apache License 2.0

During the model training process, an error occurs in the forward propagation phase. The error message indicates a dimension mismatch between the feature processed by the attention mechanism in dsvt.py and the original feature. #51

Closed luoxiaoliaolan closed 10 months ago

luoxiaoliaolan commented 11 months ago

Thank you for your outstanding work. I have been following your work for a long time and have been wanting to try the DSVT model for 3D object detection myself. I ran the DSVT model on OpenPCDet. The training data I used was self-prepared and followed a format similar to KITTI, including point cloud data and 3D annotation files. I have completed the preprocessing of the dataset, and I have successfully trained and tested it using both the CenterPoint and PVRCNN++ models. However, when attempting to train using the DSVT model, I encountered an error during the training process. Below is the error message:

```
Traceback (most recent call last):
  File "train_single.py", line 245, in <module>
    main()
  File "train_single.py", line 189, in main
    train_model(
  File "/mnt/volumes/perception/lyb/openpcdet/tools/train_utils/train_utils.py", line 180, in train_model
    accumulated_iter = train_one_epoch(
  File "/mnt/volumes/perception/lyb/openpcdet/tools/train_utils/train_utils.py", line 56, in train_one_epoch
    loss, tb_dict, disp_dict = model_func(model, batch)
  File "/mnt/volumes/perception/lyb/openpcdet/tools/../pcdet/models/__init__.py", line 44, in model_func
    ret_dict, tb_dict, disp_dict = model(batch_dict)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/mnt/volumes/perception/lyb/openpcdet/tools/../pcdet/models/detectors/centerpoint.py", line 12, in forward
    batch_dict = cur_module(batch_dict)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/mnt/volumes/perception/lyb/openpcdet/tools/../pcdet/models/backbones_3d/dsvt.py", line 125, in forward
    output = block(output, set_voxel_inds_list[stage_id], set_voxel_masks_list[stage_id], pos_embed_list[stage_id][i], \
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/mnt/volumes/perception/lyb/openpcdet/tools/../pcdet/models/backbones_3d/dsvt.py", line 193, in forward
    output = layer(output, set_voxel_inds, set_voxel_masks, pos_embed)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/mnt/volumes/perception/lyb/openpcdet/tools/../pcdet/models/backbones_3d/dsvt.py", line 209, in forward
    src = self.win_attn(src, pos, set_voxel_masks, set_voxel_inds)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/mnt/volumes/perception/lyb/openpcdet/tools/../pcdet/models/backbones_3d/dsvt.py", line 273, in forward
    src = src + self.dropout1(src2)
RuntimeError: The size of tensor a (322658) must match the size of tensor b (322639) at non-singleton dimension 0
```

After receiving this error message, I checked the relevant code location and found that the function takes the input feature `src` (voxel features with shape (N, C), where N is the number of voxels) and applies the attention mechanism to obtain `src2`. However, the operation `src = src + self.dropout1(src2)` fails because the two tensors no longer match in the first dimension, which stops training. To troubleshoot this issue, I added the following code snippet:

```python
# FFN layer
print(f"src.shape: {src.shape}")
print(f"src2.shape: {src2.shape}")
```

During runtime, the output was:

```
src.shape: torch.Size([546464, 192])
src2.shape: torch.Size([546464, 192])
src.shape: torch.Size([546464, 192])
src2.shape: torch.Size([546464, 192])
src.shape: torch.Size([406328, 192])
src2.shape: torch.Size([406328, 192])
src.shape: torch.Size([406328, 192])
src2.shape: torch.Size([406328, 192])
src.shape: torch.Size([322658, 192])
src2.shape: torch.Size([322639, 192])
```

Execution continued normally until the last pair of shapes above, where the mismatch first appears.
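A guard like the one below makes the failure a little more explicit than the raw prints; it is only a diagnostic dropped in right before the residual add (with `src` and `src2` already in scope at that point in dsvt.py), not a fix:

```python
# Diagnostic only: placed immediately before `src = src + self.dropout1(src2)`
# in pcdet/models/backbones_3d/dsvt.py.
if src.shape[0] != src2.shape[0]:
    raise RuntimeError(
        f"attention returned {src2.shape[0]} voxels but {src.shape[0]} went in "
        f"({src.shape[0] - src2.shape[0]} voxel(s) lost during set partitioning)"
    )
```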

I also debugged this line of code: `src2 = self.self_attn(query, key, value, key_padding_mask)[0]` and observed the following shapes:

- query: (13907, 48, 192)
- key: (13907, 48, 192)
- value: (13907, 48, 192)
- key_padding_mask: (13907, 48)
- src2: (322639, 192)

I'm not sure how to resolve this error. Could you help take a look at it?

I've attached my model configuration. If you need more information, please feel free to message me privately. dsvt_3d.yaml

```yaml
CLASS_NAMES: ['traffic_cone', 'traffic_column', 'Tripod']

DATA_CONFIG:
    _BASE_CONFIG_: cfgs/dataset_configs/pandar_dataset_3class.yaml
    OUTPUT_PATH: '/lpai/volumes/perception/lyb/output'

    POINT_CLOUD_RANGE: [ -82.0, -60.0, -3.0, 82.0, 60.0, 3.0 ]
    DATA_AUGMENTOR:
        DISABLE_AUG_LIST: ['placeholder']
        AUG_CONFIG_LIST:

MODEL:
    NAME: CenterPoint

    VFE:
        NAME: DynamicVoxelVFE
        WITH_DISTANCE: False
        USE_ABSLOTE_XYZ: True
        USE_NORM: True
        NUM_FILTERS: [ 192, 192 ]

    BACKBONE_3D:
        NAME: DSVT
        INPUT_LAYER:
            sparse_shape: [468, 468, 32]
            downsample_stride: [[1, 1, 4], [1, 1, 4], [1, 1, 2]]
            d_model: [192, 192, 192, 192]
            set_info: [[48, 1], [48, 1], [48, 1], [48, 1]]
            window_shape: [[12, 12, 32], [12, 12, 8], [12, 12, 2], [12, 12, 1]]
            hybrid_factor: [2, 2, 1] # x, y, z
            shifts_list: [[[0, 0, 0], [6, 6, 0]], [[0, 0, 0], [6, 6, 0]], [[0, 0, 0], [6, 6, 0]], [[0, 0, 0], [6, 6, 0]]]
            normalize_pos: False

        block_name: ['DSVTBlock','DSVTBlock','DSVTBlock','DSVTBlock']
        set_info: [[48, 1], [48, 1], [48, 1], [48, 1]]
        d_model: [192, 192, 192, 192]
        nhead: [8, 8, 8, 8]
        dim_feedforward: [384, 384, 384, 384]
        dropout: 0.0
        activation: gelu
        reduction_type: 'attention'
        output_shape: [468, 468]
        conv_out_channel: 192

    MAP_TO_BEV:
        NAME: PointPillarScatter3d
        INPUT_SHAPE: [468, 468, 1]
        NUM_BEV_FEATURES: 192

    BACKBONE_2D:
        NAME: BaseBEVResBackbone
        LAYER_NUMS: [ 1, 2, 2 ]
        LAYER_STRIDES: [ 1, 2, 2 ]
        NUM_FILTERS: [ 128, 128, 256 ]
        UPSAMPLE_STRIDES: [ 1, 2, 4 ]
        NUM_UPSAMPLE_FILTERS: [ 128, 128, 128 ]

    DENSE_HEAD:
        NAME: CenterHead
        CLASS_AGNOSTIC: False

        CLASS_NAMES_EACH_HEAD: [
            ['traffic_cone', 'traffic_column', 'Tripod']
        ]

        SHARED_CONV_CHANNEL: 64
        USE_BIAS_BEFORE_NORM: False
        NUM_HM_CONV: 2

        BN_EPS: 0.001
        BN_MOM: 0.01
        SEPARATE_HEAD_CFG:
            HEAD_ORDER: ['center', 'center_z', 'dim', 'rot']
            HEAD_DICT: {
                'center': {'out_channels': 2, 'num_conv': 2},
                'center_z': {'out_channels': 1, 'num_conv': 2},
                'dim': {'out_channels': 3, 'num_conv': 2},
                'rot': {'out_channels': 2, 'num_conv': 2},
                'iou': {'out_channels': 1, 'num_conv': 2},
            }

        TARGET_ASSIGNER_CONFIG:
            FEATURE_MAP_STRIDE: 1
            NUM_MAX_OBJS: 500
            GAUSSIAN_OVERLAP: 0.1
            MIN_RADIUS: 2

        IOU_REG_LOSS: True

        LOSS_CONFIG:
            LOSS_WEIGHTS: {
                'cls_weight': 1.0,
                'loc_weight': 2.0,
                'code_weights': [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
            }

        POST_PROCESSING:
            SCORE_THRESH: 0.5
            POST_CENTER_LIMIT_RANGE: [ -82.0, -60.0, -3.0, 82.0, 60.0, 3.0 ]
            MAX_OBJ_PER_SAMPLE: 500

            USE_IOU_TO_RECTIFY_SCORE: True
            IOU_RECTIFIER: [0.68, 0.71, 0.65]

            NMS_CONFIG:
                NMS_TYPE: multi_class_nms  # only for centerhead, use mmdet3d version nms
                NMS_THRESH: [0.5, 0.5, 0.6]
                NMS_PRE_MAXSIZE: [4096, 4096, 4096]
                NMS_POST_MAXSIZE: [500, 500, 500]

    POST_PROCESSING:
        RECALL_THRESH_LIST: [0.3, 0.5, 0.7]
        EVAL_METRIC: kitti

OPTIMIZATION:
    BATCH_SIZE_PER_GPU: 4
    NUM_EPOCHS: 30

    OPTIMIZER: adam_onecycle
    LR: 0.003
    WEIGHT_DECAY: 0.01
    MOMENTUM: 0.9

    MOMS: [0.95, 0.85]
    PCT_START: 0.1
    DIV_FACTOR: 100
    DECAY_STEP_LIST: [35, 45]
    LR_DECAY: 0.1
    LR_CLIP: 0.0000001

    LR_WARMUP: False
    WARMUP_EPOCH: 1

    GRAD_NORM_CLIP: 10
    LOSS_SCALE_FP16: 32.0

HOOK:
    DisableAugmentationHook:
        DISABLE_AUG_LIST: ['gt_sampling','random_world_flip','random_world_rotation','random_world_scaling', 'random_world_translation']
        NUM_LAST_EPOCHS: 1
```

I look forward to your response and hope to stay in touch.

Haiyang-W commented 11 months ago

Thanks for your interest in DSVT. I am rushing to meet some deadlines and may not be able to reply right away.

At present, it seems that the problem is caused by 3D pooling or wrong partitioning in some corner cases.

I recommend trying DSVT-P first; this problem should not occur there.

I'll take a closer look at it in a day or two.

chenshi3 commented 11 months ago

I think the issue may be caused by the torch.floor operation in DSVTInputLayer. To debug this, I recommend reviewing the code in the get_set_single_shift function to see which operation causes the loss of voxel indices. If you're still having trouble, feel free to send me an email and we can work together to debug the issue more thoroughly.
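A rough sketch of the kind of check I have in mind is below; the exact layout of set_voxel_inds in your run is an assumption (any integer tensor of per-set voxel indices works), so adapt it to the tensors you actually have:

```python
import torch

def report_missing_voxels(set_voxel_inds, num_voxels, tag=""):
    """Count how many of the `num_voxels` voxel indices appear in at least one set.

    `set_voxel_inds` is assumed to be an integer tensor of voxel indices grouped
    into sets (any shape). Padded slots that repeat a real index are harmless here;
    what matters is whether any voxel index is missing entirely, since every input
    voxel must be covered for the residual add to line up again.
    """
    covered = torch.unique(set_voxel_inds.reshape(-1))
    covered = covered[(covered >= 0) & (covered < num_voxels)]
    missing = num_voxels - covered.numel()
    print(f"{tag}: {covered.numel()}/{num_voxels} voxel indices covered, {missing} missing")
    return missing
```

Calling this on the sets produced for each stage/shift should show at which point voxel indices start to disappear.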

chenshi3 commented 11 months ago

> I think the issue may be caused by the torch.floor operation in DSVTInputLayer. To debug this, I recommend reviewing the code in the get_set_single_shift function to see which operation causes the loss of voxel indices. If you're still having trouble, feel free to send me an email and we can work together to debug the issue more thoroughly.

I also noticed that you are using incorrect sparse_shape and window_shape values; they should match your POINT_CLOUD_RANGE and VOXEL_SIZE. If these parameters are not set consistently, they can cause issues like this and lead to unexpected results.

luoxiaoliaolan commented 11 months ago

@chenshi3 Thank you very much for your patient responses. Following your suggestions, I referred to the DSVT model configuration provided by your team in OpenPCDet and modified POINT_CLOUD_RANGE and VOXEL_SIZE; after retraining, the training process now proceeds smoothly. The current settings are POINT_CLOUD_RANGE = [-74.88, -74.88, -2, 74.88, 74.88, 4.0] and VOXEL_SIZE = [0.32, 0.32, 0.1875].

However, I would like to adjust these two parameters to better fit my own dataset. My requirement is to reduce the voxel size a bit, with a POINT_CLOUD_RANGE of approximately [-80.0, -60.0, -3, 80.0, 60.0, 3.0]. Could you provide an example based on the values I mentioned, illustrating how to set these two parameters? What is the process for calculating them? Thank you very much!

chenshi3 commented 11 months ago

With POINT_CLOUD_RANGE set to [-80.0, -60.0, -3, 80.0, 60.0, 3.0], I suggest configuring VOXEL_SIZE as [0.4, 0.4, 0.1875]. Consequently, the sparse_shape should be [400, 300, 32], while the window_shape can remain unaltered.
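For reference, the arithmetic is simply the point-cloud extent divided by the voxel size along each axis; a small self-contained sketch (the helper name is made up for illustration):

```python
def grid_size(point_cloud_range, voxel_size):
    """Number of voxels along x, y and z for a given range and voxel size."""
    xmin, ymin, zmin, xmax, ymax, zmax = point_cloud_range
    return [round((xmax - xmin) / voxel_size[0]),
            round((ymax - ymin) / voxel_size[1]),
            round((zmax - zmin) / voxel_size[2])]

# The default config this thread started from:
print(grid_size([-74.88, -74.88, -2.0, 74.88, 74.88, 4.0], [0.32, 0.32, 0.1875]))  # [468, 468, 32]
# The range and voxel size suggested above:
print(grid_size([-80.0, -60.0, -3.0, 80.0, 60.0, 3.0], [0.4, 0.4, 0.1875]))        # [400, 300, 32]
```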

luoxiaoliaolan commented 11 months ago

@chenshi3 I'm glad to discuss the training process of the DSVT model with you. I've seen your response in this issue and would like to elaborate further on the topic.

My LiDAR is a 360-degree rotating sensor with a scanning range of 120 meters front and rear and 120 meters left and right. For training, I want to focus detection on a range of about 80 meters front and rear and 60 meters left and right. This doesn't have to be a strict limit, and there can be some flexibility in the vertical dimension as well.

Regarding the voxel size, I've been wondering whether a larger voxel might lose some fine-grained detail during sampling and lead to suboptimal detection of small objects, so I'd like to understand how voxel size affects both training and detection performance. I'm therefore considering decreasing the voxel size while keeping the detection range constant. However, that adjustment seems to require changing certain model parameters, such as sparse_shape, d_model, and conv_out_channel, and so far I haven't managed to set them so that training runs smoothly.
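To make my question concrete, here is my rough understanding of which configuration keys would need to change if I reduced the horizontal voxel size while keeping your suggested range; the 0.2 m value and the derived numbers below are only an illustration, so please correct me if I've misunderstood:

```python
# Keeping POINT_CLOUD_RANGE = [-80.0, -60.0, -3.0, 80.0, 60.0, 3.0] and halving the
# horizontal voxel size from 0.4 m to 0.2 m (a made-up value, just for illustration):
#   grid = [(80 - -80) / 0.2, (60 - -60) / 0.2, (3 - -3) / 0.1875] = [800, 600, 32]

# Keys that track the grid and would have to change together:
grid_dependent = {
    "BACKBONE_3D.INPUT_LAYER.sparse_shape": [800, 600, 32],
    "BACKBONE_3D.output_shape":             [800, 600],
    "MAP_TO_BEV.INPUT_SHAPE":               [800, 600, 1],
}

# Keys that are feature widths rather than grid sizes, so (as far as I understand)
# they should not need to change when only the voxel size changes:
feature_widths = {
    "BACKBONE_3D.d_model":          [192, 192, 192, 192],
    "BACKBONE_3D.conv_out_channel": 192,
}
```

My understanding from your earlier reply is that window_shape is measured in voxels, so it could stay the same; is that right?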