Haiyang-W / UniTR

[ICCV2023] Official Implementation of "UniTR: A Unified and Efficient Multi-Modal Transformer for Bird’s-Eye-View Representation"
https://arxiv.org/abs/2308.07732
Apache License 2.0

Perceptual areas that do not overlap exactly #20

Closed midofalasol closed 8 months ago

midofalasol commented 8 months ago

Hello,

Thanks for sharing your excellent work, it is very much appreciated. I have a question about my particular setup.
If the multi-modal fusion uses only a front-facing point cloud (FOV less than 180°) and 6 surround-view images, is the UniTR method still effective? Is it possible to set the IMAGE2LIDAR and LIDAR2IMAGE sections to different point cloud ranges? If this setup is still viable, does it require any other configuration changes?

Thank you very much and look forward to hearing from you.

nnnth commented 8 months ago

You can try modifying the point_cloud_range parameter in the following part of the config.

FUSE_BACKBONE:
    IMAGE2LIDAR: 
      block_start: 3
      block_end: 4
      point_cloud_range: [-54.0, -54.0, -10.0, 54.0, 54.0, 10.0]
      voxel_size: [0.3,0.3,20.0]
      sample_num: 20
      image2lidar_layer:
        sparse_shape: [360, 360, 1]
        d_model: [128]
        set_info: [[90, 1]]
        window_shape: [[30, 30, 1]]
        hybrid_factor: [1, 1, 1]
        shifts_list: [[[0, 0, 0], [15, 15, 0]]]
        expand_max_voxels: 10
    LIDAR2IMAGE:
      block_start: 1
      block_end: 3
      point_cloud_range: [-54.0, -54.0, -5.0, 54.0, 54.0, 3.0]
      voxel_size: [0.3,0.3,8.0]
      sample_num: 1
      lidar2image_layer:
        sparse_shape: [96, 264, 6]
        d_model: [128]
        set_info: [[90, 2]]
        window_shape: [[30, 30, 1]]
        hybrid_factor: [1, 1, 1]
        shifts_list: [[[0, 0, 0], [15, 15, 0]]]
        expand_max_voxels: 30

Since these two processes are handled separately, this setup is theoretically feasible with UniTR, but we haven't tested this scenario, so some boundary handling may be needed.
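
For reference, the grid resolutions above follow directly from point_cloud_range and voxel_size. A minimal sketch of that arithmetic (plain Python, illustrative only and not code from this repo; the axis ordering of sparse_shape can differ per layer, so this only checks per-axis cell counts):

# Sanity check: the BEV grid size implied by a point_cloud_range and
# voxel_size, computed per axis.
def grid_counts(point_cloud_range, voxel_size):
    mins, maxs = point_cloud_range[:3], point_cloud_range[3:]
    return [round((hi - lo) / v) for lo, hi, v in zip(mins, maxs, voxel_size)]

# Full 360-degree lidar range used above: 108 m / 0.3 m = 360 cells in x and y.
print(grid_counts([-54.0, -54.0, -10.0, 54.0, 54.0, 10.0], [0.3, 0.3, 20.0]))  # [360, 360, 1]
# A front-only x range of [0, 54] with the same voxel size gives 180 cells in x.
print(grid_counts([0.0, -54.0, -5.0, 54.0, 54.0, 3.0], [0.3, 0.3, 8.0]))  # [180, 360, 1]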

midofalasol commented 8 months ago

@nnnth Thank you for your quick reply! I understand that the yaml should be changed as follows (the INPUT_SHAPE of the MAP_TO_BEV part should be the larger of the two sensors' grids). Could you please help check it?

MODEL:
    NAME: UniTR

    MM_BACKBONE:
      NAME: UniTR
      PATCH_EMBED:
        in_channels: 3
        image_size: [256, 704]
        embed_dims: 128
        patch_size: 8 
        patch_norm: True 
        norm_cfg: {'type': 'LN'}

      IMAGE_INPUT_LAYER:
        sparse_shape: [32, 88, 1]
        d_model: [128]
        set_info: [[90, 4]]
        window_shape: [[30, 30, 1]]
        hybrid_factor: [1, 1, 1] # x, y, z
        shifts_list: [[[0, 0, 0], [15, 15, 0]]]
        input_image: True

      LIDAR_INPUT_LAYER:
        sparse_shape: [180, 360, 1]                 # modified
        d_model: [128]
        set_info: [[90, 4]]
        window_shape: [[30, 30, 1]]
        hybrid_factor: [1, 1, 1] # x, y, z
        shifts_list: [[[0, 0, 0], [15, 15, 0]]]

      set_info: [[90, 4]]
      d_model: [128]
      nhead: [8]
      dim_feedforward: [256]
      dropout: 0.0
      activation: gelu
      checkpoint_blocks: [0,1,2,3] # here can save 50% CUDA memory with marginal speed drop
      layer_cfg: {'use_bn': False, 'split_ffn': True, 'split_residual': True}

      # fuse backbone config
      FUSE_BACKBONE:
        IMAGE2LIDAR: 
          block_start: 3
          block_end: 4
          point_cloud_range: [-54.0, -54.0, -10.0, 54.0, 54.0, 10.0]
          voxel_size: [0.3,0.3,20.0]
          sample_num: 20
          image2lidar_layer:
            sparse_shape: [360, 360, 1]
            d_model: [128]
            set_info: [[90, 1]]
            window_shape: [[30, 30, 1]]
            hybrid_factor: [1, 1, 1]
            shifts_list: [[[0, 0, 0], [15, 15, 0]]]
            expand_max_voxels: 10
        LIDAR2IMAGE:
          block_start: 1
          block_end: 3
          point_cloud_range: [0.0, -54.0, -5.0, 54.0, 54.0, 3.0]                     # modified
          voxel_size: [0.3,0.3,8.0]
          sample_num: 1
          lidar2image_layer:
            sparse_shape: [96, 264, 3]                                             # modified
            d_model: [128]
            set_info: [[90, 2]]
            window_shape: [[30, 30, 1]]
            hybrid_factor: [1, 1, 1]
            shifts_list: [[[0, 0, 0], [15, 15, 0]]]
            expand_max_voxels: 30
      out_indices: []

    VFE:
      NAME: DynPillarVFE
      WITH_DISTANCE: False
      USE_ABSLOTE_XYZ: True
      USE_NORM: True
      NUM_FILTERS: [ 128, 128 ]

    MAP_TO_BEV:
      NAME: PointPillarScatter3d
      INPUT_SHAPE: [360, 360, 1]
      NUM_BEV_FEATURES: 128

    BACKBONE_2D:
      NAME: BaseBEVResBackbone
      LAYER_NUMS: [ 1, 2, 2, 2] # 
      LAYER_STRIDES: [1, 2, 2, 2]
      NUM_FILTERS: [128, 128, 256, 256]
      UPSAMPLE_STRIDES: [0.5, 1, 2, 4]
      NUM_UPSAMPLE_FILTERS: [128, 128, 128, 128]

I ran a similar configuration (with LSS) successfully, but I still want to make sure that the features of the two modalities are sufficiently fused.

Sincerely appreciate your help~

midofalasol commented 8 months ago

In addition, I noticed that unitr+lss.yaml only trains for 10 epochs. Do I need to train the two modalities separately first and then load the pre-trained models for fusion? @nnnth @Haiyang-W

Haiyang-W commented 8 months ago

> In addition, I noticed that unitr+lss.yaml only trains for 10 epochs. Do I need to train the two modalities separately first and then load the pre-trained models for fusion? @nnnth @Haiyang-W

One major advantage of our model is its one-stage training, which eliminates the need to train the two branches separately; only a single initialization is required. Currently we use ImageNet initialization, but you can even use pre-trained ViT parameters.

Haiyang-W commented 8 months ago

This means that the entire training process only requires these 10 epochs.

nnnth commented 8 months ago

I think most of your modifications are correct except for the sparse shape in LIDAR2IMAGE. All image features should undergo self-attention, regardless of whether they are covered by point clouds or not, so I believe the sparse shape should still be [96, 264, 6]. The best way to verify is through visualization, for example by visualizing the mapped point clouds in image space. Feel free to ask if you have any other questions.
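
A minimal projection sketch for that kind of check (illustrative only, not code from this repo; it assumes you already have the 4x4 lidar-to-image matrix for one camera, i.e. the intrinsics composed with the lidar-to-camera extrinsics, plus the raw points and the image as numpy arrays):

import numpy as np
import matplotlib.pyplot as plt

def draw_points_on_image(points_xyz, lidar2img, image, out_path='projection.png'):
    # Project lidar points (N, 3) through a 4x4 lidar-to-image matrix and
    # overlay them on the image, colored by depth.
    pts_h = np.concatenate([points_xyz, np.ones((len(points_xyz), 1))], axis=1)  # (N, 4)
    proj = pts_h @ lidar2img.T
    depth = proj[:, 2]
    in_front = depth > 1e-3                      # keep only points in front of the camera
    uv = proj[in_front, :2] / depth[in_front, None]
    h, w = image.shape[:2]
    in_view = (uv[:, 0] >= 0) & (uv[:, 0] < w) & (uv[:, 1] >= 0) & (uv[:, 1] < h)
    uv, d = uv[in_view], depth[in_front][in_view]

    plt.imshow(image)
    plt.scatter(uv[:, 0], uv[:, 1], c=d, s=1, cmap='jet')
    plt.axis('off')
    plt.savefig(out_path, dpi=200, bbox_inches='tight')
    plt.close()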

midofalasol commented 8 months ago

Thank you very much for your meticulous reply. Most of my problems have been solved.

midofalasol commented 8 months ago

@nnnth @Haiyang-W Hello, I trained with the configuration above, but I have a new question. The final mAP is not high, roughly the same as training an image-only model with other methods. I think this is abnormal, but I do not understand why. To verify the gain from adding the lidar, I also evaluated the mAP of the forward-view area separately, but the results seemed to be about the same. Is there anything else that needs attention?

nnnth commented 8 months ago

Have you tried using the lidar alone? First, it is important to ensure that both lidar and image perform adequately when used individually. Second, some modifications to the fuser may be necessary, because the coverage of the lidar is now smaller than that of the images. I think you may need to pad the lidar BEV feature to align the coordinates.
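
A rough sketch of the padding idea (an assumption-heavy illustration, not the repo's code: it assumes the lidar branch produces a front-only BEV map of shape [B, C, 180, 360] that should occupy one half of the full [B, C, 360, 360] camera BEV grid; check against your own scatter layout to see which half needs the zeros):

import torch
import torch.nn.functional as F

def pad_front_bev_to_full(lidar_bev):
    # Zero-pad a front-only lidar BEV feature map (B, C, 180, 360) to the full
    # (B, C, 360, 360) grid so its coordinates line up with the camera BEV.
    b, c, h, w = lidar_bev.shape
    assert (h, w) == (180, 360), 'sketch assumes a half-range 180 x 360 grid'
    # F.pad pads the last dims as (left, right, top, bottom); here the
    # uncovered rear half is assumed to sit before the lidar rows.
    return F.pad(lidar_bev, (0, 0, 180, 0))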

midofalasol commented 8 months ago

@Haiyang-W Sorry, I have checked and found some problems with my evaluation configuration; the issues above have been resolved. Thank you for your reply.

Haiyang-W commented 8 months ago

> @Haiyang-W Sorry, I have checked and found some problems with my evaluation configuration; the issues above have been resolved. Thank you for your reply.

No worries, as long as it's resolved. Wish you success. :)

caffreypu commented 5 months ago

> @Haiyang-W Sorry, I have checked and found some problems with my evaluation configuration; the issues above have been resolved. Thank you for your reply.

Hi, did you change any other files besides that yaml?

SivlerGlow commented 5 months ago

Hello,

Thanks for sharing this excellent work. I am also interested in the setup with only a front-facing point cloud (FOV less than 180°) and 6 surround-view images in the multi-modal fusion. I followed the modifications in the config above and tested UniTR on the nuScenes-mini dataset. However, I ran into a problem that seems to be related to DSVT.

Traceback (most recent call last):
  File "tools/test.py", line 220, in <module>
    main()
  File "tools/test.py", line 216, in main
    eval_single_ckpt(model, test_loader, args, eval_output_dir, logger, epoch_id, dist_test=dist_test)
  File "tools/test.py", line 72, in eval_single_ckpt
    eval_utils.eval_one_epoch(
  File "/code/UniTR-main/tools/eval_utils/eval_utils.py", line 65, in eval_one_epoch
    pred_dicts, ret_dict = model(batch_dict)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/code/UniTR-main/pcdet/models/detectors/unitr.py", line 103, in forward
    batch_dict = cur_module(batch_dict)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/code/UniTR-main/pcdet/models/mm_backbone/unitr.py", line 192, in forward
    output = block(output, multi_set_voxel_inds_list[stage_id], multi_set_voxel_masks_list[stage_id], multi_pos_embed_list[stage_id][i],
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/code/UniTR-main/pcdet/models/mm_backbone/unitr.py", line 385, in forward
    output = layer(output, set_voxel_inds,
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/code/UniTR-main/pcdet/models/mm_backbone/unitr.py", line 404, in forward
    src = self.win_attn(src, pos, set_voxel_masks,
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/code/UniTR-main/pcdet/models/mm_backbone/unitr.py", line 488, in forward
    src = src + self.dropout1(src2)
RuntimeError: The size of tensor a (95912) must match the size of tensor b (95904) at non-singleton dimension 0
eval:  11%|          | 3/27 [00:08<01:09,  2.89s/it, recall_0.3=(0, 263) / 317]

I found a similar issue here: https://github.com/Haiyang-W/DSVT/issues/51. The comments quoted below were written by @chenshi3; thanks for his answer.

> I think the issue may be caused by the torch.floor operation in DSVTInputLayer. To debug this issue, I recommend reviewing the code in the get_set_single_shift function to see which operation causes the loss of voxel indices. If you're still having trouble, feel free to send me an email and we can work together to debug the issue more thoroughly.
>
> I find that you use an incorrect sparse_shape and window_shape, which should match the POINT_CLOUD_RANGE and VOXEL_SIZE. If these parameters are not set correctly, they can cause issues and lead to unexpected results.

I have tried multiple times to modify other lidar-related parameters in the config, but unfortunately the issue has not been resolved.

I suspect that some voxel_inds are not included during window shifting, but I don't know how to solve this problem.
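
A small debug probe along those lines (hypothetical code, not from this repo; the names set_voxel_inds and the token count come from the traceback above, and the probe only checks whether every token is still referenced by the set partition before the window attention):

import torch

def check_set_coverage(set_voxel_inds, num_tokens):
    # If some voxel indices are dropped during window shifting, they will be
    # missing here, which would explain a mismatch such as 95912 vs 95904.
    covered = torch.unique(set_voxel_inds.reshape(-1))
    missing = num_tokens - int(covered.numel())
    if missing != 0:
        print(f'{missing} of {num_tokens} tokens are not assigned to any set')
    return missing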

Modifications to config:

MODEL:
    NAME: UniTR

    MM_BACKBONE:
      NAME: UniTR
      PATCH_EMBED:
        in_channels: 3
        image_size: [256, 704]
        embed_dims: 128
        patch_size: 8 
        patch_norm: True 
        norm_cfg: {'type': 'LN'}

      IMAGE_INPUT_LAYER:
        sparse_shape: [32, 88, 1]
        d_model: [128]
        set_info: [[90, 4]]
        window_shape: [[30, 30, 1]]
        hybrid_factor: [1, 1, 1] # x, y, z
        shifts_list: [[[0, 0, 0], [15, 15, 0]]]
        input_image: True

      LIDAR_INPUT_LAYER:
        sparse_shape: [180, 360, 1]                 # modified
        d_model: [128]
        set_info: [[90, 4]]
        window_shape: [[30, 30, 1]]
        hybrid_factor: [1, 1, 1] # x, y, z
        shifts_list: [[[0, 0, 0], [15, 15, 0]]]

      set_info: [[90, 4]]
      d_model: [128]
      nhead: [8]
      dim_feedforward: [256]
      dropout: 0.0
      activation: gelu
      checkpoint_blocks: [0,1,2,3] # here can save 50% CUDA memory with marginal speed drop
      layer_cfg: {'use_bn': False, 'split_ffn': True, 'split_residual': True}

      # fuse backbone config
      FUSE_BACKBONE:
        IMAGE2LIDAR: 
          block_start: 3
          block_end: 4
          point_cloud_range: [-54.0, -54.0, -10.0, 54.0, 54.0, 10.0]
          voxel_size: [0.3,0.3,20.0]
          sample_num: 20
          image2lidar_layer:
            sparse_shape: [360, 360, 1]
            d_model: [128]
            set_info: [[90, 1]]
            window_shape: [[30, 30, 1]]
            hybrid_factor: [1, 1, 1]
            shifts_list: [[[0, 0, 0], [15, 15, 0]]]
            expand_max_voxels: 10
        LIDAR2IMAGE:
          block_start: 1
          block_end: 3
          point_cloud_range: [0.0, -54.0, -5.0, 54.0, 54.0, 3.0]                     # modified
          voxel_size: [0.3,0.3,8.0]
          sample_num: 1
          lidar2image_layer:
            sparse_shape: [96, 264, 6]
            d_model: [128]
            set_info: [[90, 2]]
            window_shape: [[30, 30, 1]]
            hybrid_factor: [1, 1, 1]
            shifts_list: [[[0, 0, 0], [15, 15, 0]]]
            expand_max_voxels: 30
      out_indices: []

    VFE:
      NAME: DynPillarVFE
      WITH_DISTANCE: False
      USE_ABSLOTE_XYZ: True
      USE_NORM: True
      NUM_FILTERS: [ 128, 128 ]

    MAP_TO_BEV:
      NAME: PointPillarScatter3d
      INPUT_SHAPE: [360, 360, 1]
      NUM_BEV_FEATURES: 128

    BACKBONE_2D:
      NAME: BaseBEVResBackbone
      LAYER_NUMS: [ 1, 2, 2, 2] # 
      LAYER_STRIDES: [1, 2, 2, 2]
      NUM_FILTERS: [128, 128, 256, 256]
      UPSAMPLE_STRIDES: [0.5, 1, 2, 4]
      NUM_UPSAMPLE_FILTERS: [128, 128, 128, 128]

If you could give me some advice, I would greatly appreciate it. @midofalasol @Haiyang-W @nnnth Thank you very much, and I look forward to hearing from you.