You can try modifying the following part of the config, specifically the point_cloud_range parameter:
FUSE_BACKBONE:
    IMAGE2LIDAR:
        block_start: 3
        block_end: 4
        point_cloud_range: [-54.0, -54.0, -10.0, 54.0, 54.0, 10.0]
        voxel_size: [0.3,0.3,20.0]
        sample_num: 20
        image2lidar_layer:
            sparse_shape: [360, 360, 1]
            d_model: [128]
            set_info: [[90, 1]]
            window_shape: [[30, 30, 1]]
            hybrid_factor: [1, 1, 1]
            shifts_list: [[[0, 0, 0], [15, 15, 0]]]
            expand_max_voxels: 10
    LIDAR2IMAGE:
        block_start: 1
        block_end: 3
        point_cloud_range: [-54.0, -54.0, -5.0, 54.0, 54.0, 3.0]
        voxel_size: [0.3,0.3,8.0]
        sample_num: 1
        lidar2image_layer:
            sparse_shape: [96, 264, 6]
            d_model: [128]
            set_info: [[90, 2]]
            window_shape: [[30, 30, 1]]
            hybrid_factor: [1, 1, 1]
            shifts_list: [[[0, 0, 0], [15, 15, 0]]]
            expand_max_voxels: 30
Since these two processes are handled separately, this setup is theoretically feasible in UniTR, but we haven't tested this scenario, so some boundary handling may be needed.
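For reference, the sparse_shape values above follow directly from point_cloud_range and voxel_size. Below is a minimal sketch of that relation (my own illustration, not code from the repo; the axis ordering of sparse_shape is an assumption and should be checked against DSVTInputLayer). Restricting the lidar range to a forward-only FOV, e.g. x in [0, 54], is the kind of modification meant here:

# Minimal sketch: grid shape implied by a point_cloud_range / voxel_size pair.
def sparse_shape(point_cloud_range, voxel_size):
    x_min, y_min, z_min, x_max, y_max, z_max = point_cloud_range
    return [round((x_max - x_min) / voxel_size[0]),
            round((y_max - y_min) / voxel_size[1]),
            round((z_max - z_min) / voxel_size[2])]

# Full [-54, 54] range with 0.3 m voxels -> the 360 x 360 x 1 grid used in IMAGE2LIDAR.
print(sparse_shape([-54.0, -54.0, -10.0, 54.0, 54.0, 10.0], [0.3, 0.3, 20.0]))  # [360, 360, 1]
# Forward-only lidar range (x in [0, 54]) -> 180 cells along the restricted axis.
print(sparse_shape([0.0, -54.0, -10.0, 54.0, 54.0, 10.0], [0.3, 0.3, 20.0]))    # [180, 360, 1]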
@nnnth Thank you for your quick reply! I understand that the yaml should be changed as follows (the INPUT_SHAPE in the MAP_TO_BEV part should be the larger of the two sensors' BEV grids). Could you please help check it?
MODEL:
    NAME: UniTR
    MM_BACKBONE:
        NAME: UniTR
        PATCH_EMBED:
            in_channels: 3
            image_size: [256, 704]
            embed_dims: 128
            patch_size: 8
            patch_norm: True
            norm_cfg: {'type': 'LN'}
        IMAGE_INPUT_LAYER:
            sparse_shape: [32, 88, 1]
            d_model: [128]
            set_info: [[90, 4]]
            window_shape: [[30, 30, 1]]
            hybrid_factor: [1, 1, 1] # x, y, z
            shifts_list: [[[0, 0, 0], [15, 15, 0]]]
            input_image: True
        LIDAR_INPUT_LAYER:
            sparse_shape: [180, 360, 1] # modified
            d_model: [128]
            set_info: [[90, 4]]
            window_shape: [[30, 30, 1]]
            hybrid_factor: [1, 1, 1] # x, y, z
            shifts_list: [[[0, 0, 0], [15, 15, 0]]]
        set_info: [[90, 4]]
        d_model: [128]
        nhead: [8]
        dim_feedforward: [256]
        dropout: 0.0
        activation: gelu
        checkpoint_blocks: [0,1,2,3] # here can save 50% CUDA memory with marginal speed drop
        layer_cfg: {'use_bn': False, 'split_ffn': True, 'split_residual': True}
        # fuse backbone config
        FUSE_BACKBONE:
            IMAGE2LIDAR:
                block_start: 3
                block_end: 4
                point_cloud_range: [-54.0, -54.0, -10.0, 54.0, 54.0, 10.0]
                voxel_size: [0.3,0.3,20.0]
                sample_num: 20
                image2lidar_layer:
                    sparse_shape: [360, 360, 1]
                    d_model: [128]
                    set_info: [[90, 1]]
                    window_shape: [[30, 30, 1]]
                    hybrid_factor: [1, 1, 1]
                    shifts_list: [[[0, 0, 0], [15, 15, 0]]]
                    expand_max_voxels: 10
            LIDAR2IMAGE:
                block_start: 1
                block_end: 3
                point_cloud_range: [0.0, -54.0, -5.0, 54.0, 54.0, 3.0] # modified
                voxel_size: [0.3,0.3,8.0]
                sample_num: 1
                lidar2image_layer:
                    sparse_shape: [96, 264, 3] # modified
                    d_model: [128]
                    set_info: [[90, 2]]
                    window_shape: [[30, 30, 1]]
                    hybrid_factor: [1, 1, 1]
                    shifts_list: [[[0, 0, 0], [15, 15, 0]]]
                    expand_max_voxels: 30
        out_indices: []
    VFE:
        NAME: DynPillarVFE
        WITH_DISTANCE: False
        USE_ABSLOTE_XYZ: True
        USE_NORM: True
        NUM_FILTERS: [ 128, 128 ]
    MAP_TO_BEV:
        NAME: PointPillarScatter3d
        INPUT_SHAPE: [360, 360, 1]
        NUM_BEV_FEATURES: 128
    BACKBONE_2D:
        NAME: BaseBEVResBackbone
        LAYER_NUMS: [ 1, 2, 2, 2]
        LAYER_STRIDES: [1, 2, 2, 2]
        NUM_FILTERS: [128, 128, 256, 256]
        UPSAMPLE_STRIDES: [0.5, 1, 2, 4]
        NUM_UPSAMPLE_FILTERS: [128, 128, 128, 128]
I ran a similar configuration (with lss) through successfully, but I still want to make sure that the features of the two modalities are sufficiently fused.
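As a quick self-check of the INPUT_SHAPE point above (a sketch of my own, not from the repo; it just reuses the ranges and voxel sizes from the yaml), the MAP_TO_BEV grid should cover the larger of the two branches' BEV grids:

def grid(pc_range, voxel_size):
    # Cells along x, y, z implied by a point_cloud_range / voxel_size pair.
    return [round((pc_range[i + 3] - pc_range[i]) / voxel_size[i]) for i in range(3)]

lidar_grid = grid([0.0, -54.0, -5.0, 54.0, 54.0, 3.0], [0.3, 0.3, 8.0])       # [180, 360, 1]
image_grid = grid([-54.0, -54.0, -10.0, 54.0, 54.0, 10.0], [0.3, 0.3, 20.0])  # [360, 360, 1]
print([max(a, b) for a, b in zip(lidar_grid, image_grid)])  # [360, 360, 1] -> MAP_TO_BEV INPUT_SHAPE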
Sincerely appreciate your help~
In addition, I noticed that unitr+lss.yaml trains for only 10 epochs. Do I need to train the two modalities separately first and then load the pre-trained models for fusion? @nnnth @Haiyang-W
One major advantage of our model is its one-stage training, eliminating the need to separately train two branches. It only requires one initialization. Currently, we use ImageNet initialization, but you can even use pre-trained parameters from ViT.
This means that the entire training process only requires these 10 epochs.
I think most of your modifications are right except for the sparse shape in Lidar2Image.
All image features should undergo self-attention, regardless of whether they are covered by point clouds or not. Therefore, I believe the sparse shape should still be [96, 264, 6].
The best way to verify is through visualization, such as visualizing the mapped point clouds in image space.
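For example, a generic projection sketch along those lines (not the UniTR code path; lidar2cam, K, and the image size are placeholders you would take from your own calibration):

import numpy as np

def project_points_to_image(points_xyz, lidar2cam, K, img_h, img_w):
    """Project N x 3 lidar points into pixel coordinates; returns (u, v) and a validity mask."""
    pts_h = np.concatenate([points_xyz, np.ones((points_xyz.shape[0], 1))], axis=1)  # homogeneous
    cam_pts = (lidar2cam @ pts_h.T).T[:, :3]      # lidar frame -> camera frame (4x4 extrinsics)
    in_front = cam_pts[:, 2] > 0.1                # keep points in front of the camera
    uv = (K @ cam_pts.T).T                        # 3x3 intrinsics
    uv = uv[:, :2] / uv[:, 2:3]                   # perspective division
    in_img = (uv[:, 0] >= 0) & (uv[:, 0] < img_w) & (uv[:, 1] >= 0) & (uv[:, 1] < img_h)
    return uv, in_front & in_img

# Usage idea: scatter uv[mask] on top of each camera image (e.g. with matplotlib) to see
# which cameras the front-looking lidar actually covers after your range modification.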
Feel free to ask if you have any other questions.
Thank you very much for your meticulous reply. Most of my problems have been solved.
@nnnth @Haiyang-W Hello, I trained with the above configuration, but I have a new point of confusion. The mAP at the end of training was not high, about the same as training the image modality alone with other methods. This seems abnormal to me, but I don't understand why. To verify the gain from adding the lidar, I also evaluated the mAP of the forward-view area separately, but the results seemed about the same. Is there anything else that needs attention?
Have you tried using lidar alone? Firstly, it's important to ensure that both lidar and image perform adequately when used individually. Secondly, it may be necessary to make some modifications to the Fuser, since currently the coverage range of the LiDAR is smaller than that of the images. I think you may need to pad the lidar BEV feature to align the coordinates.
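A minimal sketch of that padding idea (my own illustration; it assumes the lidar BEV grid covers only the forward half of the 360 x 360 fusion grid, and the padded axis/side must be checked against how PointPillarScatter3d lays out the feature map):

import torch
import torch.nn.functional as F

# Hypothetical lidar BEV feature that only covers x >= 0: (batch, channels, 360, 180).
lidar_bev = torch.zeros(1, 128, 360, 180)
# Zero-pad the missing half of the scene so the lidar map lines up with the 360 x 360
# camera BEV map before fusion (F.pad pads the last dimension first: (left, right, top, bottom)).
lidar_bev_padded = F.pad(lidar_bev, (180, 0, 0, 0))
print(lidar_bev_padded.shape)  # torch.Size([1, 128, 360, 360])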
@Haiyang-W Sorry, I have checked that there are some problems with my evaluation configuration, and the above problems have been solved. Thank you for your reply.
No worry, as long as it's resolved. Wish you success. :)
Hi, did you change any other files besides that yaml?
Hello,
Thanks for sharing the excellent work. I am also interested in the setting of a point cloud with only a front-looking view (FOV less than 180°) plus 6 surround-view images in the multimodal fusion. I followed the modifications to the config above and tested UniTR on the nuScenes-mini dataset. However, I encountered a problem that seems to be related to DSVT.
Traceback (most recent call last):
  File "tools/test.py", line 220, in <module>
    main()
  File "tools/test.py", line 216, in main
    eval_single_ckpt(model, test_loader, args, eval_output_dir, logger, epoch_id, dist_test=dist_test)
  File "tools/test.py", line 72, in eval_single_ckpt
    eval_utils.eval_one_epoch(
  File "/code/UniTR-main/tools/eval_utils/eval_utils.py", line 65, in eval_one_epoch
    pred_dicts, ret_dict = model(batch_dict)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/code/UniTR-main/pcdet/models/detectors/unitr.py", line 103, in forward
    batch_dict = cur_module(batch_dict)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/code/UniTR-main/pcdet/models/mm_backbone/unitr.py", line 192, in forward
    output = block(output, multi_set_voxel_inds_list[stage_id], multi_set_voxel_masks_list[stage_id], multi_pos_embed_list[stage_id][i],
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/code/UniTR-main/pcdet/models/mm_backbone/unitr.py", line 385, in forward
    output = layer(output, set_voxel_inds,
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/code/UniTR-main/pcdet/models/mm_backbone/unitr.py", line 404, in forward
    src = self.win_attn(src, pos, set_voxel_masks,
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/code/UniTR-main/pcdet/models/mm_backbone/unitr.py", line 488, in forward
    src = src + self.dropout1(src2)
RuntimeError: The size of tensor a (95912) must match the size of tensor b (95904) at non-singleton dimension 0
eval: 11%| | 3/27 [00:08<01:09, 2.89s/it, recall_0.3=(0, 263) / 317]
I found a similar issue at this link: https://github.com/Haiyang-W/DSVT/issues/51. The comments below were written by @chenshi3; thanks for his answer.
I think the issue may be caused by torch.floor operation in DSVTInputLayer. To debug this issue, I recommend reviewing the code in the get_set_single_shift function to see which operation causes loss of voxel index. If you're still having trouble, feel free to send me an email and we can work together to debug the issue more thoroughly.
I find that you use the incorrect sparse_shape and window_shape which should match the POINT_CLOUD_RANGE and VOXEL_SIZE. If these parameters are not set correctly, they can cause issues and lead to unexpected results.
I have tried multiple times to modify other lidar-related parameters in the config, but unfortunately this issue has not been resolved.
I suspect that some voxel_inds are not included during window shifting, but I don't know how to solve this problem.
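Along the lines of @chenshi3's suggestion, a hedged debugging sketch I would try (the variable names follow the traceback; the exact tensors in your batch_dict may differ): check whether every voxel index still appears in the set partition after window shifting. The 95912 vs 95904 mismatch would be consistent with a handful of voxels being dropped there.

import torch

def check_set_coverage(set_voxel_inds, num_voxels):
    """Sanity check: every voxel index in [0, num_voxels) should appear at least once in the
    set partition (padding duplicates are fine). Missing indices are the voxels that got lost
    during window shifting."""
    covered = torch.unique(set_voxel_inds.reshape(-1))
    covered = covered[(covered >= 0) & (covered < num_voxels)]
    missing = num_voxels - covered.numel()
    if missing > 0:
        print(f"{missing} voxel indices are missing from the set partition")
    return missing

# e.g. inside pcdet/models/mm_backbone/unitr.py, just before the block loop
# (stage_id / shift indexing here is an assumption):
# check_set_coverage(multi_set_voxel_inds_list[stage_id][shift_id], output.shape[0])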
Modifications to config:
MODEL:
    NAME: UniTR
    MM_BACKBONE:
        NAME: UniTR
        PATCH_EMBED:
            in_channels: 3
            image_size: [256, 704]
            embed_dims: 128
            patch_size: 8
            patch_norm: True
            norm_cfg: {'type': 'LN'}
        IMAGE_INPUT_LAYER:
            sparse_shape: [32, 88, 1]
            d_model: [128]
            set_info: [[90, 4]]
            window_shape: [[30, 30, 1]]
            hybrid_factor: [1, 1, 1] # x, y, z
            shifts_list: [[[0, 0, 0], [15, 15, 0]]]
            input_image: True
        LIDAR_INPUT_LAYER:
            sparse_shape: [180, 360, 1] # modified
            d_model: [128]
            set_info: [[90, 4]]
            window_shape: [[30, 30, 1]]
            hybrid_factor: [1, 1, 1] # x, y, z
            shifts_list: [[[0, 0, 0], [15, 15, 0]]]
        set_info: [[90, 4]]
        d_model: [128]
        nhead: [8]
        dim_feedforward: [256]
        dropout: 0.0
        activation: gelu
        checkpoint_blocks: [0,1,2,3] # here can save 50% CUDA memory with marginal speed drop
        layer_cfg: {'use_bn': False, 'split_ffn': True, 'split_residual': True}
        # fuse backbone config
        FUSE_BACKBONE:
            IMAGE2LIDAR:
                block_start: 3
                block_end: 4
                point_cloud_range: [-54.0, -54.0, -10.0, 54.0, 54.0, 10.0]
                voxel_size: [0.3,0.3,20.0]
                sample_num: 20
                image2lidar_layer:
                    sparse_shape: [360, 360, 1]
                    d_model: [128]
                    set_info: [[90, 1]]
                    window_shape: [[30, 30, 1]]
                    hybrid_factor: [1, 1, 1]
                    shifts_list: [[[0, 0, 0], [15, 15, 0]]]
                    expand_max_voxels: 10
            LIDAR2IMAGE:
                block_start: 1
                block_end: 3
                point_cloud_range: [0.0, -54.0, -5.0, 54.0, 54.0, 3.0] # modified
                voxel_size: [0.3,0.3,8.0]
                sample_num: 1
                lidar2image_layer:
                    sparse_shape: [96, 264, 6]
                    d_model: [128]
                    set_info: [[90, 2]]
                    window_shape: [[30, 30, 1]]
                    hybrid_factor: [1, 1, 1]
                    shifts_list: [[[0, 0, 0], [15, 15, 0]]]
                    expand_max_voxels: 30
        out_indices: []
    VFE:
        NAME: DynPillarVFE
        WITH_DISTANCE: False
        USE_ABSLOTE_XYZ: True
        USE_NORM: True
        NUM_FILTERS: [ 128, 128 ]
    MAP_TO_BEV:
        NAME: PointPillarScatter3d
        INPUT_SHAPE: [360, 360, 1]
        NUM_BEV_FEATURES: 128
    BACKBONE_2D:
        NAME: BaseBEVResBackbone
        LAYER_NUMS: [ 1, 2, 2, 2]
        LAYER_STRIDES: [1, 2, 2, 2]
        NUM_FILTERS: [128, 128, 256, 256]
        UPSAMPLE_STRIDES: [0.5, 1, 2, 4]
        NUM_UPSAMPLE_FILTERS: [128, 128, 128, 128]
If you could give me some advice, I would greatly appreciate it. @midofalasol @Haiyang-W @nnnth Thank you very much and look forward to hearing from you.
Hello,
Thanks for sharing your excellent work. Very much appreciated. But I have a specific question about my own setup.
If there is only a point cloud with a front-looking view (FOV less than 180°) and 6 surround-view images in the multimodal fusion, is the UniTR method still effective? Is it possible to set the IMAGE2LIDAR and LIDAR2IMAGE sections to different point cloud ranges? If the approach is still viable, do I need to configure anything else?
Thank you very much and look forward to hearing from you.