dvlab-research / LargeKernel3D

LargeKernel3D: Scaling up Kernels in 3D Sparse CNNs (CVPR 2023)
https://arxiv.org/abs/2206.10555
Apache License 2.0

About detection training gpu num #9

Open fjzpcmj opened 1 year ago

fjzpcmj commented 1 year ago

Dear Author, I trained a detection model with the config "nusc_centerpoint_voxelnet_0075voxel_fix_bn_z_largekernel3d_large.py" on 8 GPUs, but the model performance is lower than reported. Do you know what's wrong with my trained model?

Performance detail: mAP: 0.5944, mATE: 0.2902, mASE: 0.2516, mAOE: 0.3343, mAVE: 0.2870, mAAE: 0.1911, NDS: 0.6618, Eval time: 104.5 s
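For reference, the NDS above is consistent with the component metrics under the official nuScenes composite score; a minimal check (plain Python, not tied to any repo code):

```python
# nuScenes detection score: NDS = (5 * mAP + sum(1 - min(1, mTP))) / 10,
# using the summary metrics reported above.
mAP = 0.5944
tp_errors = {"mATE": 0.2902, "mASE": 0.2516, "mAOE": 0.3343,
             "mAVE": 0.2870, "mAAE": 0.1911}

nds = (5 * mAP + sum(1 - min(1.0, e) for e in tp_errors.values())) / 10
print(f"NDS = {nds:.4f}")  # -> 0.6618, matching the reported score
```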

Per-class results:

| Object Class | AP | ATE | ASE | AOE | AVE | AAE |
|---|---|---|---|---|---|---|
| car | 0.851 | 0.181 | 0.153 | 0.110 | 0.267 | 0.193 |
| truck | 0.563 | 0.331 | 0.180 | 0.116 | 0.255 | 0.227 |
| bus | 0.707 | 0.333 | 0.177 | 0.073 | 0.487 | 0.268 |
| trailer | 0.393 | 0.500 | 0.201 | 0.564 | 0.221 | 0.183 |
| construction_vehicle | 0.199 | 0.713 | 0.434 | 1.025 | 0.123 | 0.312 |
| pedestrian | 0.845 | 0.147 | 0.272 | 0.389 | 0.219 | 0.099 |
| motorcycle | 0.583 | 0.202 | 0.238 | 0.240 | 0.499 | 0.231 |
| bicycle | 0.415 | 0.162 | 0.268 | 0.414 | 0.225 | 0.016 |
| traffic_cone | 0.690 | 0.136 | 0.322 | nan | nan | nan |
| barrier | 0.698 | 0.198 | 0.270 | 0.077 | nan | nan |

Evaluation nusc: Nusc v1.0-trainval, per-class AP at distance thresholds 0.5, 1.0, 2.0, 4.0 m:

| Class | AP@0.5 | AP@1.0 | AP@2.0 | AP@4.0 | mean AP |
|---|---|---|---|---|---|
| car | 75.43 | 85.77 | 88.99 | 90.23 | 0.8511 |
| truck | 37.71 | 54.44 | 64.86 | 68.22 | 0.5631 |
| construction_vehicle | 3.41 | 12.18 | 27.07 | 37.13 | 0.1995 |
| bus | 46.04 | 69.87 | 82.21 | 84.80 | 0.7073 |
| trailer | 11.06 | 35.42 | 50.44 | 60.39 | 0.3933 |
| barrier | 59.75 | 70.12 | 73.86 | 75.32 | 0.6976 |
| motorcycle | 52.38 | 59.28 | 60.53 | 61.11 | 0.5832 |
| bicycle | 40.47 | 41.59 | 41.82 | 41.98 | 0.4146 |
| pedestrian | 82.15 | 83.86 | 85.29 | 86.59 | 0.8447 |
| traffic_cone | 66.22 | 67.69 | 69.40 | 72.53 | 0.6896 |

yukang2017 commented 1 year ago

Hi @fjzpcmj ,

Thanks for your interest in our work, and sorry for the late reply; I have some deadlines this week. I will check nusc_centerpoint_voxelnet_0075voxel_fix_bn_z_largekernel3d_large.py.

I used 4 GPUs for training. Would you please try nusc_centerpoint_voxelnet_0075voxel_fix_bn_z_largekernel3d_tiny.py? Its performance is more stable.

Regards, Yukang Chen

fjzpcmj commented 1 year ago

Thanks for your reply, I will try nusc_centerpoint_voxelnet_0075voxel_fix_bn_z_largekernel3d_tiny.py. Could you tell me which performs better, "large" vs. "tiny"?

yukang2017 commented 1 year ago

Thanks for your message. Generally, "large" performs a bit better than "tiny" (by less than 0.5 mAP), but "tiny" is more stable and faster.

fjzpcmj commented 1 year ago

Thanks very much.

fjzpcmj commented 1 year ago

Dear @yukang2017, I trained the "tiny" model with 4 GPUs, and the mAP is still 59, below the reported 63.3. Do you know what the problem is? In addition, I downloaded the pretrained model (63.3 mAP) and tested it. It seems the downloaded model's structure differs from the structure in the "tiny" config.


The model and loaded state dict do not match exactly

unexpected key in source state_dict: backbone.conv1.0.conv1.conv3x3_1.weight, backbone.conv1.0.conv1.conv3x3_1.bias, backbone.conv1.0.conv2.conv3x3_1.weight, backbone.conv1.0.conv2.conv3x3_1.bias, backbone.conv1.1.conv1.conv3x3_1.weight, backbone.conv1.1.conv1.conv3x3_1.bias, backbone.conv1.1.conv2.conv3x3_1.weight, backbone.conv1.1.conv2.conv3x3_1.bias, backbone.conv2.3.conv1.weight, backbone.conv2.3.conv1.bias, backbone.conv2.3.conv2.weight, backbone.conv2.3.conv2.bias, backbone.conv2.4.conv1.weight, backbone.conv2.4.conv1.bias, backbone.conv2.4.conv2.weight, backbone.conv2.4.conv2.bias, backbone.conv3.3.conv1.weight, backbone.conv3.3.conv1.bias, backbone.conv3.3.conv2.weight, backbone.conv3.3.conv2.bias, backbone.conv3.4.conv1.weight, backbone.conv3.4.conv1.bias, backbone.conv3.4.conv2.weight, backbone.conv3.4.conv2.bias

missing keys in source state_dict: backbone.conv2.4.conv2.block.weight, backbone.conv1.1.conv2.block.position_embedding, backbone.conv3.4.conv2.block.bias, backbone.conv2.4.conv1.block.weight, backbone.conv3.4.conv1.conv3x3_1.weight, backbone.conv2.4.conv2.conv3x3_1.weight, backbone.conv3.3.conv1.conv3x3_1.weight, backbone.conv2.3.conv1.conv3x3_1.bias, backbone.conv1.1.conv1.block.position_embedding, backbone.conv3.3.conv1.block.weight, backbone.conv3.3.conv2.conv3x3_1.weight, backbone.conv3.4.conv2.conv3x3_1.bias, backbone.conv2.3.conv1.conv3x3_1.weight, backbone.conv3.4.conv1.block.bias, backbone.conv3.4.conv1.block.weight, backbone.conv2.3.conv2.conv3x3_1.bias, backbone.conv1.0.conv1.block.position_embedding, backbone.conv3.3.conv1.conv3x3_1.bias, backbone.conv2.4.conv2.block.bias, backbone.conv3.3.conv2.block.bias, backbone.conv3.4.conv1.conv3x3_1.bias, backbone.conv2.4.conv1.conv3x3_1.bias, backbone.conv3.3.conv2.conv3x3_1.bias, backbone.conv2.3.conv2.conv3x3_1.weight, backbone.conv2.3.conv2.block.weight, backbone.conv2.4.conv1.block.bias, backbone.conv1.0.conv2.block.position_embedding, backbone.conv3.4.conv2.block.weight, backbone.conv2.3.conv1.block.bias, backbone.conv2.3.conv2.block.bias, backbone.conv3.3.conv1.block.bias, backbone.conv2.4.conv1.conv3x3_1.weight, backbone.conv3.3.conv2.block.weight, backbone.conv2.4.conv2.conv3x3_1.bias, backbone.conv2.3.conv1.block.weight, backbone.conv3.4.conv2.conv3x3_1.weight

these keys have mismatched shape:

| key | expected shape | loaded shape |
|---|---|---|
| backbone.conv1.0.conv1.block.weight | torch.Size([3, 3, 3, 16, 16]) | torch.Size([7, 7, 7, 16, 16]) |
| backbone.conv1.0.conv2.block.weight | torch.Size([3, 3, 3, 16, 16]) | torch.Size([7, 7, 7, 16, 16]) |
| backbone.conv1.1.conv1.block.weight | torch.Size([3, 3, 3, 16, 16]) | torch.Size([7, 7, 7, 16, 16]) |
| backbone.conv1.1.conv2.block.weight | torch.Size([3, 3, 3, 16, 16]) | torch.Size([7, 7, 7, 16, 16]) |
| backbone.conv4.3.conv1.weight | torch.Size([128, 3, 3, 3, 128]) | torch.Size([5, 5, 5, 128, 128]) |
| backbone.conv4.3.conv2.weight | torch.Size([128, 3, 3, 3, 128]) | torch.Size([5, 5, 5, 128, 128]) |
| backbone.conv4.4.conv1.weight | torch.Size([128, 3, 3, 3, 128]) | torch.Size([5, 5, 5, 128, 128]) |
| backbone.conv4.4.conv2.weight | torch.Size([128, 3, 3, 3, 128]) | torch.Size([5, 5, 5, 128, 128]) |
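A quick way to confirm this kind of config/checkpoint mismatch is to inspect the kernel shapes stored in the released checkpoint directly; a minimal sketch (the file name and the 'state_dict' wrapper key are assumptions about how the checkpoint is saved):

```python
import torch

# Load the downloaded checkpoint on CPU and unwrap the weights.
ckpt = torch.load("largekernel3d_tiny.pth", map_location="cpu")
state_dict = ckpt.get("state_dict", ckpt)  # some checkpoints store weights directly

# Print the 5-D sparse-conv kernel shapes in the backbone.
for name, tensor in state_dict.items():
    if "backbone" in name and name.endswith("weight") and tensor.dim() == 5:
        print(name, tuple(tensor.shape))
```

Shapes such as (7, 7, 7, 16, 16) in the checkpoint versus the (3, 3, 3, 16, 16) expected by the config are exactly the mismatches printed above.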

yukang2017 commented 1 year ago

Hi,

For training, to reproduce the reported result, please disable the GT-sampling augmentation in the last 5 epochs. This trick is listed in the implementation details.
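For reference, a minimal sketch of this schedule in a generic training loop (the dataset class and flag names below are illustrative, not the actual det3d/LargeKernel3D API; in practice this corresponds to turning off the GT database sampler for the final epochs):

```python
# Illustrative sketch: train with GT-sampling augmentation, then disable it
# for the last FADE_EPOCHS epochs, as suggested above.
TOTAL_EPOCHS = 20
FADE_EPOCHS = 5


class GTSampledDataset:
    """Stand-in for a point-cloud dataset with optional GT-database sampling."""

    def __init__(self):
        self.use_gt_sampling = True

    def set_gt_sampling(self, enabled: bool):
        self.use_gt_sampling = enabled


def train_one_epoch(dataset: GTSampledDataset, epoch: int):
    # Placeholder for the real training loop.
    mode = "with" if dataset.use_gt_sampling else "without"
    print(f"epoch {epoch}: training {mode} GT-sampling")


dataset = GTSampledDataset()
for epoch in range(TOTAL_EPOCHS):
    dataset.set_gt_sampling(epoch < TOTAL_EPOCHS - FADE_EPOCHS)
    train_one_epoch(dataset, epoch)
```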

For testing, sorry for this misalignment. I double-checked the config file and found some typos. I have fixed it to be aligned with the checkpoint; please try again.

fjzpcmj commented 1 year ago

Thanks very much. I have reproduced the result.