RangiLyu / nanodet

NanoDet-Plus⚡Super fast and lightweight anchor-free object detection model. 🔥Only 980 KB (int8) / 1.8 MB (fp16), runs at 97 FPS on a cellphone🔥
Apache License 2.0

x1.5 model consistently underperforming #524

Closed raember closed 11 months ago

raember commented 1 year ago

I'm using NanoDet-Plus for research purposes but ran into a weird issue: the x1.5 model variants consistently underperform the x1.0 models. This happens on VISEM, Argoverse-HD, and even COCO2017. For COCO2017 I used the stock configs provided in the git repo (x1.0 and x1.5), yet the COCO mAP curves already separate after the first 10 epochs, with the bigger variant consistently scoring lower. From my observations, this holds for all three datasets:

[image: validation mAP curves for all three datasets]

I get the following mAP metrics:

| dataset      | x1.0  | x1.5  | epochs |
|--------------|-------|-------|--------|
| VISEM        | 6.9%  | 2.0%  | 100    |
| Argoverse-HD | 24.1% | 21.2% | 70     |
| COCO2017     | 20.9% | 14.5% | 40*    |

*The models are still training as of now, but the separation in mAP mentioned above is already distinctly visible in the logs. I will let the runs continue until they reach 300 epochs, as the stock config dictates.

Here are the configs I used for VISEM and Argoverse-HD:

VISEM x1.0

```yml
save_dir: workspace/baseline/visem/nanodet-plus-m-1.0x-dgxa100
model:
  arch:
    backbone:
      name: ShuffleNetV2
      model_size: 1.0x
      out_stages: [2, 3, 4]
      activation: LeakyReLU
      channels: 3
    fpn:
      name: GhostPAN
      in_channels: [116, 232, 464]
      out_channels: 96
      kernel_size: 5
      num_extra_level: 1
      use_depthwise: true
      activation: LeakyReLU
    head:
      name: NanoDetPlusHead
      num_classes: 3
      input_channel: 96
      feat_channels: 96
      stacked_convs: 2
      kernel_size: 5
      strides: [8, 16, 32, 64]
      activation: LeakyReLU
      reg_max: 7
      norm_cfg: {type: BN}
      loss:
        loss_qfl: {name: QualityFocalLoss, use_sigmoid: true, beta: 2.0, loss_weight: 1.0}
        loss_dfl: {name: DistributionFocalLoss, loss_weight: 0.25}
        loss_bbox: {name: GIoULoss, loss_weight: 2.0}
    name: NanoDetPlus
    detach_epoch: 10
    aux_head:
      name: SimpleConvHead
      num_classes: 3
      input_channel: 192
      feat_channels: 192
      stacked_convs: 4
      strides: [8, 16, 32, 64]
      activation: LeakyReLU
      reg_max: 7
  weight_averager: {name: ExpMovingAverager, decay: 0.9998}
device:
  precision: 16
  gpu_ids: [0]
  workers_per_gpu: 28
  batchsize_per_gpu: 92
schedule:
  optimizer: {name: AdamW, lr: 0.001, weight_decay: 1.0e-05}
  warmup: {name: linear, steps: 500, ratio: 0.0001}
  total_epochs: 10
  lr_schedule: {name: CosineAnnealingLR, T_max: 300, eta_min: 5.0e-05}
  val_intervals: 10
log: {interval: 50}
test: {}
grad_clip: 35
evaluator: {name: CocoDetectionEvaluator, save_key: mAP}
class_names: &id001 [sperm, cluster, small/pinhead]
data:
  train:
    name: VISEMDataset
    pipeline:
      perspective: 0
      scale: [0.8, 1.2]
      stretch:
      - [0.95, 1.05]
      - [0.95, 1.05]
      rotation: 180
      shear: 10
      translate: 0.2
      flip: 0.5
      brightness: 0.2
      contrast: [0.6, 1.4]
      saturation: [0.6, 1.2]
      normalize: &id002
      - [123.675, 116.28, 103.53]
      - [58.395, 57.12, 57.375]
    img_path: ../data/VISEM/VISEM_Tracking_Train_v4/Train
    ann_path: ../data/VISEM/VISEM_Tracking_Train_v4/Train
    class_names: *id001
    input_size: &id003 [640, 480]
    keep_ratio: false
  val:
    name: VISEMDataset
    pipeline:
      normalize: *id002
    img_path: ../data/VISEM/VISEM_Tracking_Train_v4/Train
    ann_path: ../data/VISEM/VISEM_Tracking_Train_v4/Train
    class_names: *id001
    input_size: *id003
    keep_ratio: false
```
VISEM x1.5

```yml
save_dir: workspace/baseline/visem/nanodet-plus-m-1.5x-dgxa100
model:
  arch:
    backbone:
      name: ShuffleNetV2
      model_size: 1.5x
      out_stages: [2, 3, 4]
      activation: LeakyReLU
      channels: 3
    fpn:
      name: GhostPAN
      in_channels: [176, 352, 704]
      out_channels: 128
      kernel_size: 5
      num_extra_level: 1
      use_depthwise: true
      activation: LeakyReLU
    head:
      name: NanoDetPlusHead
      num_classes: 3
      input_channel: 128
      feat_channels: 128
      stacked_convs: 2
      kernel_size: 5
      strides: [8, 16, 32, 64]
      activation: LeakyReLU
      reg_max: 7
      norm_cfg: {type: BN}
      loss:
        loss_qfl: {name: QualityFocalLoss, use_sigmoid: true, beta: 2.0, loss_weight: 1.0}
        loss_dfl: {name: DistributionFocalLoss, loss_weight: 0.25}
        loss_bbox: {name: GIoULoss, loss_weight: 2.0}
    name: NanoDetPlus
    detach_epoch: 10
    aux_head:
      name: SimpleConvHead
      num_classes: 3
      input_channel: 256
      feat_channels: 256
      stacked_convs: 4
      strides: [8, 16, 32, 64]
      activation: LeakyReLU
      reg_max: 7
  weight_averager: {name: ExpMovingAverager, decay: 0.9998}
device:
  precision: 16
  gpu_ids: [0]
  workers_per_gpu: 28
  batchsize_per_gpu: 88
schedule:
  optimizer: {name: AdamW, lr: 0.001, weight_decay: 1.0e-05}
  warmup: {name: linear, steps: 500, ratio: 0.0001}
  total_epochs: 10
  lr_schedule: {name: CosineAnnealingLR, T_max: 300, eta_min: 5.0e-05}
  val_intervals: 10
log: {interval: 50}
test: {}
grad_clip: 35
evaluator: {name: CocoDetectionEvaluator, save_key: mAP}
class_names: &id001 [sperm, cluster, small/pinhead]
data:
  train:
    name: VISEMDataset
    pipeline:
      perspective: 0
      scale: [0.8, 1.2]
      stretch:
      - [0.95, 1.05]
      - [0.95, 1.05]
      rotation: 180
      shear: 10
      translate: 0.2
      flip: 0.5
      brightness: 0.2
      contrast: [0.6, 1.4]
      saturation: [0.6, 1.2]
      normalize: &id002
      - [123.675, 116.28, 103.53]
      - [58.395, 57.12, 57.375]
    img_path: ../data/VISEM/VISEM_Tracking_Train_v4/Train
    ann_path: ../data/VISEM/VISEM_Tracking_Train_v4/Train
    class_names: *id001
    input_size: &id003 [640, 480]
    keep_ratio: false
  val:
    name: VISEMDataset
    pipeline:
      normalize: *id002
    img_path: ../data/VISEM/VISEM_Tracking_Train_v4/Train
    ann_path: ../data/VISEM/VISEM_Tracking_Train_v4/Train
    class_names: *id001
    input_size: *id003
    keep_ratio: false
```
Argoverse-HD x1.0

```yml
save_dir: workspace/baseline/argoverse/nanodet-plus-m-1.0x-dgxa100
model:
  arch:
    backbone:
      name: ShuffleNetV2
      model_size: 1.0x
      out_stages: [2, 3, 4]
      activation: LeakyReLU
      channels: 3
    fpn:
      name: GhostPAN
      in_channels: [116, 232, 464]
      out_channels: 96
      kernel_size: 5
      num_extra_level: 1
      use_depthwise: true
      activation: LeakyReLU
    head:
      name: NanoDetPlusHead
      num_classes: 8
      input_channel: 96
      feat_channels: 96
      stacked_convs: 2
      kernel_size: 5
      strides: [8, 16, 32, 64]
      activation: LeakyReLU
      reg_max: 7
      norm_cfg: {type: BN}
      loss:
        loss_qfl: {name: QualityFocalLoss, use_sigmoid: true, beta: 2.0, loss_weight: 1.0}
        loss_dfl: {name: DistributionFocalLoss, loss_weight: 0.25}
        loss_bbox: {name: GIoULoss, loss_weight: 2.0}
    name: NanoDetPlus
    detach_epoch: 10
    aux_head:
      name: SimpleConvHead
      num_classes: 8
      input_channel: 192
      feat_channels: 192
      stacked_convs: 4
      strides: [8, 16, 32, 64]
      activation: LeakyReLU
      reg_max: 7
  weight_averager: {name: ExpMovingAverager, decay: 0.9998}
device:
  precision: 16
  gpu_ids: [0]
  workers_per_gpu: 28
  batchsize_per_gpu: 6
schedule:
  optimizer: {name: AdamW, lr: 0.0003, weight_decay: 0.01}
  warmup: {name: linear, steps: 500, ratio: 0.0001}
  total_epochs: 70
  lr_schedule: {name: CosineAnnealingLR, T_max: 300, eta_min: 5.0e-05}
  val_intervals: 10
log: {interval: 50}
test: {}
grad_clip: 35
evaluator: {name: CocoDetectionEvaluator, save_key: mAP}
class_names: &id001 [person, bicycle, car, motorcycle, bus, truck, traffic_light, stop_sign]
data:
  train:
    name: ArgoverseDataset
    pipeline:
      perspective: 0
      scale: [0.8, 1.2]
      stretch:
      - [0.95, 1.05]
      - [0.95, 1.05]
      rotation: 0
      shear: 0
      translate: 0.1
      flip: 0
      brightness: 0.2
      contrast: [0.6, 1.4]
      saturation: [0.6, 1.2]
      normalize: &id002
      - [123.675, 116.28, 103.53]
      - [58.395, 57.12, 57.375]
    class_names: *id001
    input_size: &id003 [1680, 1050]
    keep_ratio: true
    img_path: ../data/Argoverse-1.1/tracking/train
    ann_path: ../data/Argoverse-HD/annotations/train.json
  val:
    name: ArgoverseDataset
    pipeline:
      normalize: *id002
    class_names: *id001
    input_size: *id003
    keep_ratio: true
    img_path: ../data/Argoverse-1.1/tracking/val
    ann_path: ../data/Argoverse-HD/annotations/val.json
```
Argoverse-HD x1.5

```yml
save_dir: workspace/baseline/argoverse/nanodet-plus-m-1.5x-dgxa100
model:
  arch:
    backbone:
      name: ShuffleNetV2
      model_size: 1.5x
      out_stages: [2, 3, 4]
      activation: LeakyReLU
      channels: 3
    fpn:
      name: GhostPAN
      in_channels: [176, 352, 704]
      out_channels: 128
      kernel_size: 5
      num_extra_level: 1
      use_depthwise: true
      activation: LeakyReLU
    head:
      name: NanoDetPlusHead
      num_classes: 8
      input_channel: 128
      feat_channels: 128
      stacked_convs: 2
      kernel_size: 5
      strides: [8, 16, 32, 64]
      activation: LeakyReLU
      reg_max: 7
      norm_cfg: {type: BN}
      loss:
        loss_qfl: {name: QualityFocalLoss, use_sigmoid: true, beta: 2.0, loss_weight: 1.0}
        loss_dfl: {name: DistributionFocalLoss, loss_weight: 0.25}
        loss_bbox: {name: GIoULoss, loss_weight: 2.0}
    name: NanoDetPlus
    detach_epoch: 10
    aux_head:
      name: SimpleConvHead
      num_classes: 8
      input_channel: 256
      feat_channels: 256
      stacked_convs: 4
      strides: [8, 16, 32, 64]
      activation: LeakyReLU
      reg_max: 7
  weight_averager: {name: ExpMovingAverager, decay: 0.9998}
device:
  precision: 16
  gpu_ids: [0]
  workers_per_gpu: 28
  batchsize_per_gpu: 6
schedule:
  optimizer: {name: AdamW, lr: 0.0003, weight_decay: 0.01}
  warmup: {name: linear, steps: 500, ratio: 0.0001}
  total_epochs: 70
  lr_schedule: {name: CosineAnnealingLR, T_max: 300, eta_min: 5.0e-05}
  val_intervals: 10
log: {interval: 50}
test: {}
grad_clip: 35
evaluator: {name: CocoDetectionEvaluator, save_key: mAP}
class_names: &id001 [person, bicycle, car, motorcycle, bus, truck, traffic_light, stop_sign]
data:
  train:
    name: ArgoverseDataset
    pipeline:
      perspective: 0
      scale: [0.8, 1.2]
      stretch:
      - [0.95, 1.05]
      - [0.95, 1.05]
      rotation: 0
      shear: 0
      translate: 0.1
      flip: 0
      brightness: 0.2
      contrast: [0.6, 1.4]
      saturation: [0.6, 1.2]
      normalize: &id002
      - [123.675, 116.28, 103.53]
      - [58.395, 57.12, 57.375]
    class_names: *id001
    input_size: &id003 [1680, 1050]
    keep_ratio: true
    img_path: ../data/Argoverse-1.1/tracking/train
    ann_path: ../data/Argoverse-HD/annotations/train.json
  val:
    name: ArgoverseDataset
    pipeline:
      normalize: *id002
    class_names: *id001
    input_size: *id003
    keep_ratio: true
    img_path: ../data/Argoverse-1.1/tracking/val
    ann_path: ../data/Argoverse-HD/annotations/val.json
```
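As a sanity check for paired runs like these, it can help to confirm that an x1.0/x1.5 config pair differs only in the intended keys (model_size, channel widths, batch size). A small sketch using only the standard library; the file names in the usage comment are hypothetical:

```python
import difflib

def config_diff(a_text: str, b_text: str) -> list:
    """Return unified-diff lines between two YAML config dumps."""
    return list(difflib.unified_diff(
        a_text.splitlines(), b_text.splitlines(),
        fromfile="x1.0", tofile="x1.5", lineterm=""))

# Usage, e.g.:
# from pathlib import Path
# print("\n".join(config_diff(Path("visem-1.0x.yml").read_text(),
#                             Path("visem-1.5x.yml").read_text())))
```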

Is this erroneous behavior? Is there something wrong with my setup?

RangiLyu commented 11 months ago

This may be because the 1.5x model is not initialized with ImageNet pre-training: https://github.com/RangiLyu/nanodet/blob/3c9607c043cb24523149ffe42a5677601e6da6d0/nanodet/model/backbone/shufflenetv2.py#L9-L10
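The linked lines define the backbone's checkpoint registry. A minimal sketch of the pattern (keys and URLs below are placeholders, not copied from the repo): only model sizes with a registered URL get ImageNet initialization, so a 1.5x model silently trains from scratch.

```python
# Placeholder mapping illustrating the checkpoint registry pattern in
# nanodet/model/backbone/shufflenetv2.py: sizes without a URL cannot be
# initialized from ImageNet weights. (URLs here are not the real ones.)
model_urls = {
    "shufflenetv2_0.5x": "https://example.com/shufflenetv2_x0.5.pth",
    "shufflenetv2_1.0x": "https://example.com/shufflenetv2_x1.0.pth",
    "shufflenetv2_1.5x": None,
    "shufflenetv2_2.0x": None,
}

def has_imagenet_weights(model_size: str) -> bool:
    """Return True if a pretrained checkpoint URL is registered for this size."""
    return model_urls.get(f"shufflenetv2_{model_size}") is not None
```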

raember commented 11 months ago

Hi, and thanks for your response. I don't know how I could have overlooked this, but evidently I did. In the meantime, I tried replicating your metrics on COCO, with this result:

| dataset  | x1.0  | x1.5  | epochs |
|----------|-------|-------|--------|
| COCO2017 | 26.5% | 28.9% | 300    |

[image: COCO2017 mAP curves]

So I also trained the same models on VISEM for 300 epochs. Although the x1.5 model ends up with a higher mAP, that is only because the x1.0 model drops in performance:

| dataset | x1.0 | x1.5 | epochs |
|---------|------|------|--------|
| VISEM   | 5.7% | 5.9% | 300    |

[image: VISEM mAP curves]

My interpretation is that the ShuffleNet backbone struggles with the nature of these images: they are microscopy recordings, to which ImageNet pre-training does not generalize well. I have not yet been able to train on Argoverse-HD for 300 epochs, as that requires a substantial amount of time, but as seen in my initial post it benefits far more from the pre-trained weights. I do wonder whether its x1.0 model will also see a drop in mAP, just like with VISEM.

Now, after looking into the matter of the missing backbone weights, I found the following: the issue "Pre-trained shufflenetv2 checkpoints (x1.5 and x2.0) not being supported" points to an issue about adding more pre-trained weights, which in turn links to a merged PR adding them to the repo. I'll open a PR about this in a bit. At least this is good news, since it should help mitigate the performance disparity.
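Until pre-trained 1.5x weights are wired up in the repo, one possible stopgap is to initialize the backbone from torchvision's ImageNet-pretrained `shufflenet_v2_x1_5`. This is a sketch under the assumption that nanodet's ShuffleNetV2 module names match torchvision's (both derive from the same reference implementation), so a non-strict load maps the shared tensors:

```python
def strip_classifier(state_dict, prefixes=("fc.",)):
    """Drop classifier tensors that a detection backbone does not have."""
    return {k: v for k, v in state_dict.items() if not k.startswith(prefixes)}

# Usage (requires torchvision >= 0.13; downloads the checkpoint):
# from torchvision.models import shufflenet_v2_x1_5, ShuffleNet_V2_X1_5_Weights
# tv = shufflenet_v2_x1_5(weights=ShuffleNet_V2_X1_5_Weights.IMAGENET1K_V1)
# model.backbone.load_state_dict(strip_classifier(tv.state_dict()), strict=False)
```

Checking the `missing`/`unexpected` lists returned by `load_state_dict` is a quick way to verify how many tensors actually matched.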