I trained nanodet-m on COCO with one RTX 3090 (24 GB) GPU and an i9-9900K CPU. Training took about 30 hours using the training configuration in the config folder.
Have you tried to train from scratch with multiple GPUs? I am trying to train from scratch using the default configuration and the COCO dataset, but with 4 GPUs, and the mAP remains very low. Here is my config:
```yaml
save_dir: ../workspace/nanodet_shufflenet_rgb
model:
  arch:
    name: OneStageDetector
    backbone:
      name: ShuffleNetV2
      model_size: 1.0x
      out_stages: [2,3,4]
      activation: LeakyReLU
    fpn:
      name: PAN
      in_channels: [116, 232, 464]
      out_channels: 96
      start_level: 0
      num_outs: 3
    head:
      name: NanoDetHead
      num_classes: 80
      input_channel: 96
      feat_channels: 96
      stacked_convs: 2
      share_cls_reg: True
      octave_base_scale: 5
      scales_per_octave: 1
      strides: [8, 16, 32]
      reg_max: 7
      norm_cfg:
        type: BN
      loss:
        loss_qfl:
          name: QualityFocalLoss
          use_sigmoid: True
          beta: 2.0
          loss_weight: 1.0
        loss_dfl:
          name: DistributionFocalLoss
          loss_weight: 0.25
        loss_bbox:
          name: GIoULoss
          loss_weight: 2.0
data:
  train:
    name: coco
    img_path: /media/work/Data/raw/public_datasets/coco/train2017
    ann_path: /media/work/Data/raw/public_datasets/coco/annotations/stuff_train2017.json
    input_size: [320,320] #[w,h]
    keep_ratio: True
    pipeline:
      perspective: 0.0
      scale: [0.6, 1.4]
      stretch: [[1, 1], [1, 1]]
      rotation: 0
      shear: 0
      translate: 0.2
      flip: 0.5
      brightness: 0.2
      contrast: [0.8, 1.2]
      saturation: [0.8, 1.2]
      normalize: [[103.53, 116.28, 123.675], [57.375, 57.12, 58.395]]
  val:
    name: coco
    img_path: /media/work/Data/raw/public_datasets/coco/val2017 # TODO changed
    ann_path: /media/work/Data/raw/public_datasets/coco/annotations/stuff_val2017.json # TODO changed
    input_size: [320,320] #[w,h]
    keep_ratio: True
    pipeline:
      normalize: [[103.53, 116.28, 123.675], [57.375, 57.12, 58.395]]
device: # TODO changed
  gpu_ids: [0,1,2,3]
  workers_per_gpu: 6
  batchsize_per_gpu: 128 # 128
schedule:
  resume:
  load_model: /media/work/Workspaces/CAO/nanodet/demo/shufflenetv2_x1-5666bf0f80.pth
  optimizer:
    name: SGD
    lr: 0.14
    momentum: 0.9
    weight_decay: 0.0001
  warmup:
    name: linear
    steps: 300
    ratio: 0.1
  total_epochs: 190 #190
  lr_schedule:
    name: MultiStepLR
    milestones: [130,160,175,185]
    gamma: 0.1
  val_intervals: 10
evaluator:
  name: CocoDetectionEvaluator
  save_key: mAP
log:
  interval: 10
class_names: ['person', 'bicycle', 'car', 'motorcycle', 'airplane', 'bus', 'train', 'truck', 'boat', 'traffic_light', 'fire_hydrant', 'stop_sign', 'parking_meter', 'bench', 'bird', 'cat', 'dog', 'horse', 'sheep', 'cow', 'elephant', 'bear', 'zebra', 'giraffe', 'backpack', 'umbrella', 'handbag', 'tie', 'suitcase', 'frisbee', 'skis', 'snowboard', 'sports_ball', 'kite', 'baseball_bat', 'baseball_glove', 'skateboard', 'surfboard', 'tennis_racket', 'bottle', 'wine_glass', 'cup', 'fork', 'knife', 'spoon', 'bowl', 'banana', 'apple', 'sandwich', 'orange', 'broccoli', 'carrot', 'hot_dog', 'pizza', 'donut', 'cake', 'chair', 'couch', 'potted_plant', 'bed', 'dining_table', 'toilet', 'tv', 'laptop', 'mouse', 'remote', 'keyboard', 'cell_phone', 'microwave', 'oven', 'toaster', 'sink', 'refrigerator', 'book', 'clock', 'vase', 'scissors', 'teddy_bear', 'hair_drier', 'toothbrush']
```
There may be some bugs introduced by the PyTorch Lightning trainer. Try tools/deprecated/train.py to see if the problem persists.
I have now tried training with:
- Script: tools/deprecated/train.py
- Config File: nanodet-m.yml
- Pretrained Backbone: shufflenetv2_x1-5666bf0f80.pth
- Data: COCO train2017 (18GB) & val2017 (1GB), with annotations "2017 Train/Val annotations [241MB]" instances_train2017.json & instances_val2017.json.
- 1 GPU
EDIT: it now converges with tools/deprecated/train.py; the problem was an inconsistency on my side. Training with the new train.py script (multi-GPU) is still not converging, so, as you mentioned, there might be some bugs with PyTorch Lightning.
Thank you!
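The config posted above points at the stuff_*.json annotation files, while this run uses instances_*.json. If you hit a similar mismatch, here is a minimal sanity check (an editor's sketch using pycocotools, not from the thread; the path is illustrative) to confirm that an annotation file is the 80-class detection set:

```python
# Editor's sketch: check that an annotation file is the detection
# ("instances_*") set, not the "stuff_*" set. Path is illustrative.
from pycocotools.coco import COCO

ann_path = "/media/work/Data/raw/public_datasets/coco/annotations/instances_train2017.json"

coco = COCO(ann_path)
cats = coco.loadCats(coco.getCatIds())

print(f"{len(cats)} categories")      # expect 80 for instances_* files
print([c["name"] for c in cats[:5]])  # e.g. ['person', 'bicycle', 'car', ...]
```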
Both single-GPU and DDP mode with 2 GPUs converge when training with tools/deprecated/train.py:
- single GPU (lr 0.14 & batch_size 192) reaches mAP 0.2047 at epoch 280
- DDP with 2 GPUs (lr 0.14 & batch_size 96 per GPU) reaches mAP 0.2034 at epoch 280

However, distributed training with PyTorch Lightning (tools/train.py) does not converge, reporting mAP 0.074 at epoch 190.
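Note that the two converging runs above have the same effective batch size (1 × 192 = 2 × 96 = 192), so the same lr of 0.14 applies to both. A tiny sketch of that bookkeeping, with the linear lr-scaling rule included as a common heuristic (an assumption, not something this thread prescribes):

```python
# Editor's sketch: effective batch size for the two runs above, plus the
# common linear lr-scaling heuristic (an assumption, not from this thread).
def effective_batch_size(batch_per_gpu: int, num_gpus: int) -> int:
    return batch_per_gpu * num_gpus

def scaled_lr(base_lr: float, base_batch: int, new_batch: int) -> float:
    # Linear scaling rule: lr grows proportionally with the total batch size.
    return base_lr * new_batch / base_batch

single_gpu = effective_batch_size(192, 1)  # 192
ddp_2gpu = effective_batch_size(96, 2)     # 192 -> same total batch

assert single_gpu == ddp_2gpu              # so the same lr (0.14) applies
print(scaled_lr(0.14, 192, ddp_2gpu))      # 0.14
```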
Thanks for your work. I'm sorry that I did not test the multi-GPU training with PyTorch Lightning because I currently only have one GPU. I'm not sure whether it is a PyTorch Lightning bug. I'd be grateful if anyone can find out what's going wrong with the new trainer.
Could you help test the checkpoint trained with Lightning's multi-GPU backend on one GPU, to find out whether the bug is in the training process or in the evaluation process?
I found that when using pytorch-lightning's DDP mode, the validation_epoch_end function does not gather the results from all GPUs, which causes the validation bug. The training process itself seems to be okay. See the similar issue: https://github.com/PyTorchLightning/pytorch-lightning/issues/7697
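For context, here is a minimal sketch of the kind of cross-rank gathering that is missing (an editor's illustration using torch.distributed.all_gather_object; the class name, the shape of the step outputs, and self.evaluator are hypothetical, and this is not the code of the actual fix):

```python
# Editor's sketch: gather per-rank validation results before COCO evaluation.
# Class, output format, and evaluator are hypothetical; not the actual fix.
import torch.distributed as dist
import pytorch_lightning as pl


class DetectionTask(pl.LightningModule):
    def validation_epoch_end(self, validation_step_outputs):
        # Merge this rank's per-batch result dicts ({image_id: detections}).
        results = {}
        for batch_results in validation_step_outputs:
            results.update(batch_results)

        # Under DDP each process only sees its own shard of the val set,
        # so gather every rank's results before computing mAP.
        if dist.is_available() and dist.is_initialized():
            shards = [None] * dist.get_world_size()
            dist.all_gather_object(shards, results)
            results = {k: v for shard in shards for k, v in shard.items()}

        # Evaluate once, over the full result set, on rank 0 only.
        if self.global_rank == 0:
            metrics = self.evaluator.evaluate(results)  # hypothetical evaluator
            self.print("val metrics:", metrics)
```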
Fixed in #254
Hi, sorry for the late reply. The validation result is mAP 0.14 with batch_size 96 or 192, both on a single GPU, which is the same as with the deprecated training settings. You are right that PyTorch Lightning's DDP mode does have validation bugs.
Thanks for your great work, and thanks for fixing the bug!
Hi, I'm training the NanoDet-m model (ShuffleNetV2 1.0x | 320*320) from scratch on the COCO dataset with 4 GeForce RTX 2080 Ti GPUs. Convergence seems pretty slow; it could take 1-2 weeks.
May I ask how long it took you to reach 20.6 mAP, and which setup you used?
Thank you.