I trained nanodet-m on COCO with one RTX 3090 (24 GB) GPU and an i9-9900K CPU. Training took about 30 hours using the training configuration in the config folder.
Have you tried to train from scratch with multiple GPUs? I am trying to train from scratch using the default configuration and the COCO dataset, but with 4 GPUs, and the mAP remains very low. Here is my config:
```yaml
save_dir: ../workspace/nanodet_shufflenet_rgb
model:
  arch:
    name: OneStageDetector
    backbone:
      name: ShuffleNetV2
      model_size: 1.0x
      out_stages: [2,3,4]
      activation: LeakyReLU
    fpn:
      name: PAN
      in_channels: [116, 232, 464]
      out_channels: 96
      start_level: 0
      num_outs: 3
    head:
      name: NanoDetHead
      num_classes: 80
      input_channel: 96
      feat_channels: 96
      stacked_convs: 2
      share_cls_reg: True
      octave_base_scale: 5
      scales_per_octave: 1
      strides: [8, 16, 32]
      reg_max: 7
      norm_cfg:
        type: BN
      loss:
        loss_qfl:
          name: QualityFocalLoss
          use_sigmoid: True
          beta: 2.0
          loss_weight: 1.0
        loss_dfl:
          name: DistributionFocalLoss
          loss_weight: 0.25
        loss_bbox:
          name: GIoULoss
          loss_weight: 2.0
data:
  train:
    name: coco
    img_path: /media/work/Data/raw/public_datasets/coco/train2017
    ann_path: /media/work/Data/raw/public_datasets/coco/annotations/stuff_train2017.json
    input_size: [320,320] #[w,h]
    keep_ratio: True
    pipeline:
      perspective: 0.0
      scale: [0.6, 1.4]
      stretch: [[1, 1], [1, 1]]
      rotation: 0
      shear: 0
      translate: 0.2
      flip: 0.5
      brightness: 0.2
      contrast: [0.8, 1.2]
      saturation: [0.8, 1.2]
      normalize: [[103.53, 116.28, 123.675], [57.375, 57.12, 58.395]]
  val:
    name: coco
    img_path: /media/work/Data/raw/public_datasets/coco/val2017 # TODO changed
    ann_path: /media/work/Data/raw/public_datasets/coco/annotations/stuff_val2017.json # TODO changed
    input_size: [320,320] #[w,h]
    keep_ratio: True
    pipeline:
      normalize: [[103.53, 116.28, 123.675], [57.375, 57.12, 58.395]]
device: # TODO changed
  gpu_ids: [0,1,2,3]
  workers_per_gpu: 6
  batchsize_per_gpu: 128 # 128
schedule:
  resume:
  load_model: /media/work/Workspaces/CAO/nanodet/demo/shufflenetv2_x1-5666bf0f80.pth
  optimizer:
    name: SGD
    lr: 0.14
    momentum: 0.9
    weight_decay: 0.0001
  warmup:
    name: linear
    steps: 300
    ratio: 0.1
  total_epochs: 190 #190
  lr_schedule:
    name: MultiStepLR
    milestones: [130,160,175,185]
    gamma: 0.1
  val_intervals: 10
evaluator:
  name: CocoDetectionEvaluator
  save_key: mAP
log:
  interval: 10
class_names: ['person', 'bicycle', 'car', 'motorcycle', 'airplane', 'bus', 'train', 'truck', 'boat', 'traffic_light', 'fire_hydrant', 'stop_sign', 'parking_meter', 'bench', 'bird', 'cat', 'dog', 'horse', 'sheep', 'cow', 'elephant', 'bear', 'zebra', 'giraffe', 'backpack', 'umbrella', 'handbag', 'tie', 'suitcase', 'frisbee', 'skis', 'snowboard', 'sports_ball', 'kite', 'baseball_bat', 'baseball_glove', 'skateboard', 'surfboard', 'tennis_racket', 'bottle', 'wine_glass', 'cup', 'fork', 'knife', 'spoon', 'bowl', 'banana', 'apple', 'sandwich', 'orange', 'broccoli', 'carrot', 'hot_dog', 'pizza', 'donut', 'cake', 'chair', 'couch', 'potted_plant', 'bed', 'dining_table', 'toilet', 'tv', 'laptop', 'mouse', 'remote', 'keyboard', 'cell_phone', 'microwave', 'oven', 'toaster', 'sink', 'refrigerator', 'book', 'clock', 'vase', 'scissors', 'teddy_bear', 'hair_drier', 'toothbrush']
```
There may be some bugs introduced by the PyTorch Lightning trainer. Try tools/deprecated/train.py to see if the problem persists.
I have now tried training with:
- Script: tools/deprecated/train.py
- Config File: nanodet-m.yml
- Pretrained Backbone: shufflenetv2_x1-5666bf0f80.pth
- Data: COCO train2017 (18GB) & val2017 (1GB), with annotations "2017 Train/Val annotations [241MB]" instances_train2017.json & instances_val2017.json.
- 1 GPU
EDIT: it now converges with tools/deprecated/train.py; the problem was an inconsistency on my side. Training with the new train.py script (multi-GPU) is still not converging, so, as you mentioned, there might be some bugs with PyTorch Lightning.
Thank you!
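The config posted above points at the stuff_*.json annotation files, while this run uses instances_*.json. If you hit a similar mismatch, here is a minimal sanity check (an editor's sketch using pycocotools, not from the thread; the path is illustrative) to confirm that an annotation file is the 80-class detection set:

```python
# Editor's sketch: check that an annotation file is the detection
# ("instances_*") set, not the "stuff_*" set. Path is illustrative.
from pycocotools.coco import COCO

ann_path = "/media/work/Data/raw/public_datasets/coco/annotations/instances_train2017.json"

coco = COCO(ann_path)
cats = coco.loadCats(coco.getCatIds())

print(f"{len(cats)} categories")      # expect 80 for instances_* files
print([c["name"] for c in cats[:5]])  # e.g. ['person', 'bicycle', 'car', ...]
```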
Both single-GPU and DDP mode with 2 GPUs converge when training with tools/deprecated/train.py:
- single GPU (lr 0.14 & batch_size 192) reaches mAP 0.2047 at epoch 280
- DDP with 2 GPUs (lr 0.14 & batch_size 96 per GPU) reaches mAP 0.2034 at epoch 280

However, distributed training with PyTorch Lightning (tools/train.py) does not converge, reporting mAP 0.074 at epoch 190.
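Note that the two converging runs above have the same effective batch size (1 × 192 = 2 × 96 = 192), so the same lr of 0.14 applies to both. A tiny sketch of that bookkeeping, with the linear lr-scaling rule included as a common heuristic (an assumption, not something this thread prescribes):

```python
# Editor's sketch: effective batch size for the two runs above, plus the
# common linear lr-scaling heuristic (an assumption, not from this thread).
def effective_batch_size(batch_per_gpu: int, num_gpus: int) -> int:
    return batch_per_gpu * num_gpus

def scaled_lr(base_lr: float, base_batch: int, new_batch: int) -> float:
    # Linear scaling rule: lr grows proportionally with the total batch size.
    return base_lr * new_batch / base_batch

single_gpu = effective_batch_size(192, 1)  # 192
ddp_2gpu = effective_batch_size(96, 2)     # 192 -> same total batch

assert single_gpu == ddp_2gpu              # so the same lr (0.14) applies
print(scaled_lr(0.14, 192, ddp_2gpu))      # 0.14
```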
Thanks for your work. I'm sorry that I did not test the multi-GPU training with PyTorch Lightning because I currently only have one GPU. I'm not sure whether it is a PyTorch Lightning bug. I'd be grateful if anyone can find out what's going wrong with the new trainer.
Could you help test the checkpoint trained with Lightning's multi-GPU backend on one GPU, to find out whether the bug is in the training process or in the evaluation process?
I found that when using pytorch-lightning's DDP mode, the validation_epoch_end function does not gather the results from all GPUs, which causes the validation bug. The training process itself seems to be okay. See the similar issue: https://github.com/PyTorchLightning/pytorch-lightning/issues/7697
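For context, here is a minimal sketch of the kind of cross-rank gathering that is missing (an editor's illustration using torch.distributed.all_gather_object; the class name, the shape of the step outputs, and self.evaluator are hypothetical, and this is not the code of the actual fix):

```python
# Editor's sketch: gather per-rank validation results before COCO evaluation.
# Class, output format, and evaluator are hypothetical; not the actual fix.
import torch.distributed as dist
import pytorch_lightning as pl


class DetectionTask(pl.LightningModule):
    def validation_epoch_end(self, validation_step_outputs):
        # Merge this rank's per-batch result dicts ({image_id: detections}).
        results = {}
        for batch_results in validation_step_outputs:
            results.update(batch_results)

        # Under DDP each process only sees its own shard of the val set,
        # so gather every rank's results before computing mAP.
        if dist.is_available() and dist.is_initialized():
            shards = [None] * dist.get_world_size()
            dist.all_gather_object(shards, results)
            results = {k: v for shard in shards for k, v in shard.items()}

        # Evaluate once, over the full result set, on rank 0 only.
        if self.global_rank == 0:
            metrics = self.evaluator.evaluate(results)  # hypothetical evaluator
            self.print("val metrics:", metrics)
```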
Fixed in #254
Hi, sorry for the late reply. The validation result is mAP 0.14 with batch_size 96 or 192, both on a single GPU, which is the same as with the deprecated training settings. You are right that PyTorch Lightning's DDP mode does have validation bugs.
Thanks for your great work, and thanks for fixing the bug!
Hi, I'm training the NanoDet-m model (ShuffleNetV2 1.0x | 320*320) from scratch on the COCO dataset with 4 GeForce RTX 2080 Ti GPUs. Convergence seems pretty slow; it could take 1-2 weeks.
May I ask how long it took you to reach 20.6 mAP, and which setup you used?
Thank you.