Media-Smart / vedadet

A single-stage object detection toolbox based on PyTorch
Apache License 2.0

Too much training time, any faster training schedule? #5

Closed zehuichen123 closed 3 years ago

zehuichen123 commented 3 years ago

Hi, thanks for this great work. However, the training time is quite long (about 60 h on 6 V100s), which makes it hard for us to verify other ideas on this codebase. Have you ever tried a shorter schedule, and how was the performance?

hxcai commented 3 years ago

@zehuichen123 Wow, since you have 6 V100 GPUs, I think you could modify the configuration:

1. Use a larger batch size with a correspondingly scaled learning rate.
2. If the batch size is large enough, you can change GN to BN or SyncBN.
3. You can evaluate every 30 epochs and stop earlier, because of the SGDR scheduler (see the sketch below).
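A minimal sketch of what those edits might look like in an mmcv-style Python config. The key names below (`samples_per_gpu`, `optimizer`, `norm_cfg`) are assumptions based on common mmcv conventions, not taken from this repo's actual config files:

```python
# Hypothetical mmcv-style config fragment; key names are assumptions,
# check the actual vedadet config files for the real ones.

# 1. Double the per-GPU batch size and scale the lr by the same factor
#    (Linear Scaling Rule, explained later in this thread).
data = dict(samples_per_gpu=8)  # was 4 img/gpu

optimizer = dict(
    type='SGD',
    lr=2 * (2 * 3.75e-3),  # 6-GPU lr scaled again for the doubled batch
    momentum=0.9,
    weight_decay=5e-4,
)

# 2. With a large enough total batch size, swap GN for SyncBN in the heads.
norm_cfg = dict(type='SyncBN', requires_grad=True)
```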

zehuichen123 commented 3 years ago

Thanks for your advice! I'll give it a try. Besides, if I train with 6 GPUs, the learning rate should be adjusted to 2 x 3.75e-3, right?

hxcai commented 3 years ago

@zehuichen123 The default learning rate in the config files is for 3 GPUs and 4 img/gpu (batch size = 3 x 4 = 12). According to the Linear Scaling Rule, you need to set the learning rate proportional to the total batch size if you use a different number of GPUs or images per GPU, e.g., lr = 2 x 3.75e-3 for 6 GPUs x 4 img/gpu.
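For reference, here is a tiny illustrative helper (not part of vedadet) that applies the Linear Scaling Rule with the base setting quoted above:

```python
def scaled_lr(num_gpus, imgs_per_gpu, base_lr=3.75e-3, base_batch=12):
    """Linear Scaling Rule: scale lr proportionally to the total batch size.

    Base setting from this thread: 3 GPUs x 4 img/gpu = batch size 12.
    """
    return base_lr * (num_gpus * imgs_per_gpu) / base_batch

print(scaled_lr(6, 4))  # 0.0075, i.e. 2 x 3.75e-3, matching the advice above
```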

jiankangdeng commented 3 years ago

lr=2 x 3.75e-3 for 6 GPUs * 4 img/gpu

| epoch | easy   | medium | hard   | all-AP |
|-------|--------|--------|--------|--------|
| 30    | 0.9281 | 0.9218 | 0.8903 | 0.780  |
| 60    | 0.9492 | 0.9426 | 0.9130 | 0.805  |
| 90    | 0.9549 | 0.9485 | 0.9194 | 0.812  |
| 120   | 0.9588 | 0.9514 | 0.9225 | 0.817  |
| 150   | 0.9609 | 0.9542 | 0.9258 | 0.817  |
| 180   | 0.9606 | 0.9547 | 0.9272 | 0.818  |
| 210   | 0.9622 | 0.9561 | 0.9281 | 0.822  |
| 240   | 0.9623 | 0.9557 | 0.9290 | 0.823  |
| 270   | 0.9608 | 0.9547 | 0.9288 | 0.823  |
| 300   | 0.9616 | 0.9555 | 0.9276 | 0.823  |
| 330   | 0.9612 | 0.9553 | 0.9294 | 0.823  |
| 360   | 0.9609 | 0.9552 | 0.9289 | 0.824  |
| 390   | 0.9637 | 0.9571 | 0.9294 | 0.824  |
| 420   | 0.9640 | 0.9575 | 0.9303 | 0.824  |
| 450   | 0.9634 | 0.9575 | 0.9303 | 0.826  |
| 480   | 0.9635 | 0.9569 | 0.9290 | 0.823  |
| 510   | 0.9633 | 0.9573 | 0.9303 | 0.826  |
| 540   | 0.9625 | 0.9558 | 0.9295 | 0.824  |
| 570   | 0.9629 | 0.9557 | 0.9293 | 0.824  |
| 600   | 0.9628 | 0.9569 | 0.9304 | 0.826  |
| 630   | 0.9619 | 0.9558 | 0.9287 | 0.824  |

Performance is evaluated by: https://github.com/wondervictor/WiderFace-Evaluation

Results may vary slightly with the above Python evaluation code.

mifan0208 commented 3 years ago

@jiankangdeng Did you train from scratch or fine-tune? I trained from scratch, but hard AP is only 90.3 and the improvement is not obvious.

zehuichen123 commented 3 years ago

> @jiankangdeng Did you train from scratch or fine-tune? I trained from scratch, but hard AP is only 90.3 and the improvement is not obvious.

Only the pretrained res50 backbone from official PyTorch. I got almost the same result as jiankangdeng.

mifan0208 commented 3 years ago

@zehuichen123 Thanks for your reply, I achieved a similar result. But I have another question: the loss oscillates violently. Can you help me understand this?

zehuichen123 commented 3 years ago

@mifan0208 This is caused by the lr schedule. TinaFace adopts a cosine restart (SGDR) schedule, which makes the loss jump every 30 epochs when the learning rate restarts.
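For anyone who wants to reproduce that behaviour in isolation, PyTorch ships a cosine-annealing-with-restarts scheduler. Below is a minimal sketch; the parameter and optimizer are placeholders, and the 30-epoch period mirrors this thread rather than the exact TinaFace config:

```python
import torch

# Placeholder parameter/optimizer just to drive the scheduler.
params = [torch.nn.Parameter(torch.zeros(1))]
optimizer = torch.optim.SGD(params, lr=7.5e-3)

# Restart the cosine cycle every 30 epochs (T_0=30). Each restart jumps the
# lr back to its maximum, which is what makes the training loss spike.
scheduler = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(optimizer, T_0=30)

for epoch in range(90):
    # ... train one epoch ...
    scheduler.step()
    if (epoch + 1) % 30 == 0:
        print(f"epoch {epoch + 1}: lr={optimizer.param_groups[0]['lr']:.6f}")
```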

mifan0208 commented 3 years ago

@zehuichen123 Ok, thanks for your reply.