ShuangXieIrene / ssds.pytorch

Repository for the Single Shot MultiBox Detector and its variants, implemented with PyTorch and Python 3.
MIT License

Loss is not decreasing #43

Open lucasjinreal opened 5 years ago

lucasjinreal commented 5 years ago

I have trained SSD with MobileNetV2 on VOC, but after almost 500 epochs the loss still looks like this:

517/518 in 0.154s [##########] | loc_loss: 1.4773 cls_loss: 2.3165

==>Train: || Total_time: 79.676s || loc_loss: 1.1118 conf_loss: 2.3807 || lr: 0.000721

Wrote snapshot to: ./experiments/models/ssd_mobilenet_v2_voc/ssd_lite_mobilenet_v2_voc_epoch_525.pth
Epoch 526/1300:
0/518 in 0.193s [----------] | loc_loss: 0.8291 cls_loss: 1.9464
1/518 in 0.186s [----------] | loc_loss: 1.3181 cls_loss: 2.5404
2/518 in 0.184s [----------] | loc_loss: 1.0371 cls_loss: 2.2243

It doesn't change, and the loss is very high... What's the problem with the implementation?

1453042287 commented 5 years ago

Did you load the pre-trained weights? It works fine with my dataset.

1453042287 commented 5 years ago

Or maybe you didn't set the mode to 'train' rather than 'test' in the config file.

blueardour commented 5 years ago

@jinfagang Have you solved the problem? I have the same issue.

@1453042287 I trained yolov2-mobilenet-v2 from scratch. You mentioned a 'pre-trained model'; do you mean the pre-trained backbone network (such as MobileNetV2), or both the backbone and the detection model? In my training, none of the parameters were pre-trained.

1453042287 commented 5 years ago

@blueardour First, make sure you change PHASE in the .yml file to 'train'. Beyond that, I believe it's inappropriate to train a model from scratch, so at the very least you should load a pre-trained backbone. I simply use the full pre-trained weights the author provides (backbone, extras, and so on), but set RESUME_SCOPE in the .yml file to 'base' only, and the result is almost the same as fine-tuning.
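In code, resuming only the 'base' scope amounts to filtering the checkpoint by parameter-name prefix before loading. A minimal sketch (a hypothetical helper, not the repo's exact loader; `scopes` mirrors the RESUME_SCOPE string):

```python
import torch

def load_scope(model, checkpoint_path, scopes=('base',)):
    # Keep only checkpoint tensors whose name starts with a resumed
    # scope, e.g. 'base' for the backbone; everything else is dropped.
    state = torch.load(checkpoint_path, map_location='cpu')
    kept = {k: v for k, v in state.items()
            if any(k.startswith(s) for s in scopes)}
    # strict=False leaves every non-resumed parameter at its initialization.
    model.load_state_dict(kept, strict=False)
```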

blueardour commented 5 years ago

@1453042287 Hi, thanks for the advice. My current training seems to be working. In my previous training I put 'base', 'loc', and so on all in the trainable_scope, and it did not give a good result. After reloading only 'base' and retraining the other parameters, I recovered the expected precision.

My only remaining problem is test speed: the NMS in the test procedure is very slow. It has been discussed in https://github.com/ShuangXieIrene/ssds.pytorch/issues/16, but with no good solution yet.
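For anyone hitting this, one possible workaround is to swap the Python NMS loop for the compiled kernel in torchvision. A minimal sketch, assuming torchvision >= 0.3 is installed (this is not the repo's own implementation):

```python
import torch
from torchvision.ops import nms  # compiled C++/CUDA kernel

def fast_nms(boxes, scores, iou_thr=0.6, top_k=100):
    # boxes: (N, 4) tensor in xyxy format; scores: (N,) tensor for one class.
    # Returns the indices of the kept boxes, sorted by descending score.
    keep = nms(boxes, scores, iou_thr)
    return keep[:top_k]
```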

cvtower commented 5 years ago

@blueardour Hi, below is my test result for fssd_mobilenet_v2 on coco2017, using my own config files instead of the given one and training from scratch without any pre-trained model. Should I reload only the 'base' parameters here?


 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.211
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.358
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.217
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.044
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.234
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.351
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.216
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.343
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.371
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.099
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.428
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.590
cvtower commented 5 years ago

OK... it seems training from scratch might not be well supported. But I just want to use this repo to verify my network architecture, and my ImageNet pre-trained model is still training.

blueardour commented 5 years ago

Yes, setting all parameters to trainable seems to make convergence hard. This year, Kaiming He published a paper titled 'Rethinking ImageNet Pre-training', which claims that pre-training on ImageNet is not necessary. However, it takes some skill to give the network a good initialization.
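For what it's worth, when no pre-trained weights are used, a common initialization recipe looks like the sketch below (a generic recipe, not something this repo applies for you):

```python
import torch.nn as nn

def init_from_scratch(model):
    # Kaiming init for convolutions, identity-like init for batch norm;
    # a common starting point when training a detector from scratch.
    for m in model.modules():
        if isinstance(m, nn.Conv2d):
            nn.init.kaiming_normal_(m.weight, mode='fan_out', nonlinearity='relu')
            if m.bias is not None:
                nn.init.zeros_(m.bias)
        elif isinstance(m, nn.BatchNorm2d):
            nn.init.ones_(m.weight)
            nn.init.zeros_(m.bias)
```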

cvtower commented 5 years ago

Yes, I agree with you. I read that paper the day it was published. My own network design outperforms several networks (on ImageNet/CIFAR, etc.), although its ImageNet training is still in progress (72.5 1.0). I have also verified the network on other tasks and it works fine, so I believe it will get better results on detection and segmentation tasks too. Personally, I largely agree with the views in 'DetNet' and 'Rethinking ImageNet Pre-training'; however, it seems that much more computation and task-specific tuning skill are needed. Until my ImageNet training finishes, I will have to compare SSD performance using models trained from scratch first.

blueardour commented 5 years ago

Hi, @1453042287 @cvtower

I have another question about the training precision and loss curves. Below is the result from tensorboardX.

(screenshot: tensorboardX precision and loss curves)

You can see that the precision increases slowly and then jumps at around the 89th epoch. I don't know why the precision changes so dramatically at that point; the loc and cls losses, as well as the learning rate, do not seem to change much there. Did you observe a similar phenomenon, or do you have any explanation for it?
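One thing I can try is dumping the LR schedule to see whether the jump lines up with an SGDR warm restart. A quick sketch, assuming torch's CosineAnnealingWarmRestarts approximates the repo's SGDR (T_0 and T_mult here are guesses, not the repo's values):

```python
import torch
from torch.optim.lr_scheduler import CosineAnnealingWarmRestarts

# A dummy parameter, just to drive the scheduler.
net = torch.nn.Linear(1, 1)
opt = torch.optim.SGD(net.parameters(), lr=0.001, momentum=0.9)
sched = CosineAnnealingWarmRestarts(opt, T_0=30, T_mult=2)
for epoch in range(120):
    sched.step(epoch)
    print(epoch, opt.param_groups[0]['lr'])  # a warm restart shows up as an LR jump
```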

cvtower commented 5 years ago

Hi @blueardour,

I did not use the CosineAnnealing LR schedule, and no such phenomenon happened during my training.

XiaSunny commented 5 years ago

Hello, I'd like to ask how you obtained the pre-trained weight files the author provides. I don't have a weight directory, so I have no pre-trained weight file. Did you get them some other way? Thank you! @1453042287

1453042287 commented 5 years ago

@XiaSunny Download them... they're in this repo's README, the blue link text.

XiaSunny commented 5 years ago

@1453042287 OK, thank you.

XiaSunny commented 5 years ago

Hello, I'm using the config file fssd_vgg16_train_coco.yml. When I train on coco2017, conf_loss stays around 5 and loc_loss around 2 and never goes down. My config file is as follows:

    MODEL:
      SSDS: fssd
      NETS: vgg16
      IMAGE_SIZE: [300, 300]
      NUM_CLASSES: 81
      FEATURE_LAYER: [[[22, 34, 'S'], [512, 1024, 512]], [['', 'S', 'S', 'S', '', ''], [512, 512, 256, 256, 256, 256]]]
      STEPS: [[8, 8], [16, 16], [32, 32], [64, 64], [100, 100], [300, 300]]
      SIZES: [[30, 30], [60, 60], [111, 111], [162, 162], [213, 213], [264, 264], [315, 315]]
      ASPECT_RATIOS: [[1, 2, 3], [1, 2, 3], [1, 2, 3], [1, 2, 3], [1, 2], [1, 2]]

    TRAIN:
      MAX_EPOCHS: 500
      CHECKPOINTS_EPOCHS: 1
      BATCH_SIZE: 28
      TRAINABLE_SCOPE: 'norm,extras,transforms,pyramids,loc,conf'
      RESUME_SCOPE: 'base'
      OPTIMIZER:
        OPTIMIZER: sgd
        LEARNING_RATE: 0.001
        MOMENTUM: 0.9
        WEIGHT_DECAY: 0.0001
      LR_SCHEDULER:
        SCHEDULER: SGDR
        WARM_UP_EPOCHS: 150

    TEST:
      BATCH_SIZE: 64
      TEST_SCOPE: [90, 100]

    MATCHER:
      MATCHED_THRESHOLD: 0.5
      UNMATCHED_THRESHOLD: 0.5
      NEGPOS_RATIO: 3

    POST_PROCESS:
      SCORE_THRESHOLD: 0.01
      IOU_THRESHOLD: 0.6
      MAX_DETECTIONS: 100

    DATASET:
      DATASET: 'coco'
      DATASET_DIR: '/home/chase/Downloads/ssds.pytorch-master/data/coco'
      TRAIN_SETS: [['2017', 'train']]
      TEST_SETS: [['2017', 'val']]
      PROB: 0.6

    EXP_DIR: './experiments/models/fssd_vgg16_coco'
    LOG_DIR: './experiments/models/fssd_vgg16_coco'
    RESUME_CHECKPOINT: '/home/chase/Downloads/ssds.pytorch-master/weight/vgg16_fssd_coco_27.2.pth'
    PHASE: ['train']

I also tried RESUME_CHECKPOINT: vgg16_reducedfc.pth, but the result was about the same. This problem has been bothering me for a long time and I don't know what's going on. I hope you can give me some pointers. @1453042287 @blueardour @cvtower

Damon2019 commented 4 years ago

@XiaSunny Hello, I've run into the same problem. Did you ever solve it?

Damon2019 commented 4 years ago

@1453042287 @XiaSunny Hello, I want to use a pre-trained model. With

    TRAINABLE_SCOPE: 'base,norm,extras,loc,conf'
    RESUME_SCOPE: 'base,norm,extras,loc,conf'

how should I modify these parameters? Thanks!

XiaSunny commented 4 years ago

@Damon2019 TRAINABLE_SCOPE is the set of parameters to be trained; RESUME_SCOPE is the set of parameters to restore from the pre-trained model. First, you should drop conf from RESUME_SCOPE (because the number of classes differs); for the rest, check against your actual setup whether anything else needs changing.
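Concretely, dropping conf from what gets restored looks roughly like this sketch (hypothetical code, not the repo's loader; the checkpoint is assumed to be a flat state_dict):

```python
import torch

# Restore everything except the 'conf' head, whose tensor shapes depend
# on NUM_CLASSES and therefore cannot be reused across class counts.
state = torch.load('vgg16_fssd_coco_27.2.pth', map_location='cpu')
filtered = {k: v for k, v in state.items() if not k.startswith('conf')}
# With 'model' built for the new class count:
# model.load_state_dict(filtered, strict=False)
```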

Bobby2090 commented 3 years ago

Hello, I recently hit the same problem during training: the loss does not decrease and stays around 4. I downloaded the model, made no modifications, and only reloaded 'base' for training. Could you tell me how you finally solved it? Many thanks!