JDAI-CV / centerX

This repo is implemented based on detectron2 and centernet
Apache License 2.0

KD training #8

Closed: gneworld closed this issue 3 years ago

gneworld commented 3 years ago

Hi, I want to run KD training with yamls/coco/centernet_res18_KD.yaml, but I got the error "exp_results/coco/coco_exp_R50_SGD_0.5/model_final.pth not found!". How do I get this teacher model? Thanks a lot.

CPFLAME commented 3 years ago

You can train a resnet50 model and a resnet18 model first, and then use the KD yaml to train.

gneworld commented 3 years ago

I have trained a resnet50 model with yamls/coco/centernet_res50_coco_0.5.yaml, but I only get "exp_results/coco/coco_exp_R50_SGD_0.5/inference/instances_predictions.pth", not a model_final.pth file. What step am I missing?

CPFLAME commented 3 years ago

Has your training finished? model_final.pth is only saved when the whole training run ends.
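For readers hitting the same "model_final.pth not found!" error, a quick check of the teacher's output directory can show what has actually been saved. The sketch below assumes the run uses detectron2's standard checkpointing conventions: `model_final.pth` written at the end of training, zero-padded periodic files such as `model_0004999.pth`, and a `last_checkpoint` text file naming the most recent save. Whether centerX enables periodic checkpoints for this config is an assumption, and `find_teacher_checkpoint` is a hypothetical helper, not part of either library.

```python
from pathlib import Path
from typing import Optional


def find_teacher_checkpoint(output_dir: str) -> Optional[Path]:
    """Return the most complete checkpoint found in a detectron2-style output dir.

    Hypothetical helper: assumes detectron2's default file naming.
    """
    out = Path(output_dir)

    # Written only once training reaches its final iteration.
    final = out / "model_final.pth"
    if final.is_file():
        return final

    # detectron2's Checkpointer records the latest saved file name here.
    marker = out / "last_checkpoint"
    if marker.is_file():
        candidate = out / marker.read_text().strip()
        if candidate.is_file():
            return candidate

    # Fall back to periodic checkpoints; zero-padded names sort chronologically.
    periodic = sorted(out.glob("model_*.pth"))
    return periodic[-1] if periodic else None


if __name__ == "__main__":
    ckpt = find_teacher_checkpoint("exp_results/coco/coco_exp_R50_SGD_0.5")
    print(ckpt if ckpt else "no checkpoint yet; training has not saved one")
```

An intermediate checkpoint is not necessarily a good teacher, so this is only a way to see how far the run got, not a substitute for finishing it.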

gneworld commented 3 years ago

[12/07 10:49:35 d2.utils.events]: eta: 3 days, 1:46:30 iter: 144619 total_loss: 6.227 loss_cls: 4.281 loss_box_wh: 1.738 loss_off_reg: 0.2514 time: 0.3063 data_time: 0.0710 lr: 0.01 max_mem: 4798M
[12/07 10:49:42 d2.utils.events]: eta: 3 days, 1:53:33 iter: 144639 total_loss: 6.227 loss_cls: 4.257 loss_box_wh: 1.671 loss_off_reg: 0.2595 time: 0.3063 data_time: 0.0763 lr: 0.01 max_mem: 4798M
[12/07 10:49:48 d2.utils.events]: eta: 3 days, 1:53:08 iter: 144659 total_loss: 6.302 loss_cls: 4.289 loss_box_wh: 1.86 loss_off_reg: 0.2583 time: 0.3063 data_time: 0.0554 lr: 0.01 max_mem: 4798M
^C[12/07 10:49:48 d2.engine.hooks]: Overall training speed: 144659 iterations in 12:18:32 (0.3063 s / it)
[12/07 10:49:48 d2.engine.hooks]: Total training time: 12:34:44 (0:16:11 on hooks)
[12/07 10:49:48 d2.utils.events]: eta: 3 days, 1:53:07 iter: 144661 total_loss: 6.351 loss_cls: 4.318 loss_box_wh: 1.863 loss_off_reg: 0.2583 time: 0.3063 data_time: 0.0610 lr: 0.01 max_mem: 4798M

I have trained for more than 140,000 steps and still cannot get model_final.pth.

CPFLAME commented 3 years ago

It seems you will get model_final.pth in about 3 days. It's strange that your loss is so large. What's your mAP now? It looks like you changed the batch size but did not change base_lr. What does your yaml look like? This is my log (image).

gneworld commented 3 years ago

Thanks for your quick reply. I forgot to reduce the lr by the same factor as the batch size, which I changed from 64 to 8.
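The fix described here is the usual linear scaling heuristic: when the batch size shrinks by some factor, the base learning rate is scaled down by the same factor. A minimal sketch follows. `scale_lr` is a hypothetical helper, and treating the `lr: 0.01` in the log above as the yaml's unmodified base LR tuned for batch size 64 is an assumption, not something the thread confirms.

```python
def scale_lr(reference_lr: float, reference_batch: int, new_batch: int) -> float:
    """Scale a learning rate linearly with batch size (a heuristic, not a guarantee)."""
    if reference_batch <= 0 or new_batch <= 0:
        raise ValueError("batch sizes must be positive")
    return reference_lr * new_batch / reference_batch


if __name__ == "__main__":
    # Assumed values: base LR 0.01 tuned for batch size 64, now training with 8.
    new_lr = scale_lr(reference_lr=0.01, reference_batch=64, new_batch=8)
    print(f"suggested base LR for batch size 8: {new_lr:.5f}")
```

Under those assumptions the scaled value is 0.00125. In detectron2-style configs the two settings are normally `SOLVER.BASE_LR` and `SOLVER.IMS_PER_BATCH`; whether the centerX yamls use exactly those keys should be checked against the yaml itself.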