chenjun2hao / CenterFace.pytorch

unofficial version of centerface, which achieves the best balance between speed and accuracy at face detection
236 stars 60 forks

Loss becomes NaN #24

Closed loveandhope closed 4 years ago

loveandhope commented 4 years ago

I trained with the latest code on the WIDER FACE dataset (the annotation files were downloaded from the Baidu Netdisk link recommended on the repo home page; the val annotation file appears to contain no image information). After 5 epochs the loss becomes NaN and stays NaN from then on. What could be the problem? The log is as follows:

2020-05-28-20-43: epoch: 1 |hm_loss 134.458721 | wh_loss 0.675813 | lm_loss 2.126231 | time 2.216667 | loss 134.929304 | off_loss 0.190378 |
2020-05-28-20-45: epoch: 2 |hm_loss 8.054886 | wh_loss 0.578012 | lm_loss 2.000603 | time 2.200000 | loss 8.407020 | off_loss 0.094272 |
2020-05-28-20-47: epoch: 3 |hm_loss 3.317343 | wh_loss 0.546710 | lm_loss 1.981100 | time 2.200000 | loss 3.659947 | off_loss 0.089822 |
2020-05-28-20-49: epoch: 4 |hm_loss 1.972609 | wh_loss 0.476618 | lm_loss 2.065705 | time 2.200000 | loss 2.307610 | off_loss 0.080769 |
2020-05-28-20-51: epoch: 5 |hm_loss 1.477579 | wh_loss 0.284628 | lm_loss 1.835134 | time 2.200000 | loss 1.764335 | off_loss 0.074779 |
hm_loss 1.495690 | wh_loss 0.301853 | lm_loss 1.877964 | time 3.700000 | loss 1.790609 | off_loss 0.076937 |
2020-05-28-20-57: epoch: 6 |hm_loss nan | wh_loss nan | lm_loss nan | time 2.166667 | loss nan | off_loss nan |
2020-05-28-20-59: epoch: 7 |hm_loss nan | wh_loss nan | lm_loss nan | time 2.183333 | loss nan | off_loss nan |
2020-05-28-21-02: epoch: 8 |hm_loss nan | wh_loss nan | lm_loss nan | time 2.183333 | loss nan | off_loss nan |
2020-05-28-21-04: epoch: 9 |hm_loss nan | wh_loss nan | lm_loss nan | time 2.183333 | loss nan | off_loss nan |
2020-05-28-21-06: epoch: 10 |hm_loss nan | wh_loss nan | lm_loss nan | time 2.183333 | loss nan | off_loss nan |
hm_loss nan | wh_loss nan | lm_loss nan | time 3.700000 | loss nan | off_loss nan |
2020-05-28-21-12: epoch: 11 |hm_loss nan | wh_loss nan | lm_loss nan | time 2.166667 | loss nan | off_loss nan |

The configuration is as follows:

==> torch version: 1.2.0
==> cudnn version: 7602
==> Cmd: ['main.py', '--input_res', '512']
==> Opt:
  K: 200
  aggr_weight: 0.0
  agnostic_ex: False
  arch: mobilev2_10
  aug_ddd: 0.5
  aug_rot: 0
  batch_size: 8
  cat_spec_wh: False
  center_thresh: 0.1
  chunk_sizes: [15]
  data_dir: /home/rji/workspace/CenterFace/src/lib/../../data
  dataset: facehp
  debug: 0
  debug_dir: /home/rji/workspace/CenterFace/src/lib/../../exp/multi_pose/mobilev2_10/debug
  debugger_theme: white
  demo: /home/yangna/data/WIDER_FACE/WIDER_train/images/0--Parade/0_Parade_marchingband_1_80.jpg
  dense_hp: False
  dense_wh: False
  dep_weight: 1
  dim_weight: 1
  down_ratio: 4
  eval_oracle_dep: False
  eval_oracle_hm: False
  eval_oracle_hmhp: False
  eval_oracle_hp_offset: False
  eval_oracle_kps: False
  eval_oracle_offset: False
  eval_oracle_wh: False
  exp_dir: /home/rji/workspace/CenterFace/src/lib/../../exp/multi_pose
  exp_id: dla
  fix_res: True
  flip: 0.5
  flip_idx: [[0, 1], [3, 4]]
  flip_test: False
  gpus: [0]
  gpus_str: 0
  head_conv: 64
  heads: {'landmarks': 10, 'hm_offset': 2, 'wh': 2, 'hm': 1}
  hide_data_time: False
  hm_hp: True
  hm_hp_weight: 1
  hm_weight: 1
  input_h: 512
  input_res: 512
  input_w: 512
  keep_res: False
  kitti_split: 3dop
  lm_weight: 0.1
  load_model:
  lr: 0.000125
  lr_step: [30, 80]
  master_batch_size: 15
  mean: [[[0.40789655 0.44719303 0.47026116]]]
  metric: loss
  mse_loss: False
  nms: False
  no_color_aug: False
  norm_wh: False
  not_cuda_benchmark: False
  not_hm_hp: False
  not_prefetch_test: False
  not_rand_crop: False
  not_reg_bbox: False
  not_reg_hp_offset: False
  not_reg_offset: False
  num_classes: 1
  num_epochs: 140
  num_iters: -1
  num_stacks: 1
  num_workers: 4
  off_weight: 1
  output_h: 128
  output_res: 128
  output_video: ../output/res_3.mp4
  output_w: 128
  pad: 31
  peak_thresh: 0.2
  print_iter: 0
  rect_mask: False
  reg_bbox: True
  reg_hp_offset: True
  reg_loss: sl1
  reg_offset: True
  resume: False
  root_dir: /home/rji/workspace/CenterFace/src/lib/../..
  rot_weight: 1
  rotate: 0
  save_all: True
  save_dir: /home/rji/workspace/CenterFace/src/lib/../../exp/multi_pose/mobilev2_10
  scale: 0.4
  scores_thresh: 0.1
  seed: 317
  shift: 0.1
  std: [[[0.2886383 0.27408165 0.27809834]]]
  task: multi_pose
  test: False
  test_scales: [1.0]
  train_json: None
  trainval: False
  val_intervals: 5
  val_json: None
  vis_thresh: 0.4
  wh_weight: 0.1
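
For longer runs it can help to locate the first bad epoch automatically instead of reading the log by eye. The following is a minimal sketch that assumes log lines shaped like the ones in this post (`epoch: N |name value | ...`); `parse_line` and `first_nan_epoch` are hypothetical helper names and are not part of the repository.

```python
import math
import re

# Matches "name value" pairs such as "hm_loss 134.458721" or "loss nan".
PAIR = re.compile(r"(\w+)\s+(nan|[-+]?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?)")
EPOCH = re.compile(r"epoch:\s*(\d+)")


def parse_line(line):
    """Return (epoch or None, {metric: float}) for one log line."""
    match = EPOCH.search(line)
    epoch = int(match.group(1)) if match else None
    # Skip everything up to "epoch: N" so the timestamp is not parsed.
    body = line[match.end():] if match else line
    metrics = {name: float(value) for name, value in PAIR.findall(body)}
    return epoch, metrics


def first_nan_epoch(lines):
    """Return the first epoch number whose metrics contain NaN, else None."""
    for line in lines:
        epoch, metrics = parse_line(line)
        if epoch is not None and any(math.isnan(v) for v in metrics.values()):
            return epoch
    return None


# Two lines copied from the log in this post.
sample = [
    "2020-05-28-20-51: epoch: 5 |hm_loss 1.477579 | wh_loss 0.284628 | "
    "lm_loss 1.835134 | time 2.200000 | loss 1.764335 | off_loss 0.074779 |",
    "2020-05-28-20-57: epoch: 6 |hm_loss nan | wh_loss nan | lm_loss nan | "
    "time 2.166667 | loss nan | off_loss nan |",
]
print(first_nan_epoch(sample))
```

Lines without an `epoch:` field (the untimestamped ones in the log) are still parsed by `parse_line`, but `first_nan_epoch` ignores them because they carry no epoch number.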

GYC1996 commented 4 years ago

Did you solve it? I also get NaN during training.

loveandhope commented 4 years ago

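> Did you solve it? I also get NaN during training.

It depends on the initial learning rate and the batch size, which scale linearly with each other. For example, with a batch size of 128 the lr is 5e-4; with a batch size of 32 the lr is 1.25e-4. In src/lib/opts_pose.py the default batch size is 8 and the default lr is 1.25e-4, so the two do not match; the lr needs to be changed to 3.125e-5.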
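
The linear relationship described above can be written as a one-line rule. This is only a sketch of that arithmetic, using the reference pair quoted in this comment (batch size 128, lr 5e-4); `scale_lr` is a hypothetical helper, not a function from the repository.

```python
def scale_lr(base_lr, base_batch_size, batch_size):
    """Scale a reference learning rate linearly with the batch size."""
    if base_batch_size <= 0 or batch_size <= 0:
        raise ValueError("batch sizes must be positive")
    return base_lr * batch_size / base_batch_size


# Reference pair from this thread: batch size 128 with lr 5e-4.
for bs in (128, 32, 8):
    # Printed in scientific notation; expected roughly 5e-4, 1.25e-4, 3.125e-5.
    print(f"batch_size={bs:>3}  lr={scale_lr(5e-4, 128, bs):.4e}")
```

With the default batch size of 8, the rule gives about 3.125e-5, which is one quarter of the default lr of 1.25e-4 in the configuration.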

GYC1996 commented 4 years ago

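> Did you solve it? I also get NaN during training.
> It depends on the initial learning rate and the batch size, which scale linearly with each other. For example, with a batch size of 128 the lr is 5e-4; with a batch size of 32 the lr is 1.25e-4. In src/lib/opts_pose.py the default batch size is 8 and the default lr is 1.25e-4, so the two do not match; the lr needs to be changed to 3.125e-5.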

Thanks, I'll give it a try. I had always assumed the learning rate setting didn't matter much, so I had changed it to 0.001.

YangYangGirl commented 3 years ago

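@loveandhope Did you manage to reproduce the author's training results? With mobilev2_10 and lr set to 3.125e-5, my accuracy on the easy set is only 0.87; with mobilev2_5 and lr set to 1.25e-4, the accuracy is 0.89. I'd appreciate any advice! My WeChat is 18158332186.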