Loss becomes NaN - Githubissues

loveandhope commented 4 years ago

My configuration is as follows:

==> torch version: 1.2.0
==> cudnn version: 7602
==> Cmd: ['main.py', '--input_res', '512']
==> Opt:
  K: 200
  aggr_weight: 0.0
  agnostic_ex: False
  arch: mobilev2_10
  aug_ddd: 0.5
  aug_rot: 0
  batch_size: 8
  cat_spec_wh: False
  center_thresh: 0.1
  chunk_sizes: [15]
  data_dir: /home/rji/workspace/CenterFace/src/lib/../../data
  dataset: facehp
  debug: 0
  debug_dir: /home/rji/workspace/CenterFace/src/lib/../../exp/multi_pose/mobilev2_10/debug
  debugger_theme: white
  demo: /home/yangna/data/WIDER_FACE/WIDER_train/images/0--Parade/0_Parade_marchingband_1_80.jpg
  dense_hp: False
  dense_wh: False
  dep_weight: 1
  dim_weight: 1
  down_ratio: 4
  eval_oracle_dep: False
  eval_oracle_hm: False
  eval_oracle_hmhp: False
  eval_oracle_hp_offset: False
  eval_oracle_kps: False
  eval_oracle_offset: False
  eval_oracle_wh: False
  exp_dir: /home/rji/workspace/CenterFace/src/lib/../../exp/multi_pose
  exp_id: dla
  fix_res: True
  flip: 0.5
  flip_idx: [[0, 1], [3, 4]]
  flip_test: False
  gpus: [0]
  gpus_str: 0
  head_conv: 64
  heads: {'landmarks': 10, 'hm_offset': 2, 'wh': 2, 'hm': 1}
  hide_data_time: False
  hm_hp: True
  hm_hp_weight: 1
  hm_weight: 1
  input_h: 512
  input_res: 512
  input_w: 512
  keep_res: False
  kitti_split: 3dop
  lm_weight: 0.1
  load_model:
  lr: 0.000125
  lr_step: [30, 80]
  master_batch_size: 15
  mean: [[[0.40789655 0.44719303 0.47026116]]]
  metric: loss
  mse_loss: False
  nms: False
  no_color_aug: False
  norm_wh: False
  not_cuda_benchmark: False
  not_hm_hp: False
  not_prefetch_test: False
  not_rand_crop: False
  not_reg_bbox: False
  not_reg_hp_offset: False
  not_reg_offset: False
  num_classes: 1
  num_epochs: 140
  num_iters: -1
  num_stacks: 1
  num_workers: 4
  off_weight: 1
  output_h: 128
  output_res: 128
  output_video: ../output/res_3.mp4
  output_w: 128
  pad: 31
  peak_thresh: 0.2
  print_iter: 0
  rect_mask: False
  reg_bbox: True
  reg_hp_offset: True
  reg_loss: sl1
  reg_offset: True
  resume: False
  root_dir: /home/rji/workspace/CenterFace/src/lib/../..
  rot_weight: 1
  rotate: 0
  save_all: True
  save_dir: /home/rji/workspace/CenterFace/src/lib/../../exp/multi_pose/mobilev2_10
  scale: 0.4
  scores_thresh: 0.1
  seed: 317
  shift: 0.1
  std: [[[0.2886383 0.27408165 0.27809834]]]
  task: multi_pose
  test: False
  test_scales: [1.0]
  train_json: None
  trainval: False
  val_intervals: 5
  val_json: None
  vis_thresh: 0.4
  wh_weight: 0.1

GYC1996 commented 4 years ago

Did you solve this? I'm also getting NaN during training.

loveandhope commented 4 years ago

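> Did you solve this? I'm also getting NaN during training.

It comes down to the initial learning rate, which should scale linearly with the batch size. For example, with a batch size of 128 the lr is 5e-4, and with a batch size of 32 the lr is 1.25e-4. In src/lib/opts_pose.py the default batch size is 8 but the default lr is 1.25e-4, so the two do not match. The lr needs to be changed to 3.125e-5.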
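The linear relationship described above can be checked with a few lines of Python. This is only a sketch of the arithmetic from this comment: the helper name `scaled_lr` and the constants are hypothetical and are not part of CenterFace.pytorch; the reference pair (batch size 128, lr 5e-4) is the one quoted in the thread.

```python
# Reference pair quoted in the thread: batch size 128 with lr 5e-4.
# These names are illustrative assumptions, not identifiers from the repository.
BASE_BATCH_SIZE = 128
BASE_LR = 5e-4


def scaled_lr(batch_size, base_lr=BASE_LR, base_batch_size=BASE_BATCH_SIZE):
    """Scale the learning rate linearly with the batch size."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return base_lr * batch_size / base_batch_size


if __name__ == "__main__":
    # Print the learning rate for the batch sizes mentioned in the thread.
    for bs in (128, 32, 8):
        print(f"batch_size={bs:>3}  lr={scaled_lr(bs):.4g}")
```

For a batch size of 8 this gives 3.125e-5, the value suggested above for src/lib/opts_pose.py. Whether that value also reproduces the reported accuracy is a separate question that this sketch does not address.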

GYC1996 commented 4 years ago

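> Did you solve this? I'm also getting NaN during training.
>
> It comes down to the initial learning rate, which should scale linearly with the batch size. For example, with a batch size of 128 the lr is 5e-4, and with a batch size of 32 the lr is 1.25e-4. In src/lib/opts_pose.py the default batch size is 8 but the default lr is 1.25e-4, so the two do not match. The lr needs to be changed to 3.125e-5.

Thanks, I'll give it a try. I had always assumed the learning rate setting didn't matter much, so I changed it to 0.001.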

YangYangGirl commented 3 years ago

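@loveandhope Have you managed to reproduce the author's training results? With mobilev2_10 and lr set to 3.125e-5, I only get 0.87 accuracy on the easy set; with mobilev2_5 and lr set to 1.25e-4, the accuracy is 0.89. I'd appreciate your advice! My WeChat is 18158332186.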

chenjun2hao / CenterFace.pytorch

Loss becomes NaN #24