deepinsight / insightface

State-of-the-art 2D and 3D Face Analysis Project
https://insightface.ai

Lossvalue becomes nan when training sdu-net with 3D datasets #544

Closed deepinx closed 1 year ago

deepinx commented 5 years ago

When I train sdu-net on the 3D dataset, the loss value becomes NaN after about 500 batches, and the NME on AFLW2000-3D increases sharply (from 0.060307 at batch 400 to 0.536341 at batch 600). I don't know why.

Here are the config parameters I used.

gpu num: 1
Call with Namespace(batch_size=16, ckpt=1, ctx_num=1, dataset='i3d', exf=1, frequent=20, lr=0.00025, lr_step='16000,24000,30000', network='sdu', norm=0, optimizer='nadam', per_batch_size=16, prefix='model/sdu', pretrained='', verbose=200, wd=0.0) {'net_binarize': False, 'net_dcn': 3, 'gaussian': 0, 'net_block': 'cab', 'dataset': 'i3d', 'record_img_size': 384, 'landmark_type': '3d', 'net_coherent': False, 'base_scale': 256, 'val_targets': ['AFLW2000-3D'], 'network': 'sdu', 'net_stacks': 2, 'losstype': 'heatmap', 'input_img_size': 128, 'net_sta': 1, 'multiplier': 1.0, 'num_classes': 68, 'output_label_size': 64, 'label_xfirst': False, 'per_batch_size': 16, 'net_n': 3, 'dataset_path': '/media/3T_disk/my_datasets/sdu_net/data_3d'}
INFO:root:loading recordio /media/3T_disk/my_datasets/sdu_net/data_3d/train.rec...
('train size', 61225)
('train size after reset', 61225)
binarize False
use_coherent False
use_STA 1
use_N 3
use_DCN 3
per_batch_size 16
128 64 2

Here is the output information during training.

[200][AFLW2000-3D]NME: 0.124324
saving 1
INFO:root:Saved checkpoint to "model/sdu-0001.params"
INFO:root:Epoch[0] Batch [200]  Speed: 4.03 samples/sec lossvalue=0.001130
INFO:root:Epoch[0] Batch [220]  Speed: 14.01 samples/sec        lossvalue=0.001072
INFO:root:Epoch[0] Batch [240]  Speed: 13.94 samples/sec        lossvalue=0.001073
INFO:root:Epoch[0] Batch [260]  Speed: 13.93 samples/sec        lossvalue=0.001054
INFO:root:Epoch[0] Batch [280]  Speed: 13.92 samples/sec        lossvalue=0.001047
INFO:root:Epoch[0] Batch [300]  Speed: 13.91 samples/sec        lossvalue=0.001041
INFO:root:Epoch[0] Batch [320]  Speed: 13.90 samples/sec        lossvalue=0.001011
INFO:root:Epoch[0] Batch [340]  Speed: 13.88 samples/sec        lossvalue=0.001001
INFO:root:Epoch[0] Batch [360]  Speed: 13.89 samples/sec        lossvalue=0.001021
INFO:root:Epoch[0] Batch [380]  Speed: 13.90 samples/sec        lossvalue=0.000992
INFO:root:loading recordio /media/3T_disk/my_datasets/sdu_net/data_3d/AFLW2000-3D.rec...
('train size', 2000)
[400][AFLW2000-3D]NME: 0.060307
saving 2
INFO:root:Saved checkpoint to "model/sdu-0002.params"
INFO:root:Epoch[0] Batch [400]  Speed: 4.04 samples/sec lossvalue=0.000992
INFO:root:Epoch[0] Batch [420]  Speed: 13.94 samples/sec        lossvalue=0.000975
INFO:root:Epoch[0] Batch [440]  Speed: 13.89 samples/sec        lossvalue=0.000965
INFO:root:Epoch[0] Batch [460]  Speed: 13.88 samples/sec        lossvalue=0.000950
INFO:root:Epoch[0] Batch [480]  Speed: 13.88 samples/sec        lossvalue=0.000936
INFO:root:Epoch[0] Batch [500]  Speed: 13.88 samples/sec        lossvalue=0.000960
INFO:root:Epoch[0] Batch [520]  Speed: 14.49 samples/sec        lossvalue=nan
INFO:root:Epoch[0] Batch [540]  Speed: 14.55 samples/sec        lossvalue=nan
INFO:root:Epoch[0] Batch [560]  Speed: 14.55 samples/sec        lossvalue=nan
INFO:root:Epoch[0] Batch [580]  Speed: 14.57 samples/sec        lossvalue=nan
INFO:root:loading recordio /media/3T_disk/my_datasets/sdu_net/data_3d/AFLW2000-3D.rec...
('train size', 2000)
[600][AFLW2000-3D]NME: 0.536341
saving 3
INFO:root:Saved checkpoint to "model/sdu-0003.params"
INFO:root:Epoch[0] Batch [600]  Speed: 4.11 samples/sec lossvalue=nan
INFO:root:Epoch[0] Batch [620]  Speed: 14.33 samples/sec        lossvalue=nan
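
To narrow down where the divergence starts, a log like the one above can be scanned for the first NaN entry. The following is only a minimal sketch: the helper name `first_nan_batch` and the regular expression are assumptions written against the log format shown in this issue, not part of the insightface training scripts.

```python
import math
import re

# Matches lines such as:
# INFO:root:Epoch[0] Batch [520]  Speed: 14.49 samples/sec  lossvalue=nan
# (pattern is an assumption based on the log format pasted in this issue)
LOG_LINE = re.compile(r"Epoch\[(\d+)\] Batch \[(\d+)\].*?lossvalue=(\S+)")


def first_nan_batch(lines):
    """Return (epoch, batch, last_finite_loss) for the first NaN loss, or None."""
    last_finite = None
    for line in lines:
        match = LOG_LINE.search(line)
        if match is None:
            continue  # skip NME, checkpoint, and recordio lines
        epoch = int(match.group(1))
        batch = int(match.group(2))
        loss = float(match.group(3))  # float("nan") parses to NaN
        if math.isnan(loss):
            return epoch, batch, last_finite
        last_finite = loss
    return None


# Three lines copied from the log in this issue.
sample = [
    "INFO:root:Epoch[0] Batch [480]  Speed: 13.88 samples/sec        lossvalue=0.000936",
    "INFO:root:Epoch[0] Batch [500]  Speed: 13.88 samples/sec        lossvalue=0.000960",
    "INFO:root:Epoch[0] Batch [520]  Speed: 14.49 samples/sec        lossvalue=nan",
]
result = first_nan_batch(sample)
print(result)
```

With `frequent=20`, this only locates the NaN to within a 20-batch window (here, between batches 500 and 520). The last checkpoint saved before that window is `model/sdu-0002.params`, written at batch 400.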

xjock commented 3 years ago

Has this been resolved? I'm running into the same problem.