Closed: deepinx closed this issue 1 year ago
When I train sdu-net on the 3D dataset, the lossvalue becomes nan after about 500 batches (the first nan is logged at batch 520), and the NME on AFLW2000-3D increases sharply (from 0.060307 at batch 400 to 0.536341 at batch 600). I don't know why.
Here are the config parameters I used.
gpu num: 1
Call with Namespace(batch_size=16, ckpt=1, ctx_num=1, dataset='i3d', exf=1, frequent=20, lr=0.00025, lr_step='16000,24000,30000', network='sdu', norm=0, optimizer='nadam', per_batch_size=16, prefix='model/sdu', pretrained='', verbose=200, wd=0.0)
{'net_binarize': False, 'net_dcn': 3, 'gaussian': 0, 'net_block': 'cab', 'dataset': 'i3d', 'record_img_size': 384, 'landmark_type': '3d', 'net_coherent': False, 'base_scale': 256, 'val_targets': ['AFLW2000-3D'], 'network': 'sdu', 'net_stacks': 2, 'losstype': 'heatmap', 'input_img_size': 128, 'net_sta': 1, 'multiplier': 1.0, 'num_classes': 68, 'output_label_size': 64, 'label_xfirst': False, 'per_batch_size': 16, 'net_n': 3, 'dataset_path': '/media/3T_disk/my_datasets/sdu_net/data_3d'}
INFO:root:loading recordio /media/3T_disk/my_datasets/sdu_net/data_3d/train.rec...
('train size', 61225)
('train size after reset', 61225)
binarize False use_coherent False use_STA 1 use_N 3 use_DCN 3 per_batch_size 16 128 64 2
Here is the output information during training.
[200][AFLW2000-3D]NME: 0.124324
saving 1
INFO:root:Saved checkpoint to "model/sdu-0001.params"
INFO:root:Epoch[0] Batch [200] Speed: 4.03 samples/sec lossvalue=0.001130
INFO:root:Epoch[0] Batch [220] Speed: 14.01 samples/sec lossvalue=0.001072
INFO:root:Epoch[0] Batch [240] Speed: 13.94 samples/sec lossvalue=0.001073
INFO:root:Epoch[0] Batch [260] Speed: 13.93 samples/sec lossvalue=0.001054
INFO:root:Epoch[0] Batch [280] Speed: 13.92 samples/sec lossvalue=0.001047
INFO:root:Epoch[0] Batch [300] Speed: 13.91 samples/sec lossvalue=0.001041
INFO:root:Epoch[0] Batch [320] Speed: 13.90 samples/sec lossvalue=0.001011
INFO:root:Epoch[0] Batch [340] Speed: 13.88 samples/sec lossvalue=0.001001
INFO:root:Epoch[0] Batch [360] Speed: 13.89 samples/sec lossvalue=0.001021
INFO:root:Epoch[0] Batch [380] Speed: 13.90 samples/sec lossvalue=0.000992
INFO:root:loading recordio /media/3T_disk/my_datasets/sdu_net/data_3d/AFLW2000-3D.rec...
('train size', 2000)
[400][AFLW2000-3D]NME: 0.060307
saving 2
INFO:root:Saved checkpoint to "model/sdu-0002.params"
INFO:root:Epoch[0] Batch [400] Speed: 4.04 samples/sec lossvalue=0.000992
INFO:root:Epoch[0] Batch [420] Speed: 13.94 samples/sec lossvalue=0.000975
INFO:root:Epoch[0] Batch [440] Speed: 13.89 samples/sec lossvalue=0.000965
INFO:root:Epoch[0] Batch [460] Speed: 13.88 samples/sec lossvalue=0.000950
INFO:root:Epoch[0] Batch [480] Speed: 13.88 samples/sec lossvalue=0.000936
INFO:root:Epoch[0] Batch [500] Speed: 13.88 samples/sec lossvalue=0.000960
INFO:root:Epoch[0] Batch [520] Speed: 14.49 samples/sec lossvalue=nan
INFO:root:Epoch[0] Batch [540] Speed: 14.55 samples/sec lossvalue=nan
INFO:root:Epoch[0] Batch [560] Speed: 14.55 samples/sec lossvalue=nan
INFO:root:Epoch[0] Batch [580] Speed: 14.57 samples/sec lossvalue=nan
INFO:root:loading recordio /media/3T_disk/my_datasets/sdu_net/data_3d/AFLW2000-3D.rec...
('train size', 2000)
[600][AFLW2000-3D]NME: 0.536341
saving 3
INFO:root:Saved checkpoint to "model/sdu-0003.params"
INFO:root:Epoch[0] Batch [600] Speed: 4.11 samples/sec lossvalue=nan
INFO:root:Epoch[0] Batch [620] Speed: 14.33 samples/sec lossvalue=nan
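To pin down exactly where the loss diverges in a longer log, a small script along these lines might help. This is only a sketch, not part of the sdu-net training code: the function name `first_nan_batch` and the regular expression are my own, and the pattern assumes the `Epoch[..] Batch [..] Speed: .. samples/sec lossvalue=..` record format shown in the output above.

```python
import math
import re

# Assumed record format, taken from the training output in this issue:
# INFO:root:Epoch[0] Batch [500] Speed: 13.88 samples/sec lossvalue=0.000960
RECORD = re.compile(
    r"Epoch\[(\d+)\] Batch \[(\d+)\]\s+Speed: [\d.]+ samples/sec\s+lossvalue=(\S+)"
)


def first_nan_batch(log_text):
    """Return (epoch, batch, last_finite_loss) for the first NaN record, or None.

    Hypothetical helper for illustration; it only reads the log text and
    does not touch the training run itself.
    """
    last_finite = None
    for match in RECORD.finditer(log_text):
        epoch, batch = int(match.group(1)), int(match.group(2))
        loss = float(match.group(3))  # float("nan") parses without error
        if math.isnan(loss):
            return epoch, batch, last_finite
        last_finite = loss
    return None


sample = (
    "INFO:root:Epoch[0] Batch [480] Speed: 13.88 samples/sec lossvalue=0.000936\n"
    "INFO:root:Epoch[0] Batch [500] Speed: 13.88 samples/sec lossvalue=0.000960\n"
    "INFO:root:Epoch[0] Batch [520] Speed: 14.49 samples/sec lossvalue=nan\n"
)
print(first_nan_batch(sample))  # (0, 520, 0.00096)
```

Because the log is only written every 20 batches (`frequent=20`), the reported batch is the first logged record with nan. The actual divergence happened somewhere in the 20 batches before it.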
Has this been resolved? I'm running into the same problem.