NVIDIA / caffe

Caffe: a fast open framework for deep learning.
http://caffe.berkeleyvision.org/

Converting fp32 models to fp16 #499

Closed abdelrahman-gaber closed 5 years ago

abdelrahman-gaber commented 6 years ago

Hi,

I am training a model based on SSD. Normally I use the VGG pretrained model to initialize the weights, which gives much better performance. However, when I run the training in fp16 mode, the initialization from the VGG model does not work, and I get wrong loss values during training: mbox_loss = inf.

So, is there any way to convert an fp32 model to fp16? I searched for a VGG pretrained model in fp16 mode but couldn't find any. Are there pretrained models like this that I don't know about (a kind of model zoo for NVIDIA Caffe)?

Thanks in advance.
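(No official fp32-to-fp16 converter is mentioned in this thread, so the following is only a sketch of what the cast involves, using NumPy arrays as stand-ins for caffemodel blobs; the helper name is made up.)

```python
import numpy as np

def to_fp16(weights):
    """Cast an fp32 weight array to fp16, reporting how many values
    fall outside the representable fp16 range (~6e-8 .. 65504)."""
    w16 = weights.astype(np.float16)
    overflow = np.isinf(w16) & np.isfinite(weights)   # became inf in fp16
    underflow = (w16 == 0) & (weights != 0)           # flushed to zero
    return w16, int(overflow.sum()), int(underflow.sum())

# Typical conv weights fit comfortably in fp16...
w = np.array([0.01, -0.2, 3.5], dtype=np.float32)
w16, n_over, n_under = to_fp16(w)

# ...but out-of-range values silently become inf or 0,
# which is one way a finetuned net can end up with loss = inf.
bad = np.array([1e5, 1e-8], dtype=np.float32)
b16, b_over, b_under = to_fp16(bad)
```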

drnikolaev commented 6 years ago

Hi @abdelrahman-gaber, it's a known bug; a fix is coming...

abdelrahman-gaber commented 6 years ago

Hi @drnikolaev, thank you for your reply. Do you have any estimate of how long the fix will take? Also, will it cover the problem in the last comment of https://github.com/NVIDIA/caffe/issues/490 ?

Thank you.

drnikolaev commented 6 years ago

@abdelrahman-gaber sorry, swamped with multiple issues...

bonseyes-admin commented 6 years ago

Hi - any update on this issue for training fp16 models?

drnikolaev commented 6 years ago

@abdelrahman-gaber could you verify https://github.com/drnikolaev/caffe/tree/caffe-0.17 release candidate?

abdelrahman-gaber commented 6 years ago

So I tested my SFD model with the same VGG pretrained model, and the problem is still there: the loss is inf, nan, and -nan! This is the output I got after running the training.

` I0820 22:20:21.132539 9094 caffe.cpp:155] Finetuning from /home/ubuntu/caffe-NVIDIA/models/VGGNet/VGG_ILSVRC_16_layers_fc_reduced.caffemodel I0820 22:20:21.181888 9094 upgrade_proto.cpp:66] Attempting to upgrade input file specified using deprecated input fields: /home/ubuntu/caffe-NVIDIA/models/VGGNet/VGG_ILSVRC_16_layers_fc_reduced.caffemodel I0820 22:20:21.181924 9094 upgrade_proto.cpp:69] Successfully upgraded file specified using deprecated input fields. W0820 22:20:21.181931 9094 upgrade_proto.cpp:71] Note that future Caffe releases will only support input layers and not input fields. I0820 22:20:21.182101 9094 net.cpp:1138] Copying source layer conv1_1 Type:Convolution #blobs=2 I0820 22:20:21.182163 9094 net.cpp:1138] Copying source layer relu1_1 Type:ReLU #blobs=0 I0820 22:20:21.182174 9094 net.cpp:1138] Copying source layer conv1_2 Type:Convolution #blobs=2 I0820 22:20:21.182714 9094 net.cpp:1138] Copying source layer relu1_2 Type:ReLU #blobs=0 I0820 22:20:21.182725 9094 net.cpp:1138] Copying source layer pool1 Type:Pooling #blobs=0 I0820 22:20:21.182731 9094 net.cpp:1138] Copying source layer conv2_1 Type:Convolution #blobs=2 I0820 22:20:21.183720 9094 net.cpp:1138] Copying source layer relu2_1 Type:ReLU #blobs=0 I0820 22:20:21.183732 9094 net.cpp:1138] Copying source layer conv2_2 Type:Convolution #blobs=2 I0820 22:20:21.185655 9094 net.cpp:1138] Copying source layer relu2_2 Type:ReLU #blobs=0 I0820 22:20:21.185667 9094 net.cpp:1138] Copying source layer pool2 Type:Pooling #blobs=0 I0820 22:20:21.185672 9094 net.cpp:1138] Copying source layer conv3_1 Type:Convolution #blobs=2 I0820 22:20:21.189498 9094 net.cpp:1138] Copying source layer relu3_1 Type:ReLU #blobs=0 I0820 22:20:21.189510 9094 net.cpp:1138] Copying source layer conv3_2 Type:Convolution #blobs=2 I0820 22:20:21.197180 9094 net.cpp:1138] Copying source layer relu3_2 Type:ReLU #blobs=0 I0820 22:20:21.197193 9094 net.cpp:1138] Copying source layer conv3_3 Type:Convolution #blobs=2 I0820 
22:20:21.204839 9094 net.cpp:1138] Copying source layer relu3_3 Type:ReLU #blobs=0 I0820 22:20:21.204852 9094 net.cpp:1138] Copying source layer pool3 Type:Pooling #blobs=0 I0820 22:20:21.204856 9094 net.cpp:1138] Copying source layer conv4_1 Type:Convolution #blobs=2 I0820 22:20:21.220011 9094 net.cpp:1138] Copying source layer relu4_1 Type:ReLU #blobs=0 I0820 22:20:21.220034 9094 net.cpp:1138] Copying source layer conv4_2 Type:Convolution #blobs=2 I0820 22:20:21.251163 9094 net.cpp:1138] Copying source layer relu4_2 Type:ReLU #blobs=0 I0820 22:20:21.251199 9094 net.cpp:1138] Copying source layer conv4_3 Type:Convolution #blobs=2 I0820 22:20:21.281571 9094 net.cpp:1138] Copying source layer relu4_3 Type:ReLU #blobs=0 I0820 22:20:21.281601 9094 net.cpp:1138] Copying source layer pool4 Type:Pooling #blobs=0 I0820 22:20:21.281606 9094 net.cpp:1138] Copying source layer conv5_1 Type:Convolution #blobs=2 I0820 22:20:21.312028 9094 net.cpp:1138] Copying source layer relu5_1 Type:ReLU #blobs=0 I0820 22:20:21.312058 9094 net.cpp:1138] Copying source layer conv5_2 Type:Convolution #blobs=2 I0820 22:20:21.342797 9094 net.cpp:1138] Copying source layer relu5_2 Type:ReLU #blobs=0 I0820 22:20:21.342823 9094 net.cpp:1138] Copying source layer conv5_3 Type:Convolution #blobs=2 I0820 22:20:21.373127 9094 net.cpp:1138] Copying source layer relu5_3 Type:ReLU #blobs=0 I0820 22:20:21.373154 9094 net.cpp:1138] Copying source layer pool5 Type:Pooling #blobs=0 I0820 22:20:21.373158 9094 net.cpp:1138] Copying source layer fc6 Type:Convolution #blobs=2 I0820 22:20:21.433754 9094 net.cpp:1138] Copying source layer relu6 Type:ReLU #blobs=0 I0820 22:20:21.433810 9094 net.cpp:1130] Ignoring source layer drop6 I0820 22:20:21.433816 9094 net.cpp:1138] Copying source layer fc7 Type:Convolution #blobs=2 I0820 22:20:21.447408 9094 net.cpp:1138] Copying source layer relu7 Type:ReLU #blobs=0 I0820 22:20:21.447443 9094 net.cpp:1130] Ignoring source layer drop7 I0820 22:20:21.447448 9094 net.cpp:1130] 
Ignoring source layer fc8 I0820 22:20:21.447453 9094 net.cpp:1130] Ignoring source layer prob I0820 22:20:21.487167 9094 upgrade_proto.cpp:66] Attempting to upgrade input file specified using deprecated input fields: /home/ubuntu/caffe-NVIDIA/models/VGGNet/VGG_ILSVRC_16_layers_fc_reduced.caffemodel I0820 22:20:21.487201 9094 upgrade_proto.cpp:69] Successfully upgraded file specified using deprecated input fields. W0820 22:20:21.487206 9094 upgrade_proto.cpp:71] Note that future Caffe releases will only support input layers and not input fields. I0820 22:20:21.487224 9094 net.cpp:1138] Copying source layer conv1_1 Type:Convolution #blobs=2 I0820 22:20:21.487272 9094 net.cpp:1138] Copying source layer relu1_1 Type:ReLU #blobs=0 I0820 22:20:21.487284 9094 net.cpp:1138] Copying source layer conv1_2 Type:Convolution #blobs=2 I0820 22:20:21.487802 9094 net.cpp:1138] Copying source layer relu1_2 Type:ReLU #blobs=0 I0820 22:20:21.487814 9094 net.cpp:1138] Copying source layer pool1 Type:Pooling #blobs=0 I0820 22:20:21.487818 9094 net.cpp:1138] Copying source layer conv2_1 Type:Convolution #blobs=2 I0820 22:20:21.488821 9094 net.cpp:1138] Copying source layer relu2_1 Type:ReLU #blobs=0 I0820 22:20:21.488833 9094 net.cpp:1138] Copying source layer conv2_2 Type:Convolution #blobs=2 I0820 22:20:21.490947 9094 net.cpp:1138] Copying source layer relu2_2 Type:ReLU #blobs=0 I0820 22:20:21.490960 9094 net.cpp:1138] Copying source layer pool2 Type:Pooling #blobs=0 I0820 22:20:21.490964 9094 net.cpp:1138] Copying source layer conv3_1 Type:Convolution #blobs=2 I0820 22:20:21.494841 9094 net.cpp:1138] Copying source layer relu3_1 Type:ReLU #blobs=0 I0820 22:20:21.494858 9094 net.cpp:1138] Copying source layer conv3_2 Type:Convolution #blobs=2 I0820 22:20:21.502475 9094 net.cpp:1138] Copying source layer relu3_2 Type:ReLU #blobs=0 I0820 22:20:21.502493 9094 net.cpp:1138] Copying source layer conv3_3 Type:Convolution #blobs=2 I0820 22:20:21.510107 9094 net.cpp:1138] Copying source layer 
relu3_3 Type:ReLU #blobs=0 I0820 22:20:21.510120 9094 net.cpp:1138] Copying source layer pool3 Type:Pooling #blobs=0 I0820 22:20:21.510124 9094 net.cpp:1138] Copying source layer conv4_1 Type:Convolution #blobs=2 I0820 22:20:21.525333 9094 net.cpp:1138] Copying source layer relu4_1 Type:ReLU #blobs=0 I0820 22:20:21.525362 9094 net.cpp:1138] Copying source layer conv4_2 Type:Convolution #blobs=2 I0820 22:20:21.555721 9094 net.cpp:1138] Copying source layer relu4_2 Type:ReLU #blobs=0 I0820 22:20:21.555745 9094 net.cpp:1138] Copying source layer conv4_3 Type:Convolution #blobs=2 I0820 22:20:21.586063 9094 net.cpp:1138] Copying source layer relu4_3 Type:ReLU #blobs=0 I0820 22:20:21.586097 9094 net.cpp:1138] Copying source layer pool4 Type:Pooling #blobs=0 I0820 22:20:21.586102 9094 net.cpp:1138] Copying source layer conv5_1 Type:Convolution #blobs=2 I0820 22:20:21.616988 9094 net.cpp:1138] Copying source layer relu5_1 Type:ReLU #blobs=0 I0820 22:20:21.617018 9094 net.cpp:1138] Copying source layer conv5_2 Type:Convolution #blobs=2 I0820 22:20:21.647490 9094 net.cpp:1138] Copying source layer relu5_2 Type:ReLU #blobs=0 I0820 22:20:21.647521 9094 net.cpp:1138] Copying source layer conv5_3 Type:Convolution #blobs=2 I0820 22:20:21.678678 9094 net.cpp:1138] Copying source layer relu5_3 Type:ReLU #blobs=0 I0820 22:20:21.678704 9094 net.cpp:1138] Copying source layer pool5 Type:Pooling #blobs=0 I0820 22:20:21.678709 9094 net.cpp:1138] Copying source layer fc6 Type:Convolution #blobs=2 I0820 22:20:21.739503 9094 net.cpp:1138] Copying source layer relu6 Type:ReLU #blobs=0 I0820 22:20:21.739529 9094 net.cpp:1130] Ignoring source layer drop6 I0820 22:20:21.739557 9094 net.cpp:1138] Copying source layer fc7 Type:Convolution #blobs=2 I0820 22:20:21.753051 9094 net.cpp:1138] Copying source layer relu7 Type:ReLU #blobs=0 I0820 22:20:21.753077 9094 net.cpp:1130] Ignoring source layer drop7 I0820 22:20:21.753080 9094 net.cpp:1130] Ignoring source layer fc8 I0820 22:20:21.753084 9094 
net.cpp:1130] Ignoring source layer prob I0820 22:20:21.753150 9094 caffe.cpp:257] Starting Optimization I0820 22:20:21.753182 9094 solver.cpp:417] [0.0] Solving SFD_fp16_320x320_WiderFace_train Learning Rate Policy: multistep I0820 22:20:21.753229 9094 net.cpp:1419] [0.0] Reserving 44910464 bytes of shared learnable space for type FLOAT16 I0820 22:20:21.765022 9094 solver.cpp:257] Initial Test started... I0820 22:20:21.765064 9094 solver.cpp:599] Iteration 0, Testing net (#0) I0820 22:20:21.766458 9124 common.cpp:519] NVML initialized, thread 9124 I0820 22:20:21.774025 9094 net.cpp:1065] Ignoring source layer mbox_loss I0820 22:20:21.796170 9124 common.cpp:541] {0} NVML succeeded to set CPU affinity I0820 22:23:24.692744 9108 data_reader.cpp:321] Restarting data pre-fetching I0820 22:23:37.149575 9094 solver.cpp:717] Test net output mAP #0: detection_eval = 0.000606061 I0820 22:23:37.149849 9094 solver.cpp:262] Initial Test completed in 195.386s I0820 22:23:37.150450 9105 internal_thread.cpp:42] Restarting 4 internal thread(s) on device 0 I0820 22:23:37.150821 9105 internal_thread.cpp:18] Starting 1 internal thread(s) on device 0 I0820 22:23:37.151573 9105 data_reader.cpp:59] Data Reader threads: 3, out queues: 12, depth: 32 I0820 22:23:37.151578 9137 common.cpp:541] {0} NVML succeeded to set CPU affinity I0820 22:23:37.151723 9105 internal_thread.cpp:18] Starting 3 internal thread(s) on device 0 I0820 22:23:37.152441 9138 db_lmdb.cpp:36] Opened lmdb /home/ubuntu/data/WIDER_FACE/lmdb/WIDER_FACE_train_lmdb_NVCaffe I0820 22:23:37.153028 9139 db_lmdb.cpp:36] Opened lmdb /home/ubuntu/data/WIDER_FACE/lmdb/WIDER_FACE_train_lmdb_NVCaffe I0820 22:23:37.153642 9140 db_lmdb.cpp:36] Opened lmdb /home/ubuntu/data/WIDER_FACE/lmdb/WIDER_FACE_train_lmdb_NVCaffe I0820 22:23:37.157697 9105 annotated_data_layer.cpp:105] output data size: 32,3,320,320 I0820 22:23:37.158293 9105 annotated_data_layer.cpp:150] [n0.d0.r0] Output data size: 32, 3, 320, 320 I0820 22:23:37.158370 9105 
data_layer.cpp:105] [n0.d0.r0] Parser threads: 3 (auto) I0820 22:23:37.158386 9105 data_layer.cpp:107] [n0.d0.r0] Transformer threads: 4 (auto) I0820 22:23:39.731691 9094 solver.cpp:341] [0.0] Iteration 0 (2.5818 s), loss = 8.64041 I0820 22:23:39.731930 9094 solver.cpp:361] [0.0] Train net output #0: mbox_loss = 8.64041 ( 1 = 8.64041 loss) I0820 22:23:39.732014 9094 sgd_solver.cpp:180] [0.0] Iteration 0, lr = 0.001, m = 0.9, lrm = 0.01, wd = 0.0005, gs = 1 I0820 22:23:41.896467 9094 solver.cpp:341] [0.0] Iteration 1 (2.16473 s), loss = 13.3003 I0820 22:23:41.896543 9094 solver.cpp:361] [0.0] Train net output #0: mbox_loss = 17.9603 ( 1 = 17.9603 loss) I0820 22:23:42.439561 9094 cudnn_conv_layer.cpp:849] [n0.d0.r0] Conv Algos (F,BD,BF): 'conv1_1' with space 0.61M 3/1 1Tp 1 1T (avail 7.78G, req 0.61M) t: 0 0 6.13 I0820 22:23:44.960943 9094 cudnn_conv_layer.cpp:849] [n0.d0.r0] Conv Algos (F,BD,BF): 'conv1_2' with space 1.89G 64/1 1Tp 1Tp 1Tp (avail 5.89G, req 1.89G) t: 0 4.21 19.04 I0820 22:23:46.128866 9094 cudnn_conv_layer.cpp:849] [n0.d0.r0] Conv Algos (F,BD,BF): 'conv2_1' with space 1.89G 64/1 1Tp 1Tp 1Tp (avail 5.89G, req 1.89G) t: 0 2.06 5.03 I0820 22:23:47.524387 9094 cudnn_conv_layer.cpp:849] [n0.d0.r0] Conv Algos (F,BD,BF): 'conv2_2' with space 1.89G 128/1 1Tp 1Tp 1Tp (avail 5.89G, req 1.89G) t: 0 3.12 5.26 I0820 22:23:48.428375 9094 cudnn_conv_layer.cpp:849] [n0.d0.r0] Conv Algos (F,BD,BF): 'conv3_1' with space 1.89G 128/1 7T 5T 1Tp (avail 5.89G, req 1.89G) t: 0 1.2 1.67 I0820 22:23:49.572372 9094 cudnn_conv_layer.cpp:849] [n0.d0.r0] Conv Algos (F,BD,BF): 'conv3_2' with space 1.89G 256/1 7T 5T 1Tp (avail 5.89G, req 1.89G) t: 0 1.84 2.78 I0820 22:23:50.712419 9094 cudnn_conv_layer.cpp:849] [n0.d0.r0] Conv Algos (F,BD,BF): 'conv3_3' with space 1.89G 256/1 7T 5T 1Tp (avail 5.89G, req 1.89G) t: 0 1.83 2.77 I0820 22:23:51.520373 9094 cudnn_conv_layer.cpp:849] [n0.d0.r0] Conv Algos (F,BD,BF): 'conv4_1' with space 1.89G 256/1 7T 5T 1Tp (avail 5.89G, req 1.89G) t: 0 
0.74 1.45 I0820 22:23:52.625185 9094 cudnn_conv_layer.cpp:849] [n0.d0.r0] Conv Algos (F,BD,BF): 'conv4_2' with space 1.89G 512/1 7T 5T 5T (avail 5.89G, req 1.89G) t: 0 1.17 2.69 I0820 22:23:53.700377 9094 cudnn_conv_layer.cpp:849] [n0.d0.r0] Conv Algos (F,BD,BF): 'conv4_3' with space 1.89G 512/1 7T 5T 5T (avail 5.89G, req 1.89G) t: 0 1.18 2.69 I0820 22:23:54.272415 9094 cudnn_conv_layer.cpp:849] [n0.d0.r0] Conv Algos (F,BD,BF): 'conv5_1' with space 1.89G 512/1 7T 5T 5 (avail 5.89G, req 1.89G) t: 0 0.39 0.73 I0820 22:23:54.857569 9094 cudnn_conv_layer.cpp:849] [n0.d0.r0] Conv Algos (F,BD,BF): 'conv5_2' with space 1.89G 512/1 7T 5T 5T (avail 5.89G, req 1.89G) t: 0 0.37 0.73 I0820 22:23:55.436425 9094 cudnn_conv_layer.cpp:849] [n0.d0.r0] Conv Algos (F,BD,BF): 'conv5_3' with space 1.89G 512/1 7T 5T 5 (avail 5.89G, req 1.89G) t: 0 0.38 0.73 I0820 22:23:55.800324 9094 cudnn_conv_layer.cpp:849] [n0.d0.r0] Conv Algos (F,BD,BF): 'fc7' with space 1.89G 1024/1 1T 1T 1Tp (avail 5.89G, req 1.89G) t: 0 0.16 0.27 I0820 22:23:56.108358 9094 cudnn_conv_layer.cpp:849] [n0.d0.r0] Conv Algos (F,BD,BF): 'conv6_1' with space 1.89G 1024/1 1T 1Tp 1Tp (avail 5.89G, req 1.89G) t: 0 0.08 0.14 I0820 22:23:56.408141 9094 cudnn_conv_layer.cpp:849] [n0.d0.r0] Conv Algos (F,BD,BF): 'conv6_2' with space 1.89G 256/1 1Tp 0p 1Tp (avail 5.89G, req 1.89G) t: 0 0.39 0.09 I0820 22:23:56.700345 9094 cudnn_conv_layer.cpp:849] [n0.d0.r0] Conv Algos (F,BD,BF): 'conv7_1' with space 1.89G 512/1 1T 1T 1Tp (avail 5.89G, req 1.89G) t: 0 0.04 0.07 I0820 22:23:56.992316 9094 cudnn_conv_layer.cpp:849] [n0.d0.r0] Conv Algos (F,BD,BF): 'conv7_2' with space 1.89G 128/1 1T 0p 1Tp (avail 5.89G, req 1.89G) t: 0 0.14 0.05 I0820 22:23:57.672441 9094 cudnn_conv_layer.cpp:849] [n0.d0.r0] Conv Algos (F,BD,BF): 'conv3_3_norm_mbox_loc' with space 1.89G 256/1 7T 1Tp 5T (avail 5.89G, req 1.89G) t: 0 0.39 1.74 I0820 22:23:58.364471 9094 cudnn_conv_layer.cpp:849] [n0.d0.r0] Conv Algos (F,BD,BF): 'conv3_3_norm_mbox_conf' with space 
1.89G 256/1 7T 1Tp 5T (avail 5.89G, req 1.89G) t: 0 0.31 1.7 I0820 22:23:58.828485 9094 cudnn_conv_layer.cpp:849] [n0.d0.r0] Conv Algos (F,BD,BF): 'conv4_3_norm_mbox_loc' with space 1.89G 512/1 7T 1Tp 5 (avail 5.89G, req 1.89G) t: 0 0.21 0.81 I0820 22:23:59.288451 9094 cudnn_conv_layer.cpp:849] [n0.d0.r0] Conv Algos (F,BD,BF): 'conv4_3_norm_mbox_conf' with space 1.89G 512/1 7T 1Tp 5 (avail 5.89G, req 1.89G) t: 0 0.17 0.81 I0820 22:23:59.636363 9094 cudnn_conv_layer.cpp:849] [n0.d0.r0] Conv Algos (F,BD,BF): 'conv5_3_norm_mbox_loc' with space 1.89G 512/1 7T 1Tp 5T (avail 5.89G, req 1.89G) t: 0 0.08 0.23 I0820 22:23:59.980507 9094 cudnn_conv_layer.cpp:849] [n0.d0.r0] Conv Algos (F,BD,BF): 'conv5_3_norm_mbox_conf' with space 1.89G 512/1 7T 1Tp 5T (avail 5.89G, req 1.89G) t: 0 0.07 0.23 I0820 22:24:00.308429 9094 cudnn_conv_layer.cpp:849] [n0.d0.r0] Conv Algos (F,BD,BF): 'fc7_mbox_loc' with space 1.89G 1024/1 7T 1p 5 (avail 5.89G, req 1.89G) t: 0 0.06 0.13 I0820 22:24:00.636428 9094 cudnn_conv_layer.cpp:849] [n0.d0.r0] Conv Algos (F,BD,BF): 'fc7_mbox_conf' with space 1.89G 1024/1 7T 1p 5 (avail 5.89G, req 1.89G) t: 0 0.05 0.13 I0820 22:24:00.940327 9094 cudnn_conv_layer.cpp:849] [n0.d0.r0] Conv Algos (F,BD,BF): 'conv6_2_mbox_loc' with space 1.89G 512/1 7T 1Tp 5T (avail 5.89G, req 1.89G) t: 0 0.04 0.06 I0820 22:24:01.244448 9094 cudnn_conv_layer.cpp:849] [n0.d0.r0] Conv Algos (F,BD,BF): 'conv6_2_mbox_conf' with space 1.89G 512/1 7T 1Tp 5 (avail 5.89G, req 1.89G) t: 0 0.03 0.06 I0820 22:24:01.540439 9094 cudnn_conv_layer.cpp:849] [n0.d0.r0] Conv Algos (F,BD,BF): 'conv7_2_mbox_loc' with space 1.89G 256/1 7T 5T 5Tp (avail 5.89G, req 1.89G) t: 0 0.03 0.03 I0820 22:24:01.836447 9094 cudnn_conv_layer.cpp:849] [n0.d0.r0] Conv Algos (F,BD,BF): 'conv7_2_mbox_conf' with space 1.89G 256/1 7T 5Tp 5T (avail 5.89G, req 1.89G) t: 0 0.03 0.03 I0820 22:24:02.148686 9094 solver.cpp:341] [0.0] Iteration 2 (20.2523 s), loss = inf I0820 22:24:02.148737 9094 solver.cpp:361] [0.0] Train net 
output #0: mbox_loss = inf ( 1 = inf loss) I0820 22:24:05.995604 9094 solver.cpp:333] [0.0] Iteration 10 (2.0796 iter/s, 3.8469s/8 iter), loss = inf I0820 22:24:05.995671 9094 solver.cpp:361] [0.0] Train net output #0: mbox_loss = inf ( 1 = inf loss) I0820 22:24:05.995687 9094 sgd_solver.cpp:180] [0.0] Iteration 10, lr = 0.001, m = 0.9, lrm = 0.01, wd = 0.0005, gs = 32 I0820 22:24:11.887779 9094 solver.cpp:333] [0.0] Iteration 20 (1.69716 iter/s, 5.8922s/10 iter), loss = -nan I0820 22:24:11.887830 9094 solver.cpp:361] [0.0] Train net output #0: mbox_loss = inf ( 1 = inf loss) I0820 22:24:11.887846 9094 sgd_solver.cpp:180] [0.0] Iteration 20, lr = 0.001, m = 0.9, lrm = 0.01, wd = 0.0005, gs = 32 I0820 22:24:17.729486 9094 solver.cpp:333] [0.0] Iteration 30 (1.71182 iter/s, 5.84174s/10 iter), loss = -nan I0820 22:24:17.729537 9094 solver.cpp:361] [0.0] Train net output #0: mbox_loss = inf ( 1 = inf loss) I0820 22:24:17.729553 9094 sgd_solver.cpp:180] [0.0] Iteration 30, lr = 0.001, m = 0.9, lrm = 0.01, wd = 0.0005, gs = 32 I0820 22:24:23.432073 9094 solver.cpp:333] [0.0] Iteration 40 (1.7536 iter/s, 5.70255s/10 iter), loss = -nan I0820 22:24:23.432137 9094 solver.cpp:361] [0.0] Train net output #0: mbox_loss = inf ( 1 = inf loss) I0820 22:24:23.432152 9094 sgd_solver.cpp:180] [0.0] Iteration 40, lr = 0.001, m = 0.9, lrm = 0.01, wd = 0.0005, gs = 32 I0820 22:24:28.478961 9094 solver.cpp:333] [0.0] Iteration 50 (1.98142 iter/s, 5.04689s/10 iter), loss = -nan I0820 22:24:28.479159 9094 solver.cpp:361] [0.0] Train net output #0: mbox_loss = inf ( 1 = inf loss) I0820 22:24:28.479177 9094 sgd_solver.cpp:180] [0.0] Iteration 50, lr = 0.001, m = 0.9, lrm = 0.01, wd = 0.0005, gs = 32 I0820 22:24:36.274559 9094 solver.cpp:333] [0.0] Iteration 60 (1.28277 iter/s, 7.79561s/10 iter), loss = -nan I0820 22:24:36.274610 9094 solver.cpp:361] [0.0] Train net output #0: mbox_loss = inf ( 1 = inf loss) I0820 22:24:36.274621 9094 sgd_solver.cpp:180] [0.0] Iteration 60, lr = 0.001, m = 
0.9, lrm = 0.01, wd = 0.0005, gs = 32 I0820 22:24:41.894789 9094 solver.cpp:333] [0.0] Iteration 70 (1.77929 iter/s, 5.62021s/10 iter), loss = -nan I0820 22:24:41.894845 9094 solver.cpp:361] [0.0] Train net output #0: mbox_loss = inf ( 1 = inf loss) I0820 22:24:41.894863 9094 sgd_solver.cpp:180] [0.0] Iteration 70, lr = 0.001, m = 0.9, lrm = 0.01, wd = 0.0005, gs = 32 I0820 22:24:47.746243 9094 solver.cpp:333] [0.0] Iteration 80 (1.70897 iter/s, 5.85147s/10 iter), loss = -nan I0820 22:24:47.746351 9094 solver.cpp:361] [0.0] Train net output #0: mbox_loss = inf ( 1 = inf loss) I0820 22:24:47.746412 9094 sgd_solver.cpp:180] [0.0] Iteration 80, lr = 0.001, m = 0.9, lrm = 0.01, wd = 0.0005, gs = 32 I0820 22:24:53.682945 9094 solver.cpp:333] [0.0] Iteration 90 (1.68443 iter/s, 5.93671s/10 iter), loss = -nan I0820 22:24:53.682999 9094 solver.cpp:361] [0.0] Train net output #0: mbox_loss = inf ( 1 = inf loss) I0820 22:24:53.683012 9094 sgd_solver.cpp:180] [0.0] Iteration 90, lr = 0.001, m = 0.9, lrm = 0.01, wd = 0.0005, gs = 32 I0820 22:25:00.531950 9094 solver.cpp:333] [0.0] Iteration 100 (1.46006 iter/s, 6.84903s/10 iter), loss = -nan I0820 22:25:00.532097 9094 solver.cpp:361] [0.0] Train net output #0: mbox_loss = inf ( 1 = inf loss) I0820 22:25:00.532110 9094 sgd_solver.cpp:180] [0.0] Iteration 100, lr = 0.001, m = 0.9, lrm = 0.01, wd = 0.0005, gs = 32 I0820 22:25:05.310221 9094 solver.cpp:333] [0.0] Iteration 110 (2.09281 iter/s, 4.77827s/10 iter), loss = -nan I0820 22:25:05.310272 9094 solver.cpp:361] [0.0] Train net output #0: mbox_loss = inf ( 1 = inf loss) I0820 22:25:05.310284 9094 sgd_solver.cpp:180] [0.0] Iteration 110, lr = 0.001, m = 0.9, lrm = 0.01, wd = 0.0005, gs = 32 I0820 22:25:11.620216 9094 solver.cpp:333] [0.0] Iteration 120 (1.58479 iter/s, 6.30998s/10 iter), loss = -nan I0820 22:25:11.620267 9094 solver.cpp:361] [0.0] Train net output #0: mbox_loss = inf ( 1 = inf loss) I0820 22:25:11.620285 9094 sgd_solver.cpp:180] [0.0] Iteration 120, lr = 0.001, 
m = 0.9, lrm = 0.01, wd = 0.0005, gs = 32 I0820 22:25:16.537539 9094 solver.cpp:333] [0.0] Iteration 130 (2.03363 iter/s, 4.91731s/10 iter), loss = -nan I0820 22:25:16.537616 9094 solver.cpp:361] [0.0] Train net output #0: mbox_loss = inf ( 1 = inf loss) I0820 22:25:16.537633 9094 sgd_solver.cpp:180] [0.0] Iteration 130, lr = 0.001, m = 0.9, lrm = 0.01, wd = 0.0005, gs = 32 I0820 22:25:21.898902 9094 solver.cpp:333] [0.0] Iteration 140 (1.8652 iter/s, 5.36136s/10 iter), loss = -nan I0820 22:25:21.898952 9094 solver.cpp:361] [0.0] Train net output #0: mbox_loss = inf ( 1 = inf loss) I0820 22:25:21.898963 9094 sgd_solver.cpp:180] [0.0] Iteration 140, lr = 0.001, m = 0.9, lrm = 0.01, wd = 0.0005, gs = 32 I0820 22:25:27.963239 9094 solver.cpp:333] [0.0] Iteration 150 (1.64898 iter/s, 6.06436s/10 iter), loss = -nan I0820 22:25:27.963289 9094 solver.cpp:361] [0.0] Train net output #0: mbox_loss = inf ( 1 = inf loss) I0820 22:25:27.963300 9094 sgd_solver.cpp:180] [0.0] Iteration 150, lr = 0.001, m = 0.9, lrm = 0.01, wd = 0.0005, gs = 32 I0820 22:25:34.039958 9094 solver.cpp:333] [0.0] Iteration 160 (1.64562 iter/s, 6.07672s/10 iter), loss = -nan I0820 22:25:34.040117 9094 solver.cpp:361] [0.0] Train net output #0: mbox_loss = inf (* 1 = inf loss)


drnikolaev commented 6 years ago

@abdelrahman-gaber gs = 32 means that you are using a fixed loss scale of 32. Please switch to the adaptive one, i.e. replace

global_grad_scale: 32

by

global_grad_scale_adaptive: true
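Conceptually, adaptive (dynamic) loss scaling grows the scale while gradients stay finite and shrinks it on overflow. A rough sketch of the idea follows; the class name and the growth/backoff constants are illustrative, not NVCaffe's actual values.

```python
class AdaptiveGradScaler:
    """Sketch of dynamic loss scaling: raise the scale while gradients
    stay finite, cut it (and skip the step) on overflow."""
    def __init__(self, init_scale=32.0, growth=2.0, backoff=0.5,
                 growth_interval=100):
        self.scale = init_scale
        self.growth = growth
        self.backoff = backoff
        self.growth_interval = growth_interval
        self._good_steps = 0

    def step(self, grads_are_finite):
        """Return True if the weight update should be applied."""
        if grads_are_finite:
            self._good_steps += 1
            if self._good_steps >= self.growth_interval:
                self.scale *= self.growth    # headroom: raise the scale
                self._good_steps = 0
            return True
        self.scale *= self.backoff           # overflow: shrink, skip step
        self._good_steps = 0
        return False
```

For example, starting at scale 32, one overflowing step halves the scale to 16, and 100 consecutive finite steps double it back to 32.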

abdelrahman-gaber commented 6 years ago

@drnikolaev After switching to adaptive scaling I can see gs changing, but the loss is still nan and mbox_loss = inf! Are there other changes I should make?


I0822 20:54:57.256644 14071 solver.cpp:341] [0.0] Iteration 2 (20.2241 s), loss = nan I0822 20:54:57.256696 14071 solver.cpp:361] [0.0] Train net output #0: mbox_loss = inf ( 1 = inf loss) I0822 20:55:01.366313 14071 solver.cpp:333] [0.0] Iteration 10 (1.94662 iter/s, 4.10968s/8 iter), loss = nan I0822 20:55:01.366365 14071 solver.cpp:361] [0.0] Train net output #0: mbox_loss = inf ( 1 = inf loss) I0822 20:55:01.366377 14071 sgd_solver.cpp:180] [0.0] Iteration 10, lr = 0.001, m = 0.9, lrm = 0.01, wd = 0.0005, gs = 6.21 I0822 20:55:06.431057 14071 solver.cpp:333] [0.0] Iteration 20 (1.97443 iter/s, 5.06474s/10 iter), loss = nan I0822 20:55:06.431109 14071 solver.cpp:361] [0.0] Train net output #0: mbox_loss = inf ( 1 = inf loss) I0822 20:55:06.431174 14071 sgd_solver.cpp:180] [0.0] Iteration 20, lr = 0.001, m = 0.9, lrm = 0.01, wd = 0.0005, gs = 5.23 I0822 20:55:12.455972 14071 solver.cpp:333] [0.0] Iteration 30 (1.65977 iter/s, 6.02495s/10 iter), loss = nan I0822 20:55:12.456022 14071 solver.cpp:361] [0.0] Train net output #0: mbox_loss = inf ( 1 = inf loss) I0822 20:55:12.456033 14071 sgd_solver.cpp:180] [0.0] Iteration 30, lr = 0.001, m = 0.9, lrm = 0.01, wd = 0.0005, gs = 7.77 I0822 20:55:20.274615 14071 solver.cpp:333] [0.0] Iteration 40 (1.27899 iter/s, 7.81864s/10 iter), loss = nan I0822 20:55:20.274780 14071 solver.cpp:361] [0.0] Train net output #0: mbox_loss = inf ( 1 = inf loss) I0822 20:55:20.274793 14071 sgd_solver.cpp:180] [0.0] Iteration 40, lr = 0.001, m = 0.9, lrm = 0.01, wd = 0.0005, gs = 5.07 I0822 20:55:25.181753 14071 solver.cpp:333] [0.0] Iteration 50 (2.03785 iter/s, 4.90714s/10 iter), loss = nan I0822 20:55:25.181802 14071 solver.cpp:361] [0.0] Train net output #0: mbox_loss = inf ( 1 = inf loss) I0822 20:55:25.181813 14071 sgd_solver.cpp:180] [0.0] Iteration 50, lr = 0.001, m = 0.9, lrm = 0.01, wd = 0.0005, gs = 7.31 I0822 20:55:32.607586 14071 solver.cpp:333] [0.0] Iteration 60 (1.34664 iter/s, 7.42587s/10 iter), loss = nan I0822 
20:55:32.607695 14071 solver.cpp:361] [0.0] Train net output #0: mbox_loss = inf ( 1 = inf loss) I0822 20:55:32.607738 14071 sgd_solver.cpp:180] [0.0] Iteration 60, lr = 0.001, m = 0.9, lrm = 0.01, wd = 0.0005, gs = 5.75 I0822 20:55:38.331291 14071 solver.cpp:333] [0.0] Iteration 70 (1.74712 iter/s, 5.72372s/10 iter), loss = nan I0822 20:55:38.331346 14071 solver.cpp:361] [0.0] Train net output #0: mbox_loss = inf ( 1 = inf loss) I0822 20:55:38.331362 14071 sgd_solver.cpp:180] [0.0] Iteration 70, lr = 0.001, m = 0.9, lrm = 0.01, wd = 0.0005, gs = 6.3 I0822 20:55:43.483438 14071 solver.cpp:333] [0.0] Iteration 80 (1.94094 iter/s, 5.15215s/10 iter), loss = nan I0822 20:55:43.483516 14071 solver.cpp:361] [0.0] Train net output #0: mbox_loss = inf ( 1 = inf loss) I0822 20:55:43.483533 14071 sgd_solver.cpp:180] [0.0] Iteration 80, lr = 0.001, m = 0.9, lrm = 0.01, wd = 0.0005, gs = 6.04 I0822 20:55:48.604272 14071 solver.cpp:333] [0.0] Iteration 90 (1.9528 iter/s, 5.12086s/10 iter), loss = nan I0822 20:55:48.604329 14071 solver.cpp:361] [0.0] Train net output #0: mbox_loss = inf ( 1 = inf loss) I0822 20:55:48.604343 14071 sgd_solver.cpp:180] [0.0] Iteration 90, lr = 0.001, m = 0.9, lrm = 0.01, wd = 0.0005, gs = 6.09 I0822 20:55:54.375180 14071 solver.cpp:333] [0.0] Iteration 100 (1.73283 iter/s, 5.7709s/10 iter), loss = nan I0822 20:55:54.375332 14071 solver.cpp:361] [0.0] Train net output #0: mbox_loss = inf ( 1 = inf loss) I0822 20:55:54.375350 14071 sgd_solver.cpp:180] [0.0] Iteration 100, lr = 0.001, m = 0.9, lrm = 0.01, wd = 0.0005, gs = 6.53 I0822 20:56:00.085430 14071 solver.cpp:333] [0.0] Iteration 110 (1.75123 iter/s, 5.71028s/10 iter), loss = nan I0822 20:56:00.085479 14071 solver.cpp:361] [0.0] Train net output #0: mbox_loss = inf ( 1 = inf loss) I0822 20:56:00.085491 14071 sgd_solver.cpp:180] [0.0] Iteration 110, lr = 0.001, m = 0.9, lrm = 0.01, wd = 0.0005, gs = 8.23 I0822 20:56:06.544234 14071 solver.cpp:333] [0.0] Iteration 120 (1.54827 iter/s, 6.45881s/10 
iter), loss = nan I0822 20:56:06.544281 14071 solver.cpp:361] [0.0] Train net output #0: mbox_loss = inf ( 1 = inf loss) I0822 20:56:06.544293 14071 sgd_solver.cpp:180] [0.0] Iteration 120, lr = 0.001, m = 0.9, lrm = 0.01, wd = 0.0005, gs = 5.87 I0822 20:56:11.224117 14071 solver.cpp:333] [0.0] Iteration 130 (2.13681 iter/s, 4.67988s/10 iter), loss = nan I0822 20:56:11.224186 14071 solver.cpp:361] [0.0] Train net output #0: mbox_loss = inf ( 1 = inf loss) I0822 20:56:11.224206 14071 sgd_solver.cpp:180] [0.0] Iteration 130, lr = 0.001, m = 0.9, lrm = 0.01, wd = 0.0005, gs = 6.79 I0822 20:56:16.546043 14071 solver.cpp:333] [0.0] Iteration 140 (1.87901 iter/s, 5.32194s/10 iter), loss = nan I0822 20:56:16.546113 14071 solver.cpp:361] [0.0] Train net output #0: mbox_loss = inf ( 1 = inf loss) I0822 20:56:16.546129 14071 sgd_solver.cpp:180] [0.0] Iteration 140, lr = 0.001, m = 0.9, lrm = 0.01, wd = 0.0005, gs = 6.35 I0822 20:56:23.026548 14071 solver.cpp:333] [0.0] Iteration 150 (1.54308 iter/s, 6.48054s/10 iter), loss = nan I0822 20:56:23.026599 14071 solver.cpp:361] [0.0] Train net output #0: mbox_loss = inf ( 1 = inf loss) I0822 20:56:23.026612 14071 sgd_solver.cpp:180] [0.0] Iteration 150, lr = 0.001, m = 0.9, lrm = 0.01, wd = 0.0005, gs = 7.57 I0822 20:56:29.065845 14071 solver.cpp:333] [0.0] Iteration 160 (1.65581 iter/s, 6.03932s/10 iter), loss = nan I0822 20:56:29.066016 14071 solver.cpp:361] [0.0] Train net output #0: mbox_loss = inf ( 1 = inf loss) I0822 20:56:29.066031 14071 sgd_solver.cpp:180] [0.0] Iteration 160, lr = 0.001, m = 0.9, lrm = 0.01, wd = 0.0005, gs = 7.05 I0822 20:56:35.760040 14071 solver.cpp:333] [0.0] Iteration 170 (1.49383 iter/s, 6.69419s/10 iter), loss = nan I0822 20:56:35.760107 14071 solver.cpp:361] [0.0] Train net output #0: mbox_loss = inf ( 1 = inf loss) I0822 20:56:35.760123 14071 sgd_solver.cpp:180] [0.0] Iteration 170, lr = 0.001, m = 0.9, lrm = 0.01, wd = 0.0005, gs = 7.38 I0822 20:56:41.882964 14071 solver.cpp:333] [0.0] 
Iteration 180 (1.63321 iter/s, 6.12291s/10 iter), loss = nan I0822 20:56:41.883016 14071 solver.cpp:361] [0.0] Train net output #0: mbox_loss = inf ( 1 = inf loss) I0822 20:56:41.883030 14071 sgd_solver.cpp:180] [0.0] Iteration 180, lr = 0.001, m = 0.9, lrm = 0.01, wd = 0.0005, gs = 7.19 I0822 20:56:51.332212 14071 solver.cpp:333] [0.0] Iteration 190 (1.05828 iter/s, 9.44926s/10 iter), loss = nan I0822 20:56:51.332365 14071 solver.cpp:361] [0.0] Train net output #0: mbox_loss = inf ( 1 = inf loss) I0822 20:56:51.332435 14071 sgd_solver.cpp:180] [0.0] Iteration 190, lr = 0.001, m = 0.9, lrm = 0.01, wd = 0.0005, gs = 7.35 I0822 20:56:56.010116 14071 solver.cpp:333] [0.0] Iteration 200 (2.13769 iter/s, 4.67794s/10 iter), loss = nan I0822 20:56:56.010179 14071 solver.cpp:361] [0.0] Train net output #0: mbox_loss = inf ( 1 = inf loss) I0822 20:56:56.010195 14071 sgd_solver.cpp:180] [0.0] Iteration 200, lr = 0.001, m = 0.9, lrm = 0.01, wd = 0.0005, gs = 8.12 I0822 20:57:02.142762 14071 solver.cpp:333] [0.0] Iteration 210 (1.63062 iter/s, 6.13264s/10 iter), loss = nan I0822 20:57:02.142940 14071 solver.cpp:361] [0.0] Train net output #0: mbox_loss = inf ( 1 = inf loss) I0822 20:57:02.142959 14071 sgd_solver.cpp:180] [0.0] Iteration 210, lr = 0.001, m = 0.9, lrm = 0.01, wd = 0.0005, gs = 7.66 I0822 20:57:07.456576 14071 solver.cpp:333] [0.0] Iteration 220 (1.88188 iter/s, 5.31384s/10 iter), loss = nan I0822 20:57:07.456625 14071 solver.cpp:361] [0.0] Train net output #0: mbox_loss = inf ( 1 = inf loss) I0822 20:57:07.456638 14071 sgd_solver.cpp:180] [0.0] Iteration 220, lr = 0.001, m = 0.9, lrm = 0.01, wd = 0.0005, gs = 7.09 I0822 20:57:12.558775 14071 solver.cpp:333] [0.0] Iteration 230 (1.95995 iter/s, 5.10218s/10 iter), loss = nan I0822 20:57:12.558893 14071 solver.cpp:361] [0.0] Train net output #0: mbox_loss = inf ( 1 = inf loss) I0822 20:57:12.558938 14071 sgd_solver.cpp:180] [0.0] Iteration 230, lr = 0.001, m = 0.9, lrm = 0.01, wd = 0.0005, gs = 8.06 I0822 
20:57:17.807420 14071 solver.cpp:333] [0.0] Iteration 240 (1.90524 iter/s, 5.24867s/10 iter), loss = nan I0822 20:57:17.807469 14071 solver.cpp:361] [0.0] Train net output #0: mbox_loss = inf (* 1 = inf loss) I0822 20:57:17.807480 14071 sgd_solver.cpp:180] [0.0] Iteration 240, lr = 0.001, m = 0.9, lrm = 0.01, wd = 0.0005, gs = 5.17


drnikolaev commented 6 years ago

@abdelrahman-gaber It seems the problem is in fine-tuning; training from scratch works fine here. May I have the VGG_ILSVRC_16_layers_fc_reduced.caffemodel file, please? Also, does training from scratch work on your side?

abdelrahman-gaber commented 6 years ago

@drnikolaev This is the VGG pretrained model I am using: https://gist.github.com/weiliu89/2ed6e13bfd5b57cf81d6

And yes, training from scratch works fine.

drnikolaev commented 6 years ago

@abdelrahman-gaber I reproduced the issue and managed to avoid it by changing back the following:

base_lr: 0.0000039063 
iter_size: 1

Then

layer {
  name: "fc6"
  type: "Convolution"
  bottom: "pool5"
  top: "fc6"
  param {
    lr_mult: 1.0
    decay_mult: 1.0
  }
  param {
    lr_mult: 2.0
    decay_mult: 0.0
  }
  convolution_param {
    engine: CAFFE
    num_output: 1024
    pad: 6
    kernel_size: 3
    weight_filler {
      type: "xavier"
    }
    bias_filler {
      type: "constant"
      value: 0.0
    }
    dilation: 6
  }
}
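
For reference, NVCaffe selects precision per net through prototxt fields rather than a compile-time flag. A sketch of a net header, assuming the `default_forward_type`/`default_backward_type` and `*_math` fields present in NVCaffe 0.17 (verify against the `caffe.proto` in your checkout):

```protobuf
# Net-level precision settings (assumed NVCaffe 0.17 field names --
# check caffe.proto for your version):
default_forward_type:  FLOAT16   # store blobs in fp16
default_backward_type: FLOAT16
default_forward_math:  FLOAT     # but accumulate the math in fp32
default_backward_math: FLOAT

# layers follow unchanged ...
```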

Also, it's good to set test.prototxt to fp32, because it's packed with CPU-only layers. Result:

I0823 14:53:45.023775 17928 solver.cpp:341]     [0.0] Iteration 2 (34.2498 s), loss = 8.96872
I0823 14:53:45.023818 17928 solver.cpp:361]     [0.0]     Train net output #0: mbox_loss = 8.81912 (* 1 = 8.81912 loss)
I0823 14:53:51.138836 17928 solver.cpp:333]     [0.0] Iteration 10 (1.30825 iter/s, 6.11505s/8 iter), loss = 9.01866
I0823 14:53:51.138877 17928 solver.cpp:361]     [0.0]     Train net output #0: mbox_loss = 8.87378 (* 1 = 8.87378 loss)
I0823 14:53:51.138895 17928 sgd_solver.cpp:180] [0.0] Iteration 10, lr = 3.9063e-06, m = 0.9, lrm = 3.9063e-05, wd = 0.0005, gs = 2.8
I0823 14:53:57.892297 17928 solver.cpp:333]     [0.0] Iteration 20 (1.48072 iter/s, 6.75346s/10 iter), loss = 8.17407
I0823 14:53:57.892340 17928 solver.cpp:361]     [0.0]     Train net output #0: mbox_loss = 8.47063 (* 1 = 8.47063 loss)
I0823 14:53:57.892359 17928 sgd_solver.cpp:180] [0.0] Iteration 20, lr = 3.9063e-06, m = 0.9, lrm = 3.9063e-05, wd = 0.0005, gs = 3.9
I0823 14:54:08.990861 17928 solver.cpp:333]     [0.0] Iteration 30 (0.901015 iter/s, 11.0986s/10 iter), loss = 8.20302
I0823 14:54:08.991082 17928 solver.cpp:361]     [0.0]     Train net output #0: mbox_loss = 8.21322 (* 1 = 8.21322 loss)
I0823 14:54:08.991102 17928 sgd_solver.cpp:180] [0.0] Iteration 30, lr = 3.9063e-06, m = 0.9, lrm = 3.9063e-05, wd = 0.0005, gs = 2.75
I0823 14:54:13.474453 17928 solver.cpp:333]     [0.0] Iteration 40 (2.23171 iter/s, 4.48087s/10 iter), loss = 8.28199
I0823 14:54:13.475052 17928 solver.cpp:361]     [0.0]     Train net output #0: mbox_loss = 7.97637 (* 1 = 7.97637 loss)
I0823 14:54:13.475071 17928 sgd_solver.cpp:180] [0.0] Iteration 40, lr = 3.9063e-06, m = 0.9, lrm = 3.9063e-05, wd = 0.0005, gs = 1.8
I0823 14:54:17.909873 17928 solver.cpp:333]     [0.0] Iteration 50 (2.25321 iter/s, 4.43812s/10 iter), loss = 8.12758
I0823 14:54:17.909909 17928 solver.cpp:361]     [0.0]     Train net output #0: mbox_loss = 8.0902 (* 1 = 8.0902 loss)
I0823 14:54:17.909922 17928 sgd_solver.cpp:180] [0.0] Iteration 50, lr = 3.9063e-06, m = 0.9, lrm = 3.9063e-05, wd = 0.0005, gs = 3.71
I0823 14:54:24.405911 17928 solver.cpp:333]     [0.0] Iteration 60 (1.5394 iter/s, 6.49604s/10 iter), loss = 8.13621
I0823 14:54:24.405944 17928 solver.cpp:361]     [0.0]     Train net output #0: mbox_loss = 7.44158 (* 1 = 7.44158 loss)
I0823 14:54:24.405957 17928 sgd_solver.cpp:180] [0.0] Iteration 60, lr = 3.9063e-06, m = 0.9, lrm = 3.9063e-05, wd = 0.0005, gs = 4.01
I0823 14:54:28.718925 17928 solver.cpp:333]     [0.0] Iteration 70 (2.31857 iter/s, 4.313s/10 iter), loss = 8.24258
I0823 14:54:28.718972 17928 solver.cpp:361]     [0.0]     Train net output #0: mbox_loss = 7.98699 (* 1 = 7.98699 loss)
I0823 14:54:28.718999 17928 sgd_solver.cpp:180] [0.0] Iteration 70, lr = 3.9063e-06, m = 0.9, lrm = 3.9063e-05, wd = 0.0005, gs = 5.99
I0823 14:54:33.036012 17928 solver.cpp:333]     [0.0] Iteration 80 (2.31637 iter/s, 4.3171s/10 iter), loss = 7.90585
I0823 14:54:33.036046 17928 solver.cpp:361]     [0.0]     Train net output #0: mbox_loss = 7.29813 (* 1 = 7.29813 loss)
I0823 14:54:33.036059 17928 sgd_solver.cpp:180] [0.0] Iteration 80, lr = 3.9063e-06, m = 0.9, lrm = 3.9063e-05, wd = 0.0005, gs = 3.54

abdelrahman-gaber commented 6 years ago

@drnikolaev I did the modifications you mentioned, but I am still facing the same problem: the loss is -nan. These are the files I am using for training; I think it would be good if you could have a look at them: https://drive.google.com/open?id=13hpIKMnKcVimTY84xzqeoTGSFWBnL18e

Thanks.

drnikolaev commented 5 years ago

Better fp16 support is coming in a new release; a preview is available here: https://github.com/drnikolaev/caffe