Open · JulianoLagana opened this issue 7 years ago
Even using a learning rate 100x smaller than the default still gives the same error, just further into the optimization (around iteration 2370).
Hi @JulianoLagana, did you manage to solve the problem? I tried clipping the inputs of all the np.exp expressions to a fixed range, but training still fails with signal 8: SIGFPE (floating point exception):
x = np.clip(x, -10, 10)  # clamp the exponent before calling np.exp
np.exp(x)
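For example, where the bbox code computes something like pred_w = np.exp(dw) * widths[:, np.newaxis], the clamp I tried looked roughly like this (the helper name and the bound of 4.0 are my own arbitrary choices, not anything from the repo):

import numpy as np

DELTA_CLIP = 4.0  # arbitrary bound I picked; keeps np.exp far from float overflow

def clipped_pred_w(dw, widths):
    # dw: (N, K) box-regression deltas straight from the network
    # widths: (N,) proposal/anchor widths
    dw = np.clip(dw, -DELTA_CLIP, DELTA_CLIP)
    return np.exp(dw) * widths[:, np.newaxis]

This silences the overflow warnings from np.exp, but as noted it doesn't stop the eventual SIGFPE.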
Hi @leduckhc. No, unfortunately I didn't. These and other problems with this implementation led me to a different research direction. I hope you manage to solve it, though.
Hi @JulianoLagana. I just figured out that the weights of conv5_3
and lower (conv5_{2,1}, conv4_{1,2,3}, etc.) contain NaNs. So the reason might be a bad initialization/loading of the network from the caffemodel. I am going to examine it in more depth.
I inspected the blob values and layer weights with:
print {k: v.data for k, v in self.solver.net.blobs.items()}
print {k: v[0].data for k, v in self.solver.net.params.items()}
# v[0] is for weights, v[1] for biases
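To narrow down which layers are affected, a small loop over the same params dict flags everything that contains NaNs (a diagnostic sketch only; it relies on self.solver, so run it in the same place as the prints above):

import numpy as np

# v[0] holds the weights, v[1] the biases (when the layer has them)
for name, params in self.solver.net.params.items():
    for i, p in enumerate(params):
        if np.isnan(p.data).any():
            print('%s param[%d] contains NaNs' % (name, i))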
I see, thanks for sharing it! If you do find a workaround for this issue, I'd be very interested.
Check #53 for a solution.
Freezing layers is not a solution.
Hi everyone.
I'm trying to train the default VGG16 implementation of MNC with the command
./experiments/scripts/mnc_5stage.sh 0 VGG16
However, after some iterations I run into an overflow error:
Error messages
/home/juliano/MNC/tools/../lib/pylayer/stage_bridge_layer.py:107: RuntimeWarning: overflow encountered in exp
  bottom[0].diff[i, 3] = dfdw[ind] * (delta_x + np.exp(delta_w))
/home/juliano/MNC/tools/../lib/pylayer/proposal_layer.py:213: RuntimeWarning: invalid value encountered in multiply
  dfdxc * anchor_w * weight_out_proposal * weight_out_anchor
/home/juliano/MNC/tools/../lib/pylayer/proposal_layer.py:217: RuntimeWarning: invalid value encountered in multiply
  dfdw * np.exp(bottom[1].data[0, 4*c+2, h, w]) * anchor_w * weight_out_proposal * weight_out_anchor
/home/juliano/MNC/tools/../lib/pylayer/stage_bridge_layer.py:107: RuntimeWarning: invalid value encountered in float_scalars
  bottom[0].diff[i, 3] = dfdw[ind] * (delta_x + np.exp(delta_w))
/home/juliano/MNC/tools/../lib/pylayer/proposal_layer.py:183: RuntimeWarning: invalid value encountered in greater
  top_non_zero_ind = np.unique(np.where(abs(top[0].diff[:, :]) > 0)[0])
/home/juliano/MNC/tools/../lib/transform/bbox_transform.py:86: RuntimeWarning: overflow encountered in exp
  pred_w = np.exp(dw) * widths[:, np.newaxis]
/home/juliano/MNC/tools/../lib/transform/bbox_transform.py:129: RuntimeWarning: invalid value encountered in greater_equal
  keep = np.where((ws >= min_size) & (hs >= min_size))[0]
./experiments/scripts/mnc_5stage.sh: line 35: 22873 Floating point exception(core dumped) ./tools/train_net.py --gpu ${GPU_ID} --solver models/${NET}/mnc_5stage/solver.prototxt --weights ${NET_INIT} --imdb ${DATASET_TRAIN} --iters ${ITERS} --cfg experiments/cfgs/${NET}/mnc_5stage.yml ${EXTRA_ARGS}
I saw in issue #22 that user @brisker experienced the same error when trying to train MNC on his own dataset. The advice given there was to lower the learning rate. Lowering it also helped in my case, but even at 1/10th of the original learning rate the same problem occurs, only later in the training process. User @souryuu mentioned that he needed a learning rate 100x smaller to avoid the problem, which led to poorer performance of the resulting net (possibly because he trained for the same number of iterations, not 100 times longer).
Was anyone able to run the training with the default learning rate provided by the creators without running into overflow problems? I'm simply trying to train the default implementation of the network on the default dataset, so I'd expect the default learning rate to work, no?
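Also, is there a recommended way to debug this? To at least get a Python traceback at the first overflow instead of a stream of warnings followed by the crash, I was planning to make numpy raise on those conditions (purely a debugging aid, not a fix):

import numpy as np

# turn the overflow/invalid RuntimeWarnings shown above into FloatingPointError,
# so the first bad computation fails loudly with a traceback instead of
# silently propagating NaNs through the later layers
np.seterr(over='raise', invalid='raise')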