daijifeng001 / MNC

Instance-aware Semantic Segmentation via Multi-task Network Cascades

Overflow occurs when training MNC with the VGG16 net #41

Open JulianoLagana opened 7 years ago

JulianoLagana commented 7 years ago

Hi everyone.

I'm trying to train the default VGG16 implementation of MNC with the command ./experiments/scripts/mnc_5stage.sh 0 VGG16

However, after some iterations I run into an overflow error:

Error messages

/home/juliano/MNC/tools/../lib/pylayer/stage_bridge_layer.py:107: RuntimeWarning: overflow encountered in exp
  bottom[0].diff[i, 3] = dfdw[ind] * (delta_x + np.exp(delta_w))
/home/juliano/MNC/tools/../lib/pylayer/proposal_layer.py:213: RuntimeWarning: invalid value encountered in multiply
  dfdxc * anchor_w * weight_out_proposal * weight_out_anchor
/home/juliano/MNC/tools/../lib/pylayer/proposal_layer.py:217: RuntimeWarning: invalid value encountered in multiply
  dfdw * np.exp(bottom[1].data[0, 4*c+2, h, w]) * anchor_w * weight_out_proposal * weight_out_anchor
/home/juliano/MNC/tools/../lib/pylayer/stage_bridge_layer.py:107: RuntimeWarning: invalid value encountered in float_scalars
  bottom[0].diff[i, 3] = dfdw[ind] * (delta_x + np.exp(delta_w))
/home/juliano/MNC/tools/../lib/pylayer/proposal_layer.py:183: RuntimeWarning: invalid value encountered in greater
  top_non_zero_ind = np.unique(np.where(abs(top[0].diff[:, :]) > 0)[0])
/home/juliano/MNC/tools/../lib/transform/bbox_transform.py:86: RuntimeWarning: overflow encountered in exp
  pred_w = np.exp(dw) * widths[:, np.newaxis]
/home/juliano/MNC/tools/../lib/transform/bbox_transform.py:129: RuntimeWarning: invalid value encountered in greater_equal
  keep = np.where((ws >= min_size) & (hs >= min_size))[0]
./experiments/scripts/mnc_5stage.sh: line 35: 22873 Floating point exception(core dumped) ./tools/train_net.py --gpu ${GPU_ID} --solver models/${NET}/mnc_5stage/solver.prototxt --weights ${NET_INIT} --imdb ${DATASET_TRAIN} --iters ${ITERS} --cfg experiments/cfgs/${NET}/mnc_5stage.yml ${EXTRA_ARGS}


I saw in issue #22 that user @brisker experienced the same error when trying to train MNC on his own dataset. The advice given there was to lower the learning rate. Lowering it also helped in my case, but even at 1/10th of the original learning rate the same problem occurs, only later in the training process. User @souryuu mentioned that he needed a learning rate 100 times smaller to avoid the problem, which led to poorer performance of the final net (possibly because he trained for the same number of iterations, not 100 times longer).

Has anyone been able to run the training with the default learning rate provided by the authors without running into overflow problems? I'm simply trying to train the default implementation of the network on the default dataset, so I'd expect the default learning rate to work, no?
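(A minimal diagnostic sketch, not from the original post: placing this near the top of tools/train_net.py makes numpy raise an exception on the first overflow or invalid operation instead of letting NaNs propagate until the crash, so the traceback points at the first offending expression.)

import numpy as np

# Fail fast: raise on the first overflow or invalid (NaN-producing) operation
# instead of only emitting RuntimeWarnings and crashing much later.
np.seterr(over='raise', invalid='raise')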

JulianoLagana commented 7 years ago

Even using a learning rate 100x smaller than the default one still gives the same error (but now even further into the optimization, around iteration 2370).

leduckhc commented 7 years ago

Hi @JulianoLagana, did you manage to solve the problem? I tried clipping the arguments of all the np.exp expressions to a bounded range, but training still fails with signal 8: SIGFPE (floating point exception):

# Clip the exponent to a bounded range before exponentiating
x = np.clip(x, -10, 10)
np.exp(x)
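(Not from this thread, just a sketch of a related workaround: later Faster R-CNN derived codebases clamp the predicted width/height deltas before exponentiating in bbox_transform, so a diverging regression output cannot overflow np.exp. The helper name safe_exp_scale and its placement are illustrative.)

import numpy as np

# Largest log-space delta the regressor may predict; np.log(1000. / 16.)
# is the clamp value commonly used in Faster R-CNN derived code.
BBOX_XFORM_CLIP = np.log(1000. / 16.)

def safe_exp_scale(deltas, sizes):
    # exp(delta) * size, with delta clamped so np.exp cannot overflow
    deltas = np.minimum(deltas, BBOX_XFORM_CLIP)
    return np.exp(deltas) * sizes

# e.g. in lib/transform/bbox_transform.py:
#   pred_w = safe_exp_scale(dw, widths[:, np.newaxis])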
JulianoLagana commented 7 years ago

Hi @leduckhc. No, unfortunately I didn't. These and other problems with this implementation led me to pursue a different research direction. I hope you manage to solve it, though.

leduckhc commented 7 years ago

Hi @JulianoLagana. I just figured out that the weights of conv5_3 and the layers below it (conv5_{2,1}, conv4_{1,2,3}, etc.) contain NaNs. So the cause might be bad initialization/loading of the network from the caffemodel. I'm going to examine it in more depth.

I inspected the blob values and layer weights with:

print {k: v.data for k, v in self.solver.net.blobs.items()}
print {k: v[0].data for k, v in self.solver.net.params.items()}
# v[0] is for weights, v[1] for biases
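(A follow-up sketch, not from the thread: assuming the same pycaffe solver object as above, this narrows the search down to the parameter blobs that are already non-finite.)

import numpy as np

# Report every layer whose weight blob already contains NaN or inf values
for name, params in self.solver.net.params.items():
    weights = params[0].data  # params[1] holds the biases
    if not np.all(np.isfinite(weights)):
        print 'non-finite weights in layer:', name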
JulianoLagana commented 7 years ago

I see, thanks for sharing it! If you do find a workaround for this issue, I'd be very interested.


leduckhc commented 7 years ago

Check #53 for a solution.

feichtenhofer commented 7 years ago

Freezing layers is not a solution.