FishWoWater opened this issue 5 years ago
The OP was able to fix a "Loss is NaN" issue (#182) by lowering the learning rate. Try it and check whether this works for you as well.
Thanks a lot, but I have tried lowering my learning rate to 1e-8 and it still does not work.
In your post, you have "loss": "17451542731264.000000", and all the other losses are high as well. Is it the same when you set the learning rate to 1e-8?
The output when I lower the lr to 1e-8 is as follows:
json_stats: {"accuracy_cls": "0.000000", "eta": "19:23:53", "iter": 0, "loss": "47156856092672.000000", "loss_bbox": "1855492718592.000000", "loss_cls": "40212032913408.000000", "loss_rpn_bbox_fpn2": "0.000000", "loss_rpn_bbox_fpn3": "0.000000", "loss_rpn_bbox_fpn4": "1970523504640.000000", "loss_rpn_bbox_fpn5": "225773981696.000000", "loss_rpn_bbox_fpn6": "0.000000", "loss_rpn_cls_fpn2": "2342287245312.000000", "loss_rpn_cls_fpn3": "31649947648.000000", "loss_rpn_cls_fpn4": "469987680256.000000", "loss_rpn_cls_fpn5": "49108101120.000000", "loss_rpn_cls_fpn6": "0.000000", "lr": "0.000000", "mb_qsize": 64, "mem": 6056, "time": "6.983366"}
/home/slashgns/detect/detectron/detectron/utils/boxes.py:176: RuntimeWarning: overflow encountered in multiply
pred_ctr_x = dx * widths[:, np.newaxis] + ctr_x[:, np.newaxis]
/home/slashgns/detect/detectron/detectron/utils/boxes.py:177: RuntimeWarning: overflow encountered in multiply
pred_ctr_y = dy * heights[:, np.newaxis] + ctr_y[:, np.newaxis]
CRITICAL train.py: 101: Loss is NaN
Emmm, when I updated the Detectron version, it printed extra information saying that dx overflows. I found that the bbox_deltas are quite large (1e+12 or more). I think the problem is in the RPN, but I cannot track down the bug.
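For reference, the decoding those warnings point at is the standard Faster R-CNN box transform; a simplified sketch (not the exact Detectron code, which handles several classes per row and clips dw/dh via BBOX_XFORM_CLIP) of why deltas around 1e+12 overflow right away:

import numpy as np

def decode_boxes(boxes, deltas):
    # boxes:  (N, 4) as x1, y1, x2, y2;  deltas: (N, 4) as dx, dy, dw, dh
    widths = boxes[:, 2] - boxes[:, 0] + 1.0
    heights = boxes[:, 3] - boxes[:, 1] + 1.0
    ctr_x = boxes[:, 0] + 0.5 * widths
    ctr_y = boxes[:, 1] + 0.5 * heights
    dx, dy, dw, dh = deltas[:, 0], deltas[:, 1], deltas[:, 2], deltas[:, 3]
    # These two lines correspond to the RuntimeWarnings above: with dx ~ 1e+12
    # the multiply already overflows float32.
    pred_ctr_x = dx * widths + ctr_x
    pred_ctr_y = dy * heights + ctr_y
    # The width/height branch goes through exp(), so even moderately large
    # dw/dh explode; this is why Detectron clips them before decoding.
    pred_w = np.exp(dw) * widths
    pred_h = np.exp(dh) * heights
    return np.stack([pred_ctr_x - 0.5 * pred_w, pred_ctr_y - 0.5 * pred_h,
                     pred_ctr_x + 0.5 * pred_w, pred_ctr_y + 0.5 * pred_h], axis=1)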
Possibly it has something to do with the input normalisation that might be missing? I am facing problems as well, and it doesn't seem like inputs are rescaled to [-1,1], but are in range [-128,128] instead.
Oh, yeah.... The normalization of my pretrained model ([-1, 1]) is inconsistent with the normalization used by Detectron ([-128, 128]). I will retrain the model and see whether that solves the problem. Thank you!
Great to hear that this has indeed been the reason! Note that I do not understand why FAIR uses [-128, 128] inputs at all, because this leads to activations that are on average 128x larger. As a result, training of the biases is very slow, since they only carry a weight of 1. I would recommend adjusting prep_im_for_blob() in detectron/utils/blob.py to divide im by 128 before returning (or create a field in core/config.py to toggle it). Your model would not have to be retrained; it should work then.
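A minimal sketch of the idea (assuming prep_im_for_blob() only subtracts the pixel means before returning; the divide_by_128 switch is a hypothetical toggle, not an existing Detectron config field):

import numpy as np

def prep_im_for_blob_normalized(im, pixel_means, divide_by_128=True):
    # Detectron's default preprocessing only subtracts the per-channel means,
    # leaving pixel values roughly in [-128, 128].
    im = im.astype(np.float32, copy=False)
    im -= pixel_means
    # Hypothetical extra step: rescale to roughly [-1, 1] so a model
    # pretrained on [-1, 1] inputs sees the range it expects.
    if divide_by_128:
        im /= 128.0
    return im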
Emmmmm.... I still cannot solve the problem. I have tried the following two ways, but both failed:
Aah yes, that is also fine. I was thinking of:
im -= means
im /= 128
which is about equivalent to:
im /= 255
im -= means / 255
(the two only differ by a constant scale factor of 255/128, roughly 2).
Hmmm, not sure I can help you any further. What is the loss at iteration 0 when normalizing? Maybe you can get insight by printing out some blobs. I'm using the following blob summary function for this:
import numpy as np
from caffe2.python import workspace

def blob_summary(blobs):
    # Print shape, basic statistics and rough percentiles for each blob name.
    print()
    for blob in blobs:
        b = workspace.FetchBlob('gpu_0/' + blob)
        shape = b.shape
        b = np.array(b.astype(float)).reshape(-1)
        order = np.argsort(b)
        # Sample ~10 evenly spaced percentiles (5, 15, ..., 95).
        step = max(1, len(b) // 10)
        idxs = np.arange(step // 2, len(b), step)
        percentiles = b[order[idxs]]
        hi = b[order[-1]]
        lo = b[order[0]]
        abs_mean, mean, std, zeros = [
            np.format_float_scientific(v, precision=2)
            for v in [np.abs(b).mean(), b.mean(), b.std(), sum(b == 0.0) / len(b)]
        ]
        print(" {} {} ({}): abs mean:{} mean:{} std:{} zeros:{} \n"
              "min-5-15-...-85-95-max percentiles: {} ".format(
                  blob, shape, len(b), abs_mean, mean, std, zeros,
                  ' '.join([np.format_float_scientific(p, precision=2)
                            for p in [lo] + list(percentiles) + [hi]])))
    print()
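For example, called right after a forward pass (the blob names below are only illustrative; pass whatever blob names your net actually defines):

# Hypothetical blob names; adjust to the blobs present in your workspace.
blob_summary(['data', 'rpn_cls_logits_fpn2', 'rpn_bbox_pred_fpn2'])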
OK, I will try to print some blobs later; actually I am pretty new to caffe2 :-)
You are so kind and thanks for all your help!
I am using my custom pretrained model to train with the e2e_faster_rcnn_R-101-FPN_2x config, but the loss went to NaN after the second forward pass. However, when I used the official pretrained model from ImageNet, nothing went wrong.

Expected results
Training should proceed as it does when using the official pretrained model.
Actual results
Here is part of the log. After comparing it with the output produced when training with the official weights file, I found it abnormal that the number of rois below is quite small. I am wondering whether something is going wrong.

Detailed steps to reproduce
I wrote the script that converts my classification model (trained from scratch and successfully converged) into the Detectron weights format myself; it is as follows:
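The script itself is not reproduced in this thread; as a rough sketch of what such a conversion can look like (assuming the source checkpoint is a plain dict of numpy arrays, and with an entirely hypothetical name mapping that has to match the backbone's blob names), one might write:

import pickle
import numpy as np

def convert_to_detectron(src_weights, name_map, out_path):
    # src_weights: dict mapping your model's parameter names to numpy arrays.
    # name_map:    dict mapping your names to the Detectron blob names the
    #              backbone expects (hypothetical; must match your net).
    blobs = {}
    for src_name, det_name in name_map.items():
        blobs[det_name] = np.array(src_weights[src_name], dtype=np.float32)
    # Detectron loads a pickled dict and looks weights up by blob name
    # (a top-level 'blobs' key is also accepted).
    with open(out_path, 'wb') as f:
        pickle.dump({'blobs': blobs}, f, pickle.HIGHEST_PROTOCOL)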
I have tried setting a much smaller learning rate, but it did not help. Does anybody have an idea? Thanks in advance.
System information
PYTHONPATH environment variable:
python --version output: 3.6.8, anaconda