facebookresearch / Detectron

FAIR's research platform for object detection research, implementing popular algorithms like Mask R-CNN and RetinaNet.

bbox loss is 0 at every iteration #205

Closed · soumenms2015 closed this issue 6 years ago

soumenms2015 commented 6 years ago

Hello,

I am getting a bbox loss of 0 at every iteration. Here are the stats from one iteration:

```
json_stats: {"eta": "3 days, 6:51:17", "fl_fpn3": 0.000000, "fl_fpn4": 0.000000, "fl_fpn5": 0.000000, "fl_fpn6": 0.000000, "fl_fpn7": -58069.230469, "iter": 20, "loss": -58069.230469, "lr": 0.000360, "mb_qsize": 64, "mem": 12239, "retnet_bg_num": 6678044.000000, "retnet_fg_num": 395.500000, "retnet_loss_bbox_fpn3": 0.000000, "retnet_loss_bbox_fpn4": 0.000000, "retnet_loss_bbox_fpn5": 0.000000, "retnet_loss_bbox_fpn6": 0.000000, "retnet_loss_bbox_fpn7": 0.000000, "time": 1.577270}
```

Any idea how to resolve this issue?
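For anyone debugging a similar run: these `json_stats` lines can be pulled out of the training log and inspected programmatically. A minimal standalone sketch (the helper and the log path are illustrative, not part of Detectron):

```python
from __future__ import print_function
import json
import re

def parse_json_stats(log_path):
    """Yield one dict per 'json_stats: {...}' line in a Detectron log."""
    pattern = re.compile(r'json_stats: (\{.*\})')
    with open(log_path) as f:
        for line in f:
            m = pattern.search(line)
            if m:
                # Python's json module accepts the bare NaN tokens that
                # Detectron emits once a loss goes bad.
                yield json.loads(m.group(1))

for stats in parse_json_stats('train_log.txt'):  # hypothetical path
    # In a healthy RetinaNet run the retnet_loss_bbox_fpn* terms should be
    # non-zero whenever there are foreground anchors (retnet_fg_num > 0).
    bbox = [v for k, v in sorted(stats.items())
            if k.startswith('retnet_loss_bbox')]
    print(stats['iter'], stats['loss'], bbox)
```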

kampelmuehler commented 6 years ago

This is going to be hard to give any educated comment on unless you provide more information about what you are trying to achieve.

soumenms2015 commented 6 years ago

I am trying to run RetinaNet / Mask R-CNN on my own dataset. As you can see, the bbox_loss is 0 all the time. I tried changing the loss smoothing to an average instead of median filtering, as @rbgirshick suggested in issue #67. I don't know where it goes wrong.
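For context on the smoothing point: the logged loss is a median-filtered value over a window of recent iterations (this is what later produces the "Invalid value encountered in median" warning), and a single NaN in the window poisons the smoothed value whether a median or a mean is used. A quick standalone NumPy illustration:

```python
import numpy as np

# A smoothing window of recent per-iteration losses, with one bad value.
window = np.array([0.9, 0.85, float('nan'), 0.8])

print(np.median(window))     # nan; older NumPy versions also emit
                             # "RuntimeWarning: Invalid value encountered in median"
print(np.mean(window))       # nan as well; averaging does not fix NaNs
print(np.nanmedian(window))  # 0.85; ignores the NaN but only hides the real bug
```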

kampelmuehler commented 6 years ago

Try training on COCO first, as a reference. If everything works out fine there, then you'll need to revisit your dataset generation; it most likely means something is off with the boxes/masks in your dataset.
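One cheap way to act on this advice is to validate the annotations before training. A minimal sanity-check sketch for COCO-style JSON (the file path is a placeholder); degenerate or out-of-bounds boxes are a common cause of broken box losses:

```python
from __future__ import print_function
import json

with open('annotations/instances_train.json') as f:  # hypothetical path
    coco = json.load(f)

images = {img['id']: img for img in coco['images']}

for ann in coco['annotations']:
    x, y, w, h = ann['bbox']  # COCO boxes are [x, y, width, height]
    img = images[ann['image_id']]
    if w <= 0 or h <= 0:
        print('degenerate box:', ann['id'], ann['bbox'])
    if x < 0 or y < 0 or x + w > img['width'] or y + h > img['height']:
        print('box outside image:', ann['id'], ann['bbox'])
```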

soumenms2015 commented 6 years ago

Thanks! Let me try with COCO, then, and see if it works. I cross-checked my dataset generation and it seems okay, so I am not sure whether the dataset has any issue. Will update. Thanks a lot for your suggestion!

soumenms2015 commented 6 years ago

I have checked with the COCO dataset as well, and it gives the same bbox loss of 0.

kampelmuehler commented 6 years ago

If you are getting losses all over the place with an unmodified config file and the unmodified COCO dataset, then clearly something is wrong with either your Detectron or Caffe2 repos. In that case I would suggest starting from scratch and checking some of the baseline models.
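One environment check worth doing before a full re-install: the RetinaNet losses in this thread come from custom Caffe2 operators (`SelectSmoothL1Loss`, `SigmoidFocalLoss`, both visible in the net printout below), and a build where those are missing is a plausible culprit. A hedged sketch of how one might verify they are registered; the ops library path is installation-specific and left commented out:

```python
from __future__ import print_function
from caffe2.python import core, dyndep, workspace

workspace.GlobalInit(['caffe2'])

# If the operators come from a separately built shared library, load it
# first (path depends on your build):
# dyndep.InitOpsLibrary('/path/to/detectron_ops_library.so')

for op in ['SelectSmoothL1Loss', 'SigmoidFocalLoss']:
    # core.IsOperator checks the Caffe2 operator registry by name.
    print(op, 'registered:', core.IsOperator(op))
```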

soumenms2015 commented 6 years ago

@kampelmuehler I tried with a fresh clone of the repository and checked again. Unfortunately, no improvement.

soumenms2015 commented 6 years ago
The loss is NaN. Earlier, this problem went away after reducing the learning rate, but now I have reduced it to 0.000001 and the training still gets stuck at the first iteration with bbox_loss at 0, the same problem as with the earlier repository.
Here is the error:

```
INFO net.py: 240: retnet_bbox_conv_n3_fpn6 : (2, 256, 22, 10) => retnet_bbox_pred_fpn6 : (2, 36, 22, 10) ------- (op: Conv)
INFO net.py: 240: retnet_bbox_conv_n3_fpn7 : (2, 256, 11, 5) => retnet_bbox_pred_fpn7 : (2, 36, 11, 5) ------- (op: Conv)
INFO net.py: 240: retnet_bbox_pred_fpn3 : (2, 36, 176, 80) => retnet_loss_bbox_fpn3 : () ------- (op: SelectSmoothL1Loss)
INFO net.py: 240: retnet_roi_bbox_targets_fpn3: (46, 4) => retnet_loss_bbox_fpn3 : () ------|
INFO net.py: 240: retnet_roi_fg_bbox_locs_fpn3: (46, 4) => retnet_loss_bbox_fpn3 : () ------|
INFO net.py: 240: retnet_fg_num : (1,) => retnet_loss_bbox_fpn3 : () ------|
INFO net.py: 240: retnet_bbox_pred_fpn4 : (2, 36, 88, 40) => retnet_loss_bbox_fpn4 : () ------- (op: SelectSmoothL1Loss)
INFO net.py: 240: retnet_roi_bbox_targets_fpn4: (122, 4) => retnet_loss_bbox_fpn4 : () ------|
INFO net.py: 240: retnet_roi_fg_bbox_locs_fpn4: (122, 4) => retnet_loss_bbox_fpn4 : () ------|
INFO net.py: 240: retnet_fg_num : (1,) => retnet_loss_bbox_fpn4 : () ------|
INFO net.py: 240: retnet_bbox_pred_fpn5 : (2, 36, 44, 20) => retnet_loss_bbox_fpn5 : () ------- (op: SelectSmoothL1Loss)
INFO net.py: 240: retnet_roi_bbox_targets_fpn5: (107, 4) => retnet_loss_bbox_fpn5 : () ------|
INFO net.py: 240: retnet_roi_fg_bbox_locs_fpn5: (107, 4) => retnet_loss_bbox_fpn5 : () ------|
INFO net.py: 240: retnet_fg_num : (1,) => retnet_loss_bbox_fpn5 : () ------|
INFO net.py: 240: retnet_bbox_pred_fpn6 : (2, 36, 22, 10) => retnet_loss_bbox_fpn6 : () ------- (op: SelectSmoothL1Loss)
INFO net.py: 240: retnet_roi_bbox_targets_fpn6: (74, 4) => retnet_loss_bbox_fpn6 : () ------|
INFO net.py: 240: retnet_roi_fg_bbox_locs_fpn6: (74, 4) => retnet_loss_bbox_fpn6 : () ------|
INFO net.py: 240: retnet_fg_num : (1,) => retnet_loss_bbox_fpn6 : () ------|
INFO net.py: 240: retnet_bbox_pred_fpn7 : (2, 36, 11, 5) => retnet_loss_bbox_fpn7 : () ------- (op: SelectSmoothL1Loss)
INFO net.py: 240: retnet_roi_bbox_targets_fpn7: (75, 4) => retnet_loss_bbox_fpn7 : () ------|
INFO net.py: 240: retnet_roi_fg_bbox_locs_fpn7: (75, 4) => retnet_loss_bbox_fpn7 : () ------|
INFO net.py: 240: retnet_fg_num : (1,) => retnet_loss_bbox_fpn7 : () ------|
INFO net.py: 240: retnet_cls_pred_fpn3 : (2, 81, 176, 80) => fl_fpn3 : () ------- (op: SigmoidFocalLoss)
INFO net.py: 240: retnet_cls_labels_fpn3 : (2, 9, 176, 80) => fl_fpn3 : () ------|
INFO net.py: 240: retnet_fg_num : (1,) => fl_fpn3 : () ------|
INFO net.py: 240: retnet_cls_pred_fpn4 : (2, 81, 88, 40) => fl_fpn4 : () ------- (op: SigmoidFocalLoss)
INFO net.py: 240: retnet_cls_labels_fpn4 : (2, 9, 88, 40) => fl_fpn4 : () ------|
INFO net.py: 240: retnet_fg_num : (1,) => fl_fpn4 : () ------|
INFO net.py: 240: retnet_cls_pred_fpn5 : (2, 81, 44, 20) => fl_fpn5 : () ------- (op: SigmoidFocalLoss)
INFO net.py: 240: retnet_cls_labels_fpn5 : (2, 9, 44, 20) => fl_fpn5 : () ------|
INFO net.py: 240: retnet_fg_num : (1,) => fl_fpn5 : () ------|
INFO net.py: 240: retnet_cls_pred_fpn6 : (2, 81, 22, 10) => fl_fpn6 : () ------- (op: SigmoidFocalLoss)
INFO net.py: 240: retnet_cls_labels_fpn6 : (2, 9, 22, 10) => fl_fpn6 : () ------|
INFO net.py: 240: retnet_fg_num : (1,) => fl_fpn6 : () ------|
INFO net.py: 240: retnet_cls_pred_fpn7 : (2, 81, 11, 5) => fl_fpn7 : () ------- (op: SigmoidFocalLoss)
INFO net.py: 240: retnet_cls_labels_fpn7 : (2, 9, 11, 5) => fl_fpn7 : () ------|
INFO net.py: 240: retnet_fg_num : (1,) => fl_fpn7 : () ------|
INFO net.py: 244: End of model: retinanet
/home/.../anaconda2/lib/python2.7/site-packages/numpy/lib/function_base.py:4033: RuntimeWarning: Invalid value encountered in median
  r = func(a, **kwargs)
json_stats: {"eta": "17 days, 4:49:12", "fl_fpn3": 0.000000, "fl_fpn4": 0.000000, "fl_fpn5": 0.000000, "fl_fpn6": NaN, "fl_fpn7": 0.000000, "iter": 0, "loss": NaN, "lr": 0.000000, "mb_qsize": 64, "mem": 9069, "retnet_bg_num": 6678042.000000, "retnet_fg_num": 426.000000, "retnet_loss_bbox_fpn3": 0.000000, "retnet_loss_bbox_fpn4": 0.000000, "retnet_loss_bbox_fpn5": 0.000000, "retnet_loss_bbox_fpn6": 0.000000, "retnet_loss_bbox_fpn7": 0.000000, "time": 16.512804}
CRITICAL train_net.py: 159: Loss is NaN, exiting...
INFO loader.py: 126: Stopping enqueue thread
INFO loader.py: 113: Stopping mini-batch loading thread
INFO loader.py: 113: Stopping mini-batch loading thread
INFO loader.py: 113: Stopping mini-batch loading thread
INFO loader.py: 113: Stopping mini-batch loading thread
```

I am using the ImageNet-pretrained ResNeXt-101 model.
Config file: retinanet_X-101-64x4d-FPN_1x.yaml, with my own dataset.
Earlier I used the COCO dataset as well as my own dataset with the old repository, and bbox_loss was 0 at every iteration.
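When the loss goes NaN this early, one way to localize it is to scan the Caffe2 workspace for non-finite blobs right after a training step. A debugging sketch, assuming it is called from inside the training process (for example from a debugger or a temporary hook in train_net.py):

```python
from __future__ import print_function
import numpy as np
from caffe2.python import workspace

def find_bad_blobs():
    """Print the name of every float blob containing NaN or Inf."""
    for name in workspace.Blobs():
        try:
            blob = workspace.FetchBlob(name)
        except Exception:
            continue  # not every blob is a fetchable tensor
        if isinstance(blob, np.ndarray) and blob.dtype.kind == 'f':
            if not np.all(np.isfinite(blob)):
                print('non-finite values in:', name)

find_bad_blobs()
```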

Any suggestion or idea would be much appreciated. Thank you very much!

stanleychima commented 6 years ago

Okay, here is what you are going to do: check the settings.
