facebookresearch / Detectron

FAIR's research platform for object detection research, implementing popular algorithms like Mask R-CNN and RetinaNet.
Apache License 2.0
26.26k stars 5.45k forks

Training error on updated latest repository #244

Closed soumenms2015 closed 6 years ago

soumenms2015 commented 6 years ago

While trying to run train_net.py on my own dataset with the RetinaNet model, I get the following error. I notice there are many changes being incorporated into the repository.

Traceback (most recent call last):
  File "tools/train_net.py", line 281, in <module>
    main()
  File "tools/train_net.py", line 119, in main
    checkpoints = train_model()
  File "tools/train_net.py", line 128, in train_model
    model, start_iter, checkpoints, output_dir = create_model()
  File "tools/train_net.py", line 178, in create_model
    output_dir = get_output_dir(cfg.TRAIN.DATASETS, training=True)
TypeError: get_output_dir() got multiple values for keyword argument 'training'

Does anyone have an idea where I am making a mistake? I am not able to figure it out.

gadcam commented 6 years ago

Could you share the code you use to train this model? It might make it easier to see where the problem is.

soumenms2015 commented 6 years ago

I just used the current GitHub repository, which is updated: as I can see, in train_net.py the call output_dir = get_output_dir(cfg.TRAIN.DATASETS, training=True) has been updated. Earlier the function took a single argument, training=True, but the instructions have not been updated.
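For reference, here is a minimal standalone sketch of how this kind of mismatch can arise (the function body below is hypothetical, not the actual Detectron code): if a stale, previously built copy of the module still has the old one-parameter signature, the new two-argument call in train_net.py produces exactly this TypeError.

```python
# Hypothetical stand-in for a stale copy of the module, where
# get_output_dir() still has the old one-parameter signature.
def get_output_dir(training=True):
    return 'train' if training else 'test'

# New-style call from the updated train_net.py: the datasets tuple fills the
# 'training' slot positionally, and the explicit keyword then collides.
datasets = ('coco_2014_train',)
try:
    get_output_dir(datasets, training=True)
except TypeError as e:
    print(e)  # "got multiple values for keyword argument 'training'" on Python 2.7
```

In other words, the updated train_net.py on disk can end up calling an older get_output_dir() that is still importable from a previous build, which is consistent with the rebuild suggestion below.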

ir413 commented 6 years ago

Hi @soumenms2015, I'm unable to reproduce this issue when training using the latest master.

Could you please double-check that you're not using old module versions? (e.g. you could try running make clean followed by make under detectron/lib)
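One way to double-check which copy is actually being imported is a quick inspect-based probe; the import path below is an assumption based on the traceback (older checkouts import the config module from lib/core/config.py), so adjust it to your layout:

```python
# Hypothetical sanity check: print which file Python actually imports for the
# config module and the current signature of get_output_dir(), to rule out a
# stale build shadowing the fresh clone. The import path is an assumption;
# adjust it to match your checkout.
import inspect
from core import config  # e.g. lib/core/config.py on older Detectron layouts

print(inspect.getsourcefile(config))
print(inspect.getargspec(config.get_output_dir))  # Python 2.7 API, matching the environment in the logs
```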

soumenms2015 commented 6 years ago

Okay! Thanks a lot. I cleaned and tried again, but got the same error. I am going to clone the latest repository again and will update you soon.

soumenms2015 commented 6 years ago

Thanks a lot, @ir413! Yes, it seems the mistake was on my side: I tried again with the new repository and it works. But the old problem still remains. I have the following observations:

  1. The loss is NaN. Earlier this problem went away after reducing the learning rate, but now, even with the learning rate reduced to 0.000001, training gets stuck at the first iteration and bbox_loss is 0, the same problem as with the earlier repository (a sketch of how I apply the reduced learning rate follows this list). Here is the output:

INFO net.py: 240: retnet_bbox_conv_n3_fpn6 : (2, 256, 22, 10) => retnet_bbox_pred_fpn6 : (2, 36, 22, 10) ------- (op: Conv)
INFO net.py: 240: retnet_bbox_conv_n3_fpn7 : (2, 256, 11, 5) => retnet_bbox_pred_fpn7 : (2, 36, 11, 5) ------- (op: Conv)
INFO net.py: 240: retnet_bbox_pred_fpn3 : (2, 36, 176, 80) => retnet_loss_bbox_fpn3 : () ------- (op: SelectSmoothL1Loss)
INFO net.py: 240: retnet_roi_bbox_targets_fpn3: (46, 4) => retnet_loss_bbox_fpn3 : () ------|
INFO net.py: 240: retnet_roi_fg_bbox_locs_fpn3: (46, 4) => retnet_loss_bbox_fpn3 : () ------|
INFO net.py: 240: retnet_fg_num : (1,) => retnet_loss_bbox_fpn3 : () ------|
INFO net.py: 240: retnet_bbox_pred_fpn4 : (2, 36, 88, 40) => retnet_loss_bbox_fpn4 : () ------- (op: SelectSmoothL1Loss)
INFO net.py: 240: retnet_roi_bbox_targets_fpn4: (122, 4) => retnet_loss_bbox_fpn4 : () ------|
INFO net.py: 240: retnet_roi_fg_bbox_locs_fpn4: (122, 4) => retnet_loss_bbox_fpn4 : () ------|
INFO net.py: 240: retnet_fg_num : (1,) => retnet_loss_bbox_fpn4 : () ------|
INFO net.py: 240: retnet_bbox_pred_fpn5 : (2, 36, 44, 20) => retnet_loss_bbox_fpn5 : () ------- (op: SelectSmoothL1Loss)
INFO net.py: 240: retnet_roi_bbox_targets_fpn5: (107, 4) => retnet_loss_bbox_fpn5 : () ------|
INFO net.py: 240: retnet_roi_fg_bbox_locs_fpn5: (107, 4) => retnet_loss_bbox_fpn5 : () ------|
INFO net.py: 240: retnet_fg_num : (1,) => retnet_loss_bbox_fpn5 : () ------|
INFO net.py: 240: retnet_bbox_pred_fpn6 : (2, 36, 22, 10) => retnet_loss_bbox_fpn6 : () ------- (op: SelectSmoothL1Loss)
INFO net.py: 240: retnet_roi_bbox_targets_fpn6: (74, 4) => retnet_loss_bbox_fpn6 : () ------|
INFO net.py: 240: retnet_roi_fg_bbox_locs_fpn6: (74, 4) => retnet_loss_bbox_fpn6 : () ------|
INFO net.py: 240: retnet_fg_num : (1,) => retnet_loss_bbox_fpn6 : () ------|
INFO net.py: 240: retnet_bbox_pred_fpn7 : (2, 36, 11, 5) => retnet_loss_bbox_fpn7 : () ------- (op: SelectSmoothL1Loss)
INFO net.py: 240: retnet_roi_bbox_targets_fpn7: (75, 4) => retnet_loss_bbox_fpn7 : () ------|
INFO net.py: 240: retnet_roi_fg_bbox_locs_fpn7: (75, 4) => retnet_loss_bbox_fpn7 : () ------|
INFO net.py: 240: retnet_fg_num : (1,) => retnet_loss_bbox_fpn7 : () ------|
INFO net.py: 240: retnet_cls_pred_fpn3 : (2, 81, 176, 80) => fl_fpn3 : () ------- (op: SigmoidFocalLoss)
INFO net.py: 240: retnet_cls_labels_fpn3 : (2, 9, 176, 80) => fl_fpn3 : () ------|
INFO net.py: 240: retnet_fg_num : (1,) => fl_fpn3 : () ------|
INFO net.py: 240: retnet_cls_pred_fpn4 : (2, 81, 88, 40) => fl_fpn4 : () ------- (op: SigmoidFocalLoss)
INFO net.py: 240: retnet_cls_labels_fpn4 : (2, 9, 88, 40) => fl_fpn4 : () ------|
INFO net.py: 240: retnet_fg_num : (1,) => fl_fpn4 : () ------|
INFO net.py: 240: retnet_cls_pred_fpn5 : (2, 81, 44, 20) => fl_fpn5 : () ------- (op: SigmoidFocalLoss)
INFO net.py: 240: retnet_cls_labels_fpn5 : (2, 9, 44, 20) => fl_fpn5 : () ------|
INFO net.py: 240: retnet_fg_num : (1,) => fl_fpn5 : () ------|
INFO net.py: 240: retnet_cls_pred_fpn6 : (2, 81, 22, 10) => fl_fpn6 : () ------- (op: SigmoidFocalLoss)
INFO net.py: 240: retnet_cls_labels_fpn6 : (2, 9, 22, 10) => fl_fpn6 : () ------|
INFO net.py: 240: retnet_fg_num : (1,) => fl_fpn6 : () ------|
INFO net.py: 240: retnet_cls_pred_fpn7 : (2, 81, 11, 5) => fl_fpn7 : () ------- (op: SigmoidFocalLoss)
INFO net.py: 240: retnet_cls_labels_fpn7 : (2, 9, 11, 5) => fl_fpn7 : () ------|
INFO net.py: 240: retnet_fg_num : (1,) => fl_fpn7 : () ------|
INFO net.py: 244: End of model: retinanet
/home/.../anaconda2/lib/python2.7/site-packages/numpy/lib/function_base.py:4033: RuntimeWarning: Invalid value encountered in median
  r = func(a, **kwargs)
json_stats: {"eta": "17 days, 4:49:12", "fl_fpn3": 0.000000, "fl_fpn4": 0.000000, "fl_fpn5": 0.000000, "fl_fpn6": NaN, "fl_fpn7": 0.000000, "iter": 0, "loss": NaN, "lr": 0.000000, "mb_qsize": 64, "mem": 9069, "retnet_bg_num": 6678042.000000, "retnet_fg_num": 426.000000, "retnet_loss_bbox_fpn3": 0.000000, "retnet_loss_bbox_fpn4": 0.000000, "retnet_loss_bbox_fpn5": 0.000000, "retnet_loss_bbox_fpn6": 0.000000, "retnet_loss_bbox_fpn7": 0.000000, "time": 16.512804}
CRITICAL train_net.py: 159: Loss is NaN, exiting...
INFO loader.py: 126: Stopping enqueue thread
INFO loader.py: 113: Stopping mini-batch loading thread
INFO loader.py: 113: Stopping mini-batch loading thread
INFO loader.py: 113: Stopping mini-batch loading thread
INFO loader.py: 113: Stopping mini-batch loading thread

  2. I am using the ImageNet-pretrained ResNeXt-101 model with the config file retinanet_X-101-64x4d-FPN_1x.yaml and my own dataset. Earlier I used the COCO dataset as well as my own dataset with the old repository, and bbox_loss was 0 at every iteration.
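For completeness, this is roughly how I apply the reduced learning rate, using what I understand to be the standard Detectron config helpers (cfg, merge_cfg_from_file, merge_cfg_from_list); the module path, config file path, and option names are assumptions that may differ between revisions:

```python
# Sketch of checking/overriding the solver settings before training, assuming
# the standard Detectron config helpers; adjust the import and file paths to
# your checkout.
from core.config import cfg, merge_cfg_from_file, merge_cfg_from_list

merge_cfg_from_file('path/to/retinanet_X-101-64x4d-FPN_1x.yaml')
# Same KEY VALUE override mechanism that train_net.py accepts on the command line:
merge_cfg_from_list(['SOLVER.BASE_LR', '0.000001'])

print(cfg.SOLVER.BASE_LR)        # confirm the reduced learning rate took effect
print(cfg.SOLVER.WARM_UP_ITERS)  # warm-up also scales the effective LR at iter 0
print(cfg.SOLVER.WARM_UP_FACTOR)
```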

Any suggestion or idea would be much appreciated. Thank you very much!

ir413 commented 6 years ago

Hi @soumenms2015, thanks for confirming. Closing this since the original issue has been addressed. Please open a separate issue for the new training problem you encountered.