dbolya / yolact

A simple, fully convolutional model for real-time instance segmentation.
MIT License

Segmentation fault #483

Open xs020420 opened 4 years ago

xs020420 commented 4 years ago

Hi! Thanks for the nice work! I hit a bug while training YOLACT on the COCO dataset. At iteration 1360 of epoch 0 (batch size 5), training suddenly exits with "Segmentation fault" and gives no other information for debugging. Have you ever encountered this error, or could you give some advice on solving it?

abhigoku10 commented 4 years ago

@xs020420 Can you reduce the batch size to 4 or 2 and run the training again? If you still get the same error, check your annotations; there may be some image-loading issues.
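
One way to check the annotations is to scan the COCO-style JSON for boxes that are degenerate or fall outside their image. The sketch below is a minimal example, assuming the standard COCO layout (`images` entries with `id`, `width`, `height`; `annotations` entries with `id`, `image_id`, and `bbox` as `[x, y, w, h]`). The function name `find_suspect_annotations` and the one-pixel tolerance are made up for illustration and are not part of YOLACT.

```python
def find_suspect_annotations(coco, tol=1.0):
    """Return (annotation_id, reason) pairs for annotations that look broken.

    `coco` is a dict in the standard COCO layout. Hypothetical helper,
    not part of YOLACT. `tol` is a pixel tolerance for rounding in boxes.
    """
    images = {img["id"]: img for img in coco.get("images", [])}
    problems = []
    for ann in coco.get("annotations", []):
        ann_id = ann.get("id")
        img = images.get(ann.get("image_id"))
        if img is None:
            problems.append((ann_id, "unknown image_id"))
            continue
        bbox = ann.get("bbox")
        if not bbox or len(bbox) != 4:
            problems.append((ann_id, "missing or malformed bbox"))
            continue
        x, y, w, h = bbox
        if w <= 0 or h <= 0:
            problems.append((ann_id, "non-positive box size"))
        elif (x < -tol or y < -tol
              or x + w > img["width"] + tol
              or y + h > img["height"] + tol):
            problems.append((ann_id, "box outside image"))
    return problems
```

To use it, load the annotation file with `json.load` and print what the function returns. An empty list only means these particular checks passed; it does not rule out unreadable image files.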

xs020420 commented 4 years ago

Thanks for the advice! I will try it!

xs020420 commented 4 years ago

Thanks again for your advice! With the batch size set to 4, there has been no "Segmentation fault" so far (training has now run twice as many iterations as before), so I guess the problem is solved just by setting the batch size to 4. Could you explain why training does not run normally with a batch size of 5? I currently use 1 GPU, and both GPU and CPU memory are sufficient.

ic commented 4 years ago

To get more understanding, could you share the commit you’re running against, the Python version, OS, and perhaps a list of your dependencies and versions?

From the thread so far, reducing the batch size does not look like a reliable fix. When a process runs out of memory, the error message (on Linux and macOS) states that it tried to allocate more than is available, which clearly points to a memory issue. A bare segfault can mean many things (including a bug related to batch size).
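
Because a bare segfault carries no Python-level context, one low-cost step is Python's standard `faulthandler` module, which dumps the Python traceback of each thread when the process receives a fatal signal such as SIGSEGV. A minimal sketch, assuming it is placed at the very top of the training script (where exactly it would go in `train.py` is an assumption, not something confirmed in this thread):

```python
import faulthandler

# Dump the Python traceback of every thread to stderr if the process
# receives SIGSEGV, SIGFPE, SIGABRT, SIGBUS, or SIGILL.
faulthandler.enable(all_threads=True)
```

The same handler can be turned on without editing any code by running `python -X faulthandler train.py ...` or by setting the `PYTHONFAULTHANDLER=1` environment variable. The traceback shows which Python line was executing at the crash, which helps tell a data-loading problem apart from one inside a native library.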

xs020420 commented 4 years ago

Thanks for your help! I think the error is not related to batch size, because I also hit a "Segmentation fault" at iteration 9986 (batch size 4). For now I save the model every 1000 iterations, which is not a good strategy for dealing with this error. Here are my training details for reference.

1. Base environment: OS: Ubuntu 16.04.6 LTS; Python: 3.6.10; system CUDA: 10.0; training dataset: coco2014

2. Training command: python train.py --config=yolact_base_config --batch_size=4 --start_iter=-1 --lr=0.0001 --num_workers=0

3. conda dependencies: channels:

4. All parameters in cfg.dict: {'dataset': <data.config.Config object at 0x7f2af1e84c18>, 'num_classes': 2, 'max_iter': 20000.0, 'max_num_detections': 100, 'lr': 0.0005, 'momentum': 0.9, 'decay': 0.0005, 'gamma': 0.1, 'lr_steps': [5600.0, 12000.0, 14000.0, 15000.0], 'lr_warmup_init': 0.0001, 'lr_warmup_until': 500, 'conf_alpha': 1, 'bbox_alpha': 1.5, 'mask_alpha': 6.125, 'eval_mask_branch': True, 'nms_top_k': 200, 'nms_conf_thresh': 0.05, 'nms_thresh': 0.5, 'mask_type': 1, 'mask_size': 16, 'masks_to_train': 100, 'mask_proto_src': 0, 'mask_proto_net': [(256, 3, {'padding': 1}), (256, 3, {'padding': 1}), (256, 3, {'padding': 1}), (None, -2, {}), (256, 3, {'padding': 1}), (32, 1, {})], 'mask_proto_bias': False, 'mask_proto_prototype_activation': <function at 0x7f2a94aec378>, 'mask_proto_mask_activation': <built-in method sigmoid of type object at 0x7f2ae4573420>, 'mask_proto_coeff_activation': <built-in method tanh of type object at 0x7f2ae4573420>, 'mask_proto_crop': True, 'mask_proto_crop_expand': 0, 'mask_proto_loss': None, 'mask_proto_binarize_downsampled_gt': True, 'mask_proto_normalize_mask_loss_by_sqrt_area': False, 'mask_proto_reweight_mask_loss': False, 'mask_proto_grid_file': 'data/grid.npy', 'mask_proto_use_grid': False, 'mask_proto_coeff_gate': False, 'mask_proto_prototypes_as_features': False, 'mask_proto_prototypes_as_features_no_grad': False, 'mask_proto_remove_empty_masks': False, 'mask_proto_reweight_coeff': 1, 'mask_proto_coeff_diversity_loss': False, 'mask_proto_coeff_diversity_alpha': 1, 'mask_proto_normalize_emulate_roi_pooling': True, 'mask_proto_double_loss': False, 'mask_proto_double_loss_alpha': 1, 'mask_proto_split_prototypes_by_head': False, 'mask_proto_crop_with_pred_box': False, 'augment_photometric_distort': True, 'augment_expand': True, 'augment_random_sample_crop': True, 'augment_random_mirror': True, 'augment_random_flip': False, 'augment_random_rot90': False, 'discard_box_width': 0.007272727272727273, 'discard_box_height': 0.007272727272727273, 'freeze_bn': True, 'fpn': <data.config.Config object at 0x7f2a94b5c320>, 'share_prediction_module': True, 'ohem_use_most_confident': False, 'use_focal_loss': False, 'focal_loss_alpha': 0.25, 'focal_loss_gamma': 2, 'focal_loss_init_pi': 0.01, 'use_class_balanced_conf': False, 'use_sigmoid_focal_loss': False, 'use_objectness_score': False, 'use_class_existence_loss': False, 'class_existence_alpha': 1, 'use_semantic_segmentation_loss': True, 'semantic_segmentation_alpha': 1, 'use_mask_scoring': False, 'mask_scoring_alpha': 1, 'use_change_matching': False, 'extra_head_net': [(256, 3, {'padding': 1})], 'head_layer_params': {'kernel_size': 3, 'padding': 1}, 'extra_layers': (0, 0, 0), 'positive_iou_threshold': 0.5, 'negative_iou_threshold': 0.4, 'ohem_negpos_ratio': 3, 'crowd_iou_threshold': 0.7, 'mask_dim': 32, 'max_size': 550, 'force_cpu_nms': True, 'use_coeff_nms': False, 'use_instance_coeff': False, 'num_instance_coeffs': 64, 'train_masks': True, 'train_boxes': True, 'use_gt_bboxes': False, 'preserve_aspect_ratio': True, 'use_prediction_module': False, 'use_yolo_regressors': False, 'use_prediction_matching': False, 'delayed_settings': [], 'no_jit': False, 'backbone': <data.config.Config object at 0x7f2a94b5c2e8>, 'name': 'yolact_base', 'use_maskiou': False, 'maskiou_net': [], 'discard_mask_area': -1, 'maskiou_alpha': 1.0, 'rescore_mask': False, 'rescore_bbox': False, 'maskious_to_train': -1, 'num_heads': 5, '_tmp_img_h': 550, '_tmp_img_w': 550}
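
Until the root cause is found, the frequent-checkpoint workaround can be automated by a small wrapper that reruns the training command whenever it dies from a segfault. This is only a sketch under stated assumptions: `run_with_restart` is a hypothetical helper, not part of YOLACT, and it relies on the POSIX convention that `subprocess` reports death by signal N as return code -N.

```python
import signal
import subprocess
import sys


def run_with_restart(cmd, max_restarts=3):
    """Run `cmd` and rerun it each time it is killed by SIGSEGV.

    Returns (last_return_code, number_of_restarts). Hypothetical helper;
    works on POSIX, where death by signal N shows up as return code -N.
    """
    restarts = 0
    while True:
        code = subprocess.call(cmd)
        # Stop on a clean exit, on any non-segfault failure, or when the
        # restart budget is used up.
        if code != -signal.SIGSEGV or restarts >= max_restarts:
            return code, restarts
        restarts += 1
        print(f"segfault, restart {restarts}/{max_restarts}", file=sys.stderr)
```

Here `cmd` would be the training command as a list of arguments. The wrapper only helps if each rerun continues from the most recent saved weights, so the resume behaviour of the flags passed to `train.py` should be confirmed against the repository first.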

xs020420 commented 4 years ago

BTW, when I use gdb to capture the error, the top frame of the stack it reports is as follows: "0x00007fffc9bc8555 in std::__detail::_Map_base<void*, std::pair<void* const, (anonymous namespace)::Block>, std::allocator<std::pair<void* const, (anonymous namespace)::Block> >, std::__detail::_Select1st, std::equal_to<void*>, std::hash<void*>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<false, false, true>, true>::at(void* const&) [clone .constprop.228] ()"