Segmentation fault - Githubissues

xs020420 commented 4 years ago

Hi! Thanks for the nice work! I hit a bug while training yolact on the COCO dataset. At iteration 1360 of epoch 0 (batch size 5), training suddenly stops with "Segmentation fault" and no other information to debug with. Have you ever seen this error, or could you give some advice on solving it?

abhigoku10 commented 4 years ago

@xs020420 Can you reduce the batch size to 4 or 2 and run the training again? If you still get the same error, check your annotations; there might be some image loading issues.

xs020420 commented 4 years ago

Thanks for the advice! I will try it!

xs020420 commented 4 years ago

Thanks again for your advice! With the batch size set to 4, there has been no "Segmentation fault" so far (training has now run twice as many iterations as before), so I guess setting the batch size to 4 solved the problem. Could you explain why training does not run normally with a batch size of 5? I currently use 1 GPU, and both GPU and CPU memory are sufficient.

ic commented 4 years ago

To understand this better, could you share the commit you're running against, the Python version, OS, and perhaps a list of your dependencies and versions?

From the thread so far, reducing the batch size does not look like a reliable fix. When I run out of memory, the error message (on Linux and macOS) states that I'm trying to allocate more than I can, which is clearly a memory space issue. A bare segfault can mean many things (including a bug related to batch size).
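The details requested above can be gathered with a short script. This is only a sketch: the helper name `collect_env` and the default package list are assumptions chosen for illustration, not something from the yolact repository.

```python
import platform
import sys
from importlib import import_module


def collect_env(packages=("torch", "torchvision", "numpy", "cv2", "pycocotools")):
    """Return interpreter, OS, and package versions as a dict for a bug report.

    The package list is an assumption; adjust it to your own environment.
    """
    info = {
        "python": sys.version.split()[0],
        "os": platform.platform(),
    }
    for name in packages:
        try:
            module = import_module(name)
        except Exception:
            # Covers missing packages as well as packages that fail to load.
            info[name] = "not installed"
        else:
            info[name] = getattr(module, "__version__", "unknown")
    return info


if __name__ == "__main__":
    for key, value in collect_env().items():
        print(f"{key}: {value}")
```

Pasting the printed output into the issue, together with the output of `git rev-parse HEAD` for the commit, should cover most of what was asked for.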

xs020420 commented 4 years ago

Thanks for your help! I now think the error is not related to batch size, because I also hit a "Segmentation fault" at iteration 9986 (batch size 4). As a stopgap, I currently save the model every 1000 iterations, which is not a good strategy for dealing with this error. Here are my training details for reference.

1. Base environment:
OS: Ubuntu 16.04.6 LTS
Python: 3.6.10
System CUDA: 10.0
Training dataset: coco2014

1. Training command: python train.py --config=yolact_base_config --batch_size=4 --start_iter=-1 --lr=0.0001 --num_workers=0

2. conda dependencies:
channels:

https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/pytorch/
https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main/
https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/free/
defaults
conda-forge
dependencies:
libmediainfo=20.03=h0b14f55_0
libzen=0.4.38=he1b5a44_0
pymediainfo=4.2.1=py36h9f0ad1d_1
python_abi=3.6=1_cp36m
tinyxml2=8.0.0=he1b5a44_1
_libgcc_mutex=0.1=main
_pytorch_select=0.2=gpu_0
blas=1.0=mkl
ca-certificates=2020.1.1=0
certifi=2020.6.20=py36_0
cffi=1.14.0=py36he30daa8_1
cudatoolkit=10.0.130=0
cudnn=7.6.5=cuda10.0_0
freetype=2.10.2=h5ab3b9f_0
intel-openmp=2020.1=217
jpeg=9b=h024ee3a_2
krb5=1.17.1=h173b8e3_0
ld_impl_linux-64=2.33.1=h53a641e_7
libcurl=7.69.1=h20c2e04_0
libedit=3.1.20191231=h7b6447c_0
libffi=3.3=he6710b0_1
libgcc-ng=9.1.0=hdf63c60_0
libgfortran-ng=7.3.0=hdf63c60_0
libpng=1.6.37=hbc83047_0
libssh2=1.9.0=h1ba5d50_1
libstdcxx-ng=9.1.0=hdf63c60_0
libtiff=4.1.0=h2733197_1
lz4-c=1.9.2=he6710b0_0
mkl=2020.1=217
mkl-service=2.3.0=py36he904b0f_0
mkl_fft=1.1.0=py36h23d657b_0
mkl_random=1.1.1=py36h0573a6f_0
ncurses=6.2=he6710b0_1
ninja=1.9.0=py36hfd86e86_0
numpy=1.18.5=py36ha1c710e_0
numpy-base=1.18.5=py36hde5b4d6_0
olefile=0.46=py_0
openssl=1.1.1g=h7b6447c_0
pillow=7.1.2=py36hb39fc2d_0
pip=20.1.1=py36_1
pycparser=2.20=py_0
python=3.6.10=h7579374_2
pytorch=1.2.0=cuda100py36h938c94c_0
readline=8.0=h7b6447c_0
setuptools=47.3.1=py36_0
six=1.15.0=py_0
sqlite=3.32.3=h62c20be_0
tk=8.6.10=hbc83047_0
torchvision=0.4.0=cuda100py36hecfc37a_0
wheel=0.34.2=py36_0
xz=5.2.5=h7b6447c_0
zlib=1.2.11=h7b6447c_3
zstd=1.4.4=h0b5b093_3
pip:
- astroid==2.4.2
- attrs==19.3.0
- cycler==0.10.0
- cython==0.29.20
- cython-bbox==0.1.3
- decorator==4.4.2
- et-xmlfile==1.0.1
- flake8==3.8.3
- flake8-import-order==0.18.1
- imageio==2.9.0
- importlib-metadata==1.7.0
- isort==4.3.21
- jdcal==1.4.1
- kiwisolver==1.2.0
- lap==0.4.0
- lazy-object-proxy==1.4.3
- llvmlite==0.33.0
- matplotlib==3.2.2
- mccabe==0.6.1
- more-itertools==8.4.0
- motmetrics==1.2.0
- networkx==2.4
- numba==0.50.1
- opencv-python==4.2.0.34
- openpyxl==3.0.4
- packaging==20.4
- pandas==1.0.5
- pluggy==0.13.1
- progress==1.5
- protobuf==3.12.2
- py==1.9.0
- py-cpuinfo==6.0.0
- pycocotools==2.0.1
- pycodestyle==2.6.0
- pyflakes==2.2.0
- pylint==2.5.3
- pyparsing==2.4.7
- pytest==5.4.3
- pytest-benchmark==3.2.3
- python-dateutil==2.8.1
- pytz==2020.1
- pywavelets==1.1.1
- pyyaml==5.3.1
- scikit-image==0.17.2
- scipy==1.5.0
- tensorboardx==2.0
- tifffile==2020.7.4
- toml==0.10.1
- torch==1.2.0
- tqdm==4.47.0
- typed-ast==1.4.1
- wcwidth==0.2.5
- wrapt==1.12.1
- xmltodict==0.12.0
- yacs==0.1.7
- zipp==3.1.0

3.all parameters in cfg.dict: {'dataset': <data.config.Config object at 0x7f2af1e84c18>, 'num_classes': 2, 'max_iter': 20000.0, 'max_num_detections': 100, 'lr': 0.0005, 'momentum': 0.9, 'decay': 0.0005, 'gamma': 0.1, 'lr_steps': [5600.0, 12000.0, 14000.0, 15000.0], 'lr_warmup_init': 0.0001, 'lr_warmup_until': 500, 'conf_alpha': 1, 'bbox_alpha': 1.5, 'mask_alpha': 6.125, 'eval_mask_branch': True, 'nms_top_k': 200, 'nms_conf_thresh': 0.05, 'nms_thresh': 0.5, 'mask_type': 1, 'mask_size': 16, 'masks_to_train': 100, 'mask_proto_src': 0, 'mask_proto_net': [(256, 3, {'padding': 1}), (256, 3, {'padding': 1}), (256, 3, {'padding': 1}), (None, -2, {}), (256, 3, {'padding': 1}), (32, 1, {})], 'mask_proto_bias': False, 'mask_proto_prototype_activation': <function at 0x7f2a94aec378>, 'mask_proto_mask_activation': <built-in method sigmoid of type object at 0x7f2ae4573420>, 'mask_proto_coeff_activation': <built-in method tanh of type object at 0x7f2ae4573420>, 'mask_proto_crop': True, 'mask_proto_crop_expand': 0, 'mask_proto_loss': None, 'mask_proto_binarize_downsampled_gt': True, 'mask_proto_normalize_mask_loss_by_sqrt_area': False, 'mask_proto_reweight_mask_loss': False, 'mask_proto_grid_file': 'data/grid.npy', 'mask_proto_use_grid': False, 'mask_proto_coeff_gate': False, 'mask_proto_prototypes_as_features': False, 'mask_proto_prototypes_as_features_no_grad': False, 'mask_proto_remove_empty_masks': False, 'mask_proto_reweight_coeff': 1, 'mask_proto_coeff_diversity_loss': False, 'mask_proto_coeff_diversity_alpha': 1, 'mask_proto_normalize_emulate_roi_pooling': True, 'mask_proto_double_loss': False, 'mask_proto_double_loss_alpha': 1, 'mask_proto_split_prototypes_by_head': False, 'mask_proto_crop_with_pred_box': False, 'augment_photometric_distort': True, 'augment_expand': True, 'augment_random_sample_crop': True, 'augment_random_mirror': True, 'augment_random_flip': False, 'augment_random_rot90': False, 'discard_box_width': 0.007272727272727273, 'discard_box_height': 
0.007272727272727273, 'freeze_bn': True, 'fpn': <data.config.Config object at 0x7f2a94b5c320>, 'share_prediction_module': True, 'ohem_use_most_confident': False, 'use_focal_loss': False, 'focal_loss_alpha': 0.25, 'focal_loss_gamma': 2, 'focal_loss_init_pi': 0.01, 'use_class_balanced_conf': False, 'use_sigmoid_focal_loss': False, 'use_objectness_score': False, 'use_class_existence_loss': False, 'class_existence_alpha': 1, 'use_semantic_segmentation_loss': True, 'semantic_segmentation_alpha': 1, 'use_mask_scoring': False, 'mask_scoring_alpha': 1, 'use_change_matching': False, 'extra_head_net': [(256, 3, {'padding': 1})], 'head_layer_params': {'kernel_size': 3, 'padding': 1}, 'extra_layers': (0, 0, 0), 'positive_iou_threshold': 0.5, 'negative_iou_threshold': 0.4, 'ohem_negpos_ratio': 3, 'crowd_iou_threshold': 0.7, 'mask_dim': 32, 'max_size': 550, 'force_cpu_nms': True, 'use_coeff_nms': False, 'use_instance_coeff': False, 'num_instance_coeffs': 64, 'train_masks': True, 'train_boxes': True, 'use_gt_bboxes': False, 'preserve_aspect_ratio': True, 'use_prediction_module': False, 'use_yolo_regressors': False, 'use_prediction_matching': False, 'delayed_settings': [], 'no_jit': False, 'backbone': <data.config.Config object at 0x7f2a94b5c2e8>, 'name': 'yolact_base', 'use_maskiou': False, 'maskiou_net': [], 'discard_mask_area': -1, 'maskiou_alpha': 1.0, 'rescore_mask': False, 'rescore_bbox': False, 'maskious_to_train': -1, 'num_heads': 5, '_tmp_img_h': 550, '_tmp_img_w': 550}
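The stopgap mentioned in this comment, saving every 1000 iterations and resuming after a crash, could look roughly like the sketch below. It uses only the standard library and a plain picklable dict; `save_checkpoint`, `load_checkpoint`, and `train` are hypothetical names, and in a real PyTorch run the state would instead be model and optimizer state dicts written with the framework's own save function.

```python
import os
import pickle
import tempfile


def save_checkpoint(state, path):
    """Write state to path via a temp file so a crash mid-write cannot corrupt it."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as handle:
            pickle.dump(state, handle)
            handle.flush()
            os.fsync(handle.fileno())
        os.replace(tmp_path, path)  # replaces the old checkpoint in one step
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def load_checkpoint(path, default=None):
    """Return the saved state, or default if no checkpoint exists yet."""
    if not os.path.exists(path):
        return default
    with open(path, "rb") as handle:
        return pickle.load(handle)


def train(max_iter, checkpoint_path, save_interval=1000):
    """Hypothetical loop: resume from the last checkpoint and save periodically."""
    state = load_checkpoint(checkpoint_path, default={"iteration": 0})
    for iteration in range(state["iteration"], max_iter):
        # ... one training step would go here ...
        state["iteration"] = iteration + 1
        if state["iteration"] % save_interval == 0:
            save_checkpoint(state, checkpoint_path)
    return state
```

With this layout, a crash at iteration 9986 loses at most the work done since iteration 9000, and rerunning `train` picks up from there. It works around the segfault without explaining it.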

xs020420 commented 4 years ago

BTW, when I use gdb to capture the error, the top of the call stack it returns is as follows: "0x00007fffc9bc8555 in std::__detail::_Map_base<void*, std::pair<void* const, (anonymous namespace)::Block>, std::allocator<std::pair<void* const, (anonymous namespace)::Block> >, std::__detail::_Select1st, std::equal_to<void*>, std::hash<void*>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<false, false, true>, true>::at(void* const&) [clone .constprop.228] ()"
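The gdb frame above shows only the native side of the crash. To see which Python line was executing when the signal arrived, the standard-library `faulthandler` module can be enabled at the top of the training script. A minimal sketch follows; the wrapper name `enable_crash_traceback` and the log path are assumptions for illustration.

```python
import faulthandler
import sys


def enable_crash_traceback(log_path=None):
    """Dump Python tracebacks of all threads on SIGSEGV and similar fatal signals.

    Returns the open log file, or None when writing to stderr. The file must
    stay open for as long as the handler is enabled.
    """
    if log_path is None:
        faulthandler.enable(file=sys.stderr, all_threads=True)
        return None
    log_file = open(log_path, "w")
    faulthandler.enable(file=log_file, all_threads=True)
    return log_file
```

The same handler can be switched on without editing any code by starting the interpreter with `python -X faulthandler train.py ...`. The Python traceback complements the native backtrace from gdb rather than replacing it.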

dbolya / yolact

Segmentation fault #483