Open xs020420 opened 4 years ago
@xs020420 Can you reduce the batch size to 4 or 2 and run the training again? If you still get the same error, check your annotations; there may be some image-loading issues.
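For example, a quick check over the annotations could look roughly like this (just a sketch assuming pycocotools and COCO-format annotations; the two paths are placeholders to adjust):

```python
# Sketch: verify every annotated image can be opened (paths are placeholders).
import os
from pycocotools.coco import COCO
from PIL import Image

ann_file = "data/coco/annotations/instances_train2014.json"  # placeholder path
img_dir = "data/coco/train2014"                               # placeholder path

coco = COCO(ann_file)
bad = []
for img_id in coco.getImgIds():
    info = coco.loadImgs(img_id)[0]
    path = os.path.join(img_dir, info["file_name"])
    try:
        with Image.open(path) as im:
            im.verify()  # cheap integrity check without a full decode
    except Exception as e:
        bad.append((path, repr(e)))

print(f"{len(bad)} problematic images out of {len(coco.getImgIds())}")
for path, err in bad[:20]:
    print(path, err)
```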
Thanks for the advice! I will try it!
Thanks for your advice again! With batch size 4 there has been no "segmentation fault" so far (the current iteration count is already twice what it reached before), so I guess the problem is solved just by setting the batch size to 4. Could you explain why training does not run normally with a batch size of 5? I am using a single GPU, and both GPU and CPU memory are sufficient.
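(One way to narrow this down, sketched here rather than taken from the thread: drive the DataLoader alone at batch size 5, without the model, and see whether the crash reproduces in the data pipeline. The `COCODetection`, `detection_collate`, `SSDAugmentation`, and `set_cfg` names below follow yolact's train.py and may need adjusting to your checkout.)

```python
# Sketch: iterate the data pipeline alone at batch size 5 (no model, no GPU)
# to see whether the segfault comes from data loading. The imports below
# follow yolact's train.py and may differ in your checkout.
import torch.utils.data as data
from data import COCODetection, detection_collate, cfg, set_cfg, MEANS
from utils.augmentations import SSDAugmentation

set_cfg("yolact_base_config")  # same config name as in the training command

dataset = COCODetection(image_path=cfg.dataset.train_images,
                        info_file=cfg.dataset.train_info,
                        transform=SSDAugmentation(MEANS))

loader = data.DataLoader(dataset, batch_size=5, shuffle=True,
                         num_workers=0, collate_fn=detection_collate)

for i, _batch in enumerate(loader):
    if i % 100 == 0:
        print(f"batch {i} loaded ok")
```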
To understand this better, could you share the commit you’re running against, the Python version, the OS, and perhaps a list of your dependencies and their versions?
From the thread so far, reducing the batch size does not look like a reliable fix. A genuine out-of-memory condition on Linux and macOS produces an explicit error saying the allocation failed, which clearly points to a memory issue. A bare segfault can mean many things (including a bug that happens to be sensitive to batch size).
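One way to rule memory in or out (a sketch, not something from the thread): log host RSS and the CUDA allocator counters every few hundred iterations and see whether either grows steadily up to the crash. psutil is assumed to be installed:

```python
# Sketch: log CPU RSS and GPU allocator usage periodically inside the training
# loop to check for a slow memory leak before the crash.
# Assumes psutil is installed; the torch.cuda calls require a CUDA build.
import psutil
import torch

def log_memory(iteration):
    rss_gb = psutil.Process().memory_info().rss / 1024 ** 3
    alloc_gb = torch.cuda.memory_allocated() / 1024 ** 3
    peak_gb = torch.cuda.max_memory_allocated() / 1024 ** 3
    print(f"iter {iteration}: cpu_rss={rss_gb:.2f} GiB, "
          f"gpu_alloc={alloc_gb:.2f} GiB, gpu_peak={peak_gb:.2f} GiB")

# e.g. inside the training loop:
# if iteration % 500 == 0:
#     log_memory(iteration)
```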
Thanks very much for your help! I think the error is not related to batch size, because I also hit a "segmentation fault" at iteration 9986 with batch size 4. As a workaround (not a good one), I currently save the model every 1000 iterations. Here are my training details for reference.
1. Base environment: OS Ubuntu 16.04.6 LTS, Python 3.6.10, system CUDA 10.0, training dataset COCO 2014
2. Training command: python train.py --config=yolact_base_config --batch_size=4 --start_iter=-1 --lr=0.0001 --num_workers=0
3. Conda dependencies: channels:
4. All parameters in cfg.dict:
{'dataset': <data.config.Config object at 0x7f2af1e84c18>, 'num_classes': 2, 'max_iter': 20000.0, 'max_num_detections': 100, 'lr': 0.0005, 'momentum': 0.9, 'decay': 0.0005, 'gamma': 0.1, 'lr_steps': [5600.0, 12000.0, 14000.0, 15000.0], 'lr_warmup_init': 0.0001, 'lr_warmup_until': 500, 'conf_alpha': 1, 'bbox_alpha': 1.5, 'mask_alpha': 6.125, 'eval_mask_branch': True, 'nms_top_k': 200, 'nms_conf_thresh': 0.05, 'nms_thresh': 0.5, 'mask_type': 1, 'mask_size': 16, 'masks_to_train': 100, 'mask_proto_src': 0, 'mask_proto_net': [(256, 3, {'padding': 1}), (256, 3, {'padding': 1}), (256, 3, {'padding': 1}), (None, -2, {}), (256, 3, {'padding': 1}), (32, 1, {})], 'mask_proto_bias': False, 'mask_proto_prototype_activation': <function
BTW, when I use gdb to capture the error, the top of the stack it returns is as follows: "0x00007fffc9bc8555 in std::__detail::_Map_base<void*, std::pair<void* const, (anonymous namespace)::Block>, std::allocator<std::pair<void* const, (anonymous namespace)::Block> >, std::__detail::_Select1st, std::equal_to<void*>, std::hash<void*>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<false, false, true>, true>::at(void* const&) [clone .constprop.228] ()"
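(A sketch, not something from the thread: Python's built-in faulthandler module can dump the Python-level stack when the process receives SIGSEGV, which complements the native backtrace gdb gives; no yolact-specific assumptions here.)

```python
# Sketch: get a Python-level traceback at the moment of the segfault.
# Option 1: add this near the top of train.py.
import faulthandler
faulthandler.enable()  # dumps the Python stack to stderr on SIGSEGV

# Option 2: run the unmodified script with the interpreter flag, or under gdb
# for a native backtrace:
#   python -X faulthandler train.py --config=yolact_base_config --batch_size=4
#   gdb --args python train.py --config=yolact_base_config --batch_size=4
#   (gdb) run
#   (gdb) bt
```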
Hi! Thanks for the nice work! Here is a bug I hit while training YOLACT on the COCO dataset: at iteration 1360 of epoch 0 (batch size 5), it suddenly returns "Segmentation fault" with no debugging information. Have you ever met this error, or could you give some advice on solving it?