Closed YJHMITWEB closed 6 years ago
@YJHMITWEB I ran into the same issue as you. How did you solve it? Thank you!
@ChongZhangZC @YJHMITWEB Are you using the latest version of Detectron? I think the issue you hit should be solved now that https://github.com/pytorch/pytorch/pull/6896 has been merged (your error message is the same as the one in https://github.com/pytorch/pytorch/pull/6896#issuecomment-384483143).
Hi @daquexian, I just ran into the same issue, and I made sure I am on the latest version of Detectron. I built the Docker image from the provided Dockerfile and even downloaded the COCO 2017 dataset to retrain the model, and that works.
However, when I swap in my own dataset, this issue appears.
I have four GTX 1080 Ti cards on one machine, but I only want to use No. 0, 2, and 3, leaving No. 1 for others. So I set CUDA_VISIBLE_DEVICES=0,2,3 in my command and changed the config file to NUM_GPUS: 3. However, it throws this:
jinghan@amax:~/Research/detectron$ CUDA_VISIBLE_DEVICES=0,2,3 python tools/train_net.py --cfg Experiments/cfgs/retina_50.yaml OUTPUT_DIR Experiments/Outputs
Found Detectron ops lib: /home/yuz/work/caffe2/build/lib/libcaffe2_detectron_ops_gpu.so
INFO train_net.py: 95: Called with args:
INFO train_net.py: 96: Namespace(cfg_file='Experiments/cfgs/retina_50.yaml', multi_gpu_testing=False, opts=['OUTPUT_DIR', 'Experiments/Outputs'], skip_test=False)
INFO train_net.py: 102: Training with config:
INFO train_net.py: 103: {'BBOX_XFORM_CLIP': 4.1351665567423561, 'CLUSTER': {'ON_CLUSTER': False}, 'DATA_LOADER': {'BLOBS_QUEUE_CAPACITY': 8, 'MINIBATCH_QUEUE_SIZE': 64, 'NUM_THREADS': 4}, 'DEDUP_BOXES': 0.0625, 'DOWNLOAD_CACHE': '/tmp/detectron-download-cache', 'EPS': 1e-14, 'EXPECTED_RESULTS': [], 'EXPECTED_RESULTS_ATOL': 0.005, 'EXPECTED_RESULTS_EMAIL': '', 'EXPECTED_RESULTS_RTOL': 0.1, 'FAST_RCNN': {'CONV_HEAD_DIM': 256, 'MLP_HEAD_DIM': 1024, 'NUM_STACKED_CONVS': 4, 'ROI_BOX_HEAD': '', 'ROI_XFORM_METHOD': 'RoIPoolF', 'ROI_XFORM_RESOLUTION': 14, 'ROI_XFORM_SAMPLING_RATIO': 0}, 'FPN': {'COARSEST_STRIDE': 128, 'DIM': 256, 'EXTRA_CONV_LEVELS': True, 'FPN_ON': True, 'MULTILEVEL_ROIS': False, 'MULTILEVEL_RPN': True, 'ROI_CANONICAL_LEVEL': 4, 'ROI_CANONICAL_SCALE': 224, 'ROI_MAX_LEVEL': 5, 'ROI_MIN_LEVEL': 2, 'RPN_ANCHOR_START_SIZE': 32, 'RPN_ASPECT_RATIOS': (0.5, 1, 2), 'RPN_MAX_LEVEL': 7, 'RPN_MIN_LEVEL': 3, 'USE_GN': False, 'ZERO_INIT_LATERAL': False}, 'GROUP_NORM': {'DIM_PER_GP': -1, 'EPSILON': 1e-05, 'NUM_GROUPS': 32}, 'KRCNN': {'CONV_HEAD_DIM': 256, 'CONV_HEAD_KERNEL': 3, 'CONV_INIT': 'GaussianFill', 'DECONV_DIM': 256, 'DECONV_KERNEL': 4, 'DILATION': 1, 'HEATMAP_SIZE': -1, 'INFERENCE_MIN_SIZE': 0, 'KEYPOINT_CONFIDENCE': 'bbox', 'LOSS_WEIGHT': 1.0, 'MIN_KEYPOINT_COUNT_FOR_VALID_MINIBATCH': 20, 'NMS_OKS': False, 'NORMALIZE_BY_VISIBLE_KEYPOINTS': True, 'NUM_KEYPOINTS': -1, 'NUM_STACKED_CONVS': 8, 'ROI_KEYPOINTS_HEAD': '', 'ROI_XFORM_METHOD': 'RoIAlign', 'ROI_XFORM_RESOLUTION': 7, 
'ROI_XFORM_SAMPLING_RATIO': 0, 'UP_SCALE': -1, 'USE_DECONV': False, 'USE_DECONV_OUTPUT': False}, 'MATLAB': 'matlab', 'MEMONGER': True, 'MEMONGER_SHARE_ACTIVATIONS': False, 'MODEL': {'BBOX_REG_WEIGHTS': (10.0, 10.0, 5.0, 5.0), 'CLS_AGNOSTIC_BBOX_REG': False, 'CONV_BODY': 'FPN.add_fpn_ResNet50_conv5_body', 'EXECUTION_TYPE': 'dag', 'FASTER_RCNN': False, 'KEYPOINTS_ON': False, 'MASK_ON': False, 'NUM_CLASSES': 81, 'RPN_ONLY': False, 'TYPE': 'retinanet'}, 'MRCNN': {'CLS_SPECIFIC_MASK': True, 'CONV_INIT': 'GaussianFill', 'DILATION': 2, 'DIM_REDUCED': 256, 'RESOLUTION': 14, 'ROI_MASK_HEAD': '', 'ROI_XFORM_METHOD': 'RoIAlign', 'ROI_XFORM_RESOLUTION': 7, 'ROI_XFORM_SAMPLING_RATIO': 0, 'THRESH_BINARIZE': 0.5, 'UPSAMPLE_RATIO': 1, 'USE_FC_OUTPUT': False, 'WEIGHT_LOSS_MASK': 1.0}, 'NUM_GPUS': 3, 'OUTPUT_DIR': 'Experiments/Outputs', 'PIXEL_MEANS': array([[[ 102.9801, 115.9465, 122.7717]]]), 'RESNETS': {'NUM_GROUPS': 1, 'RES5_DILATION': 1, 'SHORTCUT_FUNC': 'basic_bn_shortcut', 'STEM_FUNC': 'basic_bn_stem', 'STRIDE_1X1': True, 'TRANS_FUNC': 'bottleneck_transformation', 'WIDTH_PER_GROUP': 64}, 'RETINANET': {'ANCHOR_SCALE': 4, 'ASPECT_RATIOS': (1.0, 2.0, 0.5), 'BBOX_REG_BETA': 0.11, 'BBOX_REG_WEIGHT': 1.0, 'CLASS_SPECIFIC_BBOX': False, 'INFERENCE_TH': 0.05, 'LOSS_ALPHA': 0.25, 'LOSS_GAMMA': 2.0, 'NEGATIVE_OVERLAP': 0.4, 'NUM_CONVS': 4, 'POSITIVE_OVERLAP': 0.5, 'PRE_NMS_TOP_N': 1000, 'PRIOR_PROB': 0.01, 'RETINANET_ON': True, 'SCALES_PER_OCTAVE': 3, 'SHARE_CLS_BBOX_TOWER': False, 'SOFTMAX': False}, 'RFCN': {'PS_GRID_SIZE': 3}, 'RNG_SEED': 3, 'ROOT_DIR': '/home/jinghan/Research/detectron', 'RPN': {'ASPECT_RATIOS': (0.5, 1, 2), 'RPN_ON': False, 'SIZES': (64, 128, 256, 512), 'STRIDE': 16}, 'SOLVER': {'BASE_LR': 0.01, 'GAMMA': 0.1, 'LOG_LR_CHANGE_THRESHOLD': 1.1, 'LRS': [], 'LR_POLICY': 'steps_with_decay', 'MAX_ITER': 180000, 'MOMENTUM': 0.9, 'SCALE_MOMENTUM': True, 'SCALE_MOMENTUM_THRESHOLD': 1.1, 'STEPS': [0, 120000, 160000], 'STEP_SIZE': 30000, 'WARM_UP_FACTOR': 0.3333333333333333, 
'WARM_UP_ITERS': 500, 'WARM_UP_METHOD': u'linear', 'WEIGHT_DECAY': 0.0001, 'WEIGHT_DECAY_GN': 0.0}, 'TEST': {'BBOX_AUG': {'AREA_TH_HI': 32400, 'AREA_TH_LO': 2500, 'ASPECT_RATIOS': (), 'ASPECT_RATIO_H_FLIP': False, 'COORD_HEUR': 'UNION', 'ENABLED': False, 'H_FLIP': False, 'MAX_SIZE': 4000, 'SCALES': (), 'SCALE_H_FLIP': False, 'SCALE_SIZE_DEP': False, 'SCORE_HEUR': 'UNION'}, 'BBOX_REG': True, 'BBOX_VOTE': {'ENABLED': False, 'SCORING_METHOD': 'ID', 'SCORING_METHOD_BETA': 1.0, 'VOTE_TH': 0.8}, 'COMPETITION_MODE': True, 'DATASETS': ('coco_2017_val',), 'DETECTIONS_PER_IM': 100, 'FORCE_JSON_DATASET_EVAL': False, 'KPS_AUG': {'AREA_TH': 32400, 'ASPECT_RATIOS': (), 'ASPECT_RATIO_H_FLIP': False, 'ENABLED': False, 'HEUR': 'HM_AVG', 'H_FLIP': False, 'MAX_SIZE': 4000, 'SCALES': (), 'SCALE_H_FLIP': False, 'SCALE_SIZE_DEP': False}, 'MASK_AUG': {'AREA_TH': 32400, 'ASPECT_RATIOS': (), 'ASPECT_RATIO_H_FLIP': False, 'ENABLED': False, 'HEUR': 'SOFT_AVG', 'H_FLIP': False, 'MAX_SIZE': 4000, 'SCALES': (), 'SCALE_H_FLIP': False, 'SCALE_SIZE_DEP': False}, 'MAX_SIZE': 1333, 'NMS': 0.5, 'PRECOMPUTED_PROPOSALS': False, 'PROPOSAL_FILES': (), 'PROPOSAL_LIMIT': 2000, 'RPN_MIN_SIZE': 0, 'RPN_NMS_THRESH': 0.7, 'RPN_POST_NMS_TOP_N': 2000, 'RPN_PRE_NMS_TOP_N': 10000, 'SCALE': 800, 'SCORE_THRESH': 0.05, 'SOFT_NMS': {'ENABLED': False, 'METHOD': 'linear', 'SIGMA': 0.5}, 'WEIGHTS': ''}, 'TRAIN': {'ASPECT_GROUPING': True, 'AUTO_RESUME': True, 'BATCH_SIZE_PER_IM': 64, 'BBOX_THRESH': 0.5, 'BG_THRESH_HI': 0.5, 'BG_THRESH_LO': 0.0, 'COPY_WEIGHTS': False, 'CROWD_FILTER_THRESH': 0.7, 'DATASETS': ('coco_2017_train',), 'FG_FRACTION': 0.25, 'FG_THRESH': 0.5, 'FREEZE_AT': 2, 'FREEZE_CONV_BODY': False, 'GT_MIN_AREA': -1, 'IMS_PER_BATCH': 2, 'MAX_SIZE': 1333, 'PROPOSAL_FILES': (), 'RPN_BATCH_SIZE_PER_IM': 256, 'RPN_FG_FRACTION': 0.5, 'RPN_MIN_SIZE': 0, 'RPN_NEGATIVE_OVERLAP': 0.3, 'RPN_NMS_THRESH': 0.7, 'RPN_POSITIVE_OVERLAP': 0.7, 'RPN_POST_NMS_TOP_N': 2000, 'RPN_PRE_NMS_TOP_N': 12000, 'RPN_STRADDLE_THRESH': -1, 
'SCALES': (800,), 'SNAPSHOT_ITERS': 20000, 'USE_FLIPPED': True, 'WEIGHTS': '/home/jinghan/Research/Detectron/Models/R-50.pkl'}, 'USE_NCCL': False, 'VIS': False, 'VIS_TH': 0.9}
INFO train.py: 138: Building model: retinanet
WARNING cnn.py: 40: [====DEPRECATE WARNING====]: you are creating an object from CNNModelHelper class which will be deprecated soon. Please use ModelHelper object with brew module. For more information, please refer to caffe2.ai and python/brew.py, python/brew_test.py for more information.
WARNING memonger.py: 70: NOTE: Executing memonger to optimize gradient memory
I0714 21:49:23.489017 23172 memonger.cc:252] Remapping 122 using 26 shared blobs.
INFO memonger.py: 112: Memonger memory optimization took 0.222306013107 secs
WARNING memonger.py: 70: NOTE: Executing memonger to optimize gradient memory
I0714 21:49:24.268831 23172 memonger.cc:252] Remapping 122 using 26 shared blobs.
INFO memonger.py: 112: Memonger memory optimization took 0.202149868011 secs
WARNING memonger.py: 70: NOTE: Executing memonger to optimize gradient memory
I0714 21:49:25.027472 23172 memonger.cc:252] Remapping 122 using 26 shared blobs.
INFO memonger.py: 112: Memonger memory optimization took 0.203737974167 secs
I0714 21:49:27.005168 23172 context_gpu.cu:321] GPU 0: 129 MB
I0714 21:49:27.005203 23172 context_gpu.cu:325] Total: 129 MB
I0714 21:49:27.372752 23172 context_gpu.cu:321] GPU 0: 144 MB
I0714 21:49:27.372772 23172 context_gpu.cu:321] GPU 1: 117 MB
I0714 21:49:27.372777 23172 context_gpu.cu:325] Total: 262 MB
I0714 21:49:27.734133 23172 context_gpu.cu:321] GPU 0: 144 MB
I0714 21:49:27.734153 23172 context_gpu.cu:321] GPU 1: 144 MB
I0714 21:49:27.734158 23172 context_gpu.cu:321] GPU 2: 117 MB
I0714 21:49:27.734161 23172 context_gpu.cu:325] Total: 407 MB
I0714 21:49:27.812615 23172 context_gpu.cu:321] GPU 0: 261 MB
I0714 21:49:27.812630 23172 context_gpu.cu:321] GPU 1: 144 MB
I0714 21:49:27.812634 23172 context_gpu.cu:321] GPU 2: 144 MB
I0714 21:49:27.812638 23172 context_gpu.cu:325] Total: 551 MB
I0714 21:49:27.842473 23172 context_gpu.cu:321] GPU 0: 288 MB
I0714 21:49:27.842489 23172 context_gpu.cu:321] GPU 1: 261 MB
I0714 21:49:27.842492 23172 context_gpu.cu:321] GPU 2: 144 MB
I0714 21:49:27.842496 23172 context_gpu.cu:325] Total: 695 MB
I0714 21:49:27.940443 23172 context_gpu.cu:321] GPU 0: 288 MB
I0714 21:49:27.940472 23172 context_gpu.cu:321] GPU 1: 288 MB
I0714 21:49:27.940481 23172 context_gpu.cu:321] GPU 2: 261 MB
I0714 21:49:27.940490 23172 context_gpu.cu:325] Total: 838 MB
INFO train.py: 186: Loading dataset: ('coco_2017_train',)
loading annotations into memory...
Done (t=20.24s)
creating index...
index created!
INFO roidb.py: 49: Appending horizontally-flipped training examples...
INFO roidb.py: 51: Loaded dataset: coco_2017_train
INFO roidb.py: 135: Filtered 2042 roidb entries: 236574 -> 234532
INFO roidb.py: 67: Computing bounding-box regression targets...
INFO roidb.py: 69: done
INFO train.py: 190: 234532 roidb entries
INFO net.py: 59: Loading weights from: /home/jinghan/Research/Detectron/Models/R-50.pkl
INFO net.py: 88: fpn_inner_res5_2_sum_w not found
INFO net.py: 88: fpn_inner_res5_2_sum_b not found
INFO net.py: 88: fpn_inner_res4_5_sum_lateral_w not found
INFO net.py: 88: fpn_inner_res4_5_sum_lateral_b not found
INFO net.py: 88: fpn_inner_res3_3_sum_lateral_w not found
INFO net.py: 88: fpn_inner_res3_3_sum_lateral_b not found
INFO net.py: 88: fpn_res5_2_sum_w not found
INFO net.py: 88: fpn_res5_2_sum_b not found
INFO net.py: 88: fpn_res4_5_sum_w not found
INFO net.py: 88: fpn_res4_5_sum_b not found
INFO net.py: 88: fpn_res3_3_sum_w not found
INFO net.py: 88: fpn_res3_3_sum_b not found
INFO net.py: 88: fpn_6_w not found
INFO net.py: 88: fpn_6_b not found
INFO net.py: 88: fpn_7_w not found
INFO net.py: 88: fpn_7_b not found
INFO net.py: 88: retnet_cls_conv_n0_fpn3_w not found
INFO net.py: 88: retnet_cls_conv_n0_fpn3_b not found
INFO net.py: 88: retnet_cls_conv_n1_fpn3_w not found
INFO net.py: 88: retnet_cls_conv_n1_fpn3_b not found
INFO net.py: 88: retnet_cls_conv_n2_fpn3_w not found
INFO net.py: 88: retnet_cls_conv_n2_fpn3_b not found
INFO net.py: 88: retnet_cls_conv_n3_fpn3_w not found
INFO net.py: 88: retnet_cls_conv_n3_fpn3_b not found
INFO net.py: 88: retnet_cls_pred_fpn3_w not found
INFO net.py: 88: retnet_cls_pred_fpn3_b not found
INFO net.py: 88: retnet_bbox_conv_n0_fpn3_w not found
INFO net.py: 88: retnet_bbox_conv_n0_fpn3_b not found
INFO net.py: 88: retnet_bbox_conv_n1_fpn3_w not found
INFO net.py: 88: retnet_bbox_conv_n1_fpn3_b not found
INFO net.py: 88: retnet_bbox_conv_n2_fpn3_w not found
INFO net.py: 88: retnet_bbox_conv_n2_fpn3_b not found
INFO net.py: 88: retnet_bbox_conv_n3_fpn3_w not found
INFO net.py: 88: retnet_bbox_conv_n3_fpn3_b not found
INFO net.py: 88: retnet_bbox_pred_fpn3_w not found
INFO net.py: 88: retnet_bbox_pred_fpn3_b not found
E0714 21:52:44.576812 23172 operator_schema.cc:72] 
Input index 0 and output idx 0 (gpu_0/res3_0_branch2a_w_grad) are set to be in-place but this is actually not supported by op Copy
Original python traceback for operator 1716 in network retinanet in exception above (most recent call last):
File "tools/train_net.py", line 128, in <module>
File "tools/train_net.py", line 110, in main
File "/home/jinghan/Research/detectron/detectron/utils/train.py", line 54, in train_model
File "/home/jinghan/Research/detectron/detectron/utils/train.py", line 139, in create_model
File "/home/jinghan/Research/detectron/detectron/modeling/model_builder.py", line 124, in create
File "/home/jinghan/Research/detectron/detectron/modeling/model_builder.py", line 100, in retinanet
File "/home/jinghan/Research/detectron/detectron/modeling/model_builder.py", line 360, in build_generic_retinanet_model
File "/home/jinghan/Research/detectron/detectron/modeling/optimizer.py", line 44, in build_data_parallel_model
File "/home/jinghan/Research/detectron/detectron/modeling/optimizer.py", line 87, in _add_allreduce_graph
File "/home/yuz/work/caffe2/build/caffe2/python/muji.py", line 59, in Allreduce
File "/home/yuz/work/caffe2/build/caffe2/python/muji.py", line 244, in AllreduceFallback
Traceback (most recent call last):
File "tools/train_net.py", line 128, in <module>
main()
File "tools/train_net.py", line 110, in main
checkpoints = detectron.utils.train.train_model()
File "/home/jinghan/Research/detectron/detectron/utils/train.py", line 59, in train_model
setup_model_for_training(model, weights_file, output_dir)
File "/home/jinghan/Research/detectron/detectron/utils/train.py", line 172, in setup_model_for_training
workspace.CreateNet(model.net)
File "/home/yuz/work/caffe2/build/caffe2/python/workspace.py", line 166, in CreateNet
StringifyProto(net), overwrite,
File "/home/yuz/work/caffe2/build/caffe2/python/workspace.py", line 192, in CallWithExceptionIntercept
return func(*args, **kwargs)
RuntimeError: [enforce fail at operator.cc:125] schema->Verify(operator_def). Operator def did not pass schema checking: input: "gpu_0/res3_0_branch2a_w_grad" output: "gpu_0/res3_0_branch2a_w_grad" name: "" type: "Copy" device_option { device_type: 1 cuda_gpu_id: 0 }
Also, when I use only one GPU, there is no problem. Using only GPUs No. 2 and 3 works without problems as well.
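For reference, CUDA renumbers the devices listed in CUDA_VISIBLE_DEVICES starting from 0, which is presumably why the log above reports GPU 0, GPU 1, and GPU 2 rather than 0, 2, and 3. Here is a minimal sketch of that mapping. The helper name `visible_device_map` is hypothetical (it is not part of Detectron or Caffe2), and it assumes the variable holds plain integer indices, not GPU UUIDs:

```python
def visible_device_map(env_value):
    """Return {logical index seen by the process: physical GPU index}.

    Hypothetical helper for illustration only. Assumes env_value is a
    comma-separated list of integer GPU indices, e.g. "0,2,3".
    """
    physical = [int(tok) for tok in env_value.split(",") if tok.strip()]
    # The i-th entry in the list becomes logical device i inside the process.
    return dict(enumerate(physical))


if __name__ == "__main__":
    # Same value as in the command above.
    for logical, physical in sorted(visible_device_map("0,2,3").items()):
        print("logical GPU %d -> physical GPU %d" % (logical, physical))
```

So with CUDA_VISIBLE_DEVICES=0,2,3 the process still sees three devices numbered 0 to 2, which matches NUM_GPUS: 3.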