facebookresearch / Detectron

FAIR's research platform for object detection research, implementing popular algorithms like Mask R-CNN and RetinaNet.
Apache License 2.0

RuntimeError: CUDA error: invalid device ordinal (exchangeDevice at /pytorch/c10/cuda/impl/CUDAGuardImpl.h:29) #897

Open chenliqiong opened 5 years ago

chenliqiong commented 5 years ago

Expected results

Train RetinaNet R-50-FPN with Detectron on my own dataset.

Actual results

RuntimeError: CUDA error: invalid device ordinal (exchangeDevice at /pytorch/c10/cuda/impl/CUDAGuardImpl.h:29)

/home/clq/software/anaconda2/envs/caffe2_py27/bin/python2.7 /home/clq/code/ObjectDetection/detectron/tools/train_net.py --cfg /home/clq/code/ObjectDetection/detectron/experiments/retinanet_R-50-FPN_1x.yaml OUTPUT_DIR /home/clq/code/ObjectDetection/detectron/experiments/result

Found Detectron ops lib: /home/clq/software/anaconda2/envs/caffe2_py27/lib/python2.7/site-packages/torch/lib/libcaffe2_detectron_ops_gpu.so [E init_intrinsics_check.cc:43] CPU feature avx is present on your machine, but the Caffe2 binary is not compiled with it. It means you may not get the full speed of your CPU. [E init_intrinsics_check.cc:43] CPU feature avx2 is present on your machine, but the Caffe2 binary is not compiled with it. It means you may not get the full speed of your CPU. [E init_intrinsics_check.cc:43] CPU feature fma is present on your machine, but the Caffe2 binary is not compiled with it. It means you may not get the full speed of your CPU. Found Detectron ops lib: /home/clq/software/anaconda2/envs/caffe2_py27/lib/python2.7/site-packages/torch/lib/libcaffe2_detectron_ops_gpu.so INFO train_net.py: 106: Called with args: INFO train_net.py: 107: Namespace(cfg_file='/home/clq/code/ObjectDetection/detectron/experiments/retinanet_R-50-FPN_1x.yaml', multi_gpu_testing=False, opts=['OUTPUT_DIR', '/home/clq/code/ObjectDetection/detectron/experiments/result'], skip_test=False) INFO train_net.py: 113: Training with config: INFO train_net.py: 114: {'BBOX_XFORM_CLIP': 4.135166556742356, 'CLUSTER': {'ON_CLUSTER': False}, 'DATA_LOADER': {'BLOBS_QUEUE_CAPACITY': 8, 'MINIBATCH_QUEUE_SIZE': 64, 'NUM_THREADS': 4}, 'DEDUP_BOXES': 0.0625, 'DOWNLOAD_CACHE': u'/tmp/detectron-download-cache', 'EPS': 1e-14, 'EXPECTED_RESULTS': [], 'EXPECTED_RESULTS_ATOL': 0.005, 'EXPECTED_RESULTS_EMAIL': u'', 'EXPECTED_RESULTS_RTOL': 0.1, 'EXPECTED_RESULTS_SIGMA_TOL': 4, 'FAST_RCNN': {'CONV_HEAD_DIM': 256, 'MLP_HEAD_DIM': 1024, 'NUM_STACKED_CONVS': 4, 'ROI_BOX_HEAD': u'', 'ROI_XFORM_METHOD': u'RoIPoolF', 'ROI_XFORM_RESOLUTION': 14, 'ROI_XFORM_SAMPLING_RATIO': 0}, 'FPN': {'COARSEST_STRIDE': 128, 'DIM': 256, 'EXTRA_CONV_LEVELS': True, 'FPN_ON': True, 'MULTILEVEL_ROIS': False, 'MULTILEVEL_RPN': True, 'ROI_CANONICAL_LEVEL': 4, 'ROI_CANONICAL_SCALE': 224, 'ROI_MAX_LEVEL': 5, 'ROI_MIN_LEVEL': 
2, 'RPN_ANCHOR_START_SIZE': 32, 'RPN_ASPECT_RATIOS': (0.5, 1, 2), 'RPN_MAX_LEVEL': 7, 'RPN_MIN_LEVEL': 3, 'USE_GN': False, 'ZERO_INIT_LATERAL': False}, 'GROUP_NORM': {'DIM_PER_GP': -1, 'EPSILON': 1e-05, 'NUM_GROUPS': 32}, 'KRCNN': {'CONV_HEAD_DIM': 256, 'CONV_HEAD_KERNEL': 3, 'CONV_INIT': u'GaussianFill', 'DECONV_DIM': 256, 'DECONV_KERNEL': 4, 'DILATION': 1, 'HEATMAP_SIZE': -1, 'INFERENCE_MIN_SIZE': 0, 'KEYPOINT_CONFIDENCE': u'bbox', 'LOSS_WEIGHT': 1.0, 'MIN_KEYPOINT_COUNT_FOR_VALID_MINIBATCH': 20, 'NMS_OKS': False, 'NORMALIZE_BY_VISIBLE_KEYPOINTS': True, 'NUM_KEYPOINTS': -1, 'NUM_STACKED_CONVS': 8, 'ROI_KEYPOINTS_HEAD': u'', 'ROI_XFORM_METHOD': u'RoIAlign', 'ROI_XFORM_RESOLUTION': 7, 'ROI_XFORM_SAMPLING_RATIO': 0, 'UP_SCALE': -1, 'USE_DECONV': False, 'USE_DECONV_OUTPUT': False}, 'MATLAB': u'matlab', 'MEMONGER': True, 'MEMONGER_SHARE_ACTIVATIONS': False, 'MODEL': {'BBOX_REG_WEIGHTS': (10.0, 10.0, 5.0, 5.0), 'CLS_AGNOSTIC_BBOX_REG': False, 'CONV_BODY': 'FPN.add_fpn_ResNet50_conv5_body', 'EXECUTION_TYPE': u'dag', 'FASTER_RCNN': False, 'KEYPOINTS_ON': False, 'MASK_ON': False, 'NUM_CLASSES': 4, 'RPN_ONLY': False, 'TYPE': 'retinanet'}, 'MRCNN': {'CLS_SPECIFIC_MASK': True, 'CONV_INIT': u'GaussianFill', 'DILATION': 2, 'DIM_REDUCED': 256, 'RESOLUTION': 14, 'ROI_MASK_HEAD': u'', 'ROI_XFORM_METHOD': u'RoIAlign', 'ROI_XFORM_RESOLUTION': 7, 'ROI_XFORM_SAMPLING_RATIO': 0, 'THRESH_BINARIZE': 0.5, 'UPSAMPLE_RATIO': 1, 'USE_FC_OUTPUT': False, 'WEIGHT_LOSS_MASK': 1.0}, 'NUM_GPUS': 8, 'OUTPUT_DIR': '/home/clq/code/ObjectDetection/detectron/experiments/result', 'PIXEL_MEANS': array([[[102.9801, 115.9465, 122.7717]]]), 'RESNETS': {'NUM_GROUPS': 1, 'RES5_DILATION': 1, 'SHORTCUT_FUNC': u'basic_bn_shortcut', 'STEM_FUNC': u'basic_bn_stem', 'STRIDE_1X1': True, 'TRANS_FUNC': u'bottleneck_transformation', 'WIDTH_PER_GROUP': 64}, 'RETINANET': {'ANCHOR_SCALE': 4, 'ASPECT_RATIOS': (1.0, 2.0, 0.5), 'BBOX_REG_BETA': 0.11, 'BBOX_REG_WEIGHT': 1.0, 'CLASS_SPECIFIC_BBOX': False, 'INFERENCE_TH': 0.05, 
'LOSS_ALPHA': 0.25, 'LOSS_GAMMA': 2.0, 'NEGATIVE_OVERLAP': 0.4, 'NUM_CONVS': 4, 'POSITIVE_OVERLAP': 0.5, 'PRE_NMS_TOP_N': 1000, 'PRIOR_PROB': 0.01, 'RETINANET_ON': True, 'SCALES_PER_OCTAVE': 3, 'SHARE_CLS_BBOX_TOWER': False, 'SOFTMAX': False}, 'RFCN': {'PS_GRID_SIZE': 3}, 'RNG_SEED': 3, 'ROOT_DIR': '/home/clq/code/ObjectDetection/detectron', 'RPN': {'ASPECT_RATIOS': (0.5, 1, 2), 'RPN_ON': False, 'SIZES': (64, 128, 256, 512), 'STRIDE': 16}, 'SOLVER': {'BASE_LR': 0.01, 'GAMMA': 0.1, 'LOG_LR_CHANGE_THRESHOLD': 1.1, 'LRS': [], 'LR_POLICY': 'steps_with_decay', 'MAX_ITER': 10000, 'MOMENTUM': 0.9, 'SCALE_MOMENTUM': True, 'SCALE_MOMENTUM_THRESHOLD': 1.1, 'STEPS': [0, 5000, 8000], 'STEP_SIZE': 30000, 'WARM_UP_FACTOR': 0.3333333333333333, 'WARM_UP_ITERS': 500, 'WARM_UP_METHOD': u'linear', 'WEIGHT_DECAY': 0.0001, 'WEIGHT_DECAY_GN': 0.0}, 'TEST': {'BBOX_AUG': {'AREA_TH_HI': 32400, 'AREA_TH_LO': 2500, 'ASPECT_RATIOS': (), 'ASPECT_RATIO_H_FLIP': False, 'COORD_HEUR': u'UNION', 'ENABLED': False, 'H_FLIP': False, 'MAX_SIZE': 4000, 'SCALES': (), 'SCALE_H_FLIP': False, 'SCALE_SIZE_DEP': False, 'SCORE_HEUR': u'UNION'}, 'BBOX_REG': True, 'BBOX_VOTE': {'ENABLED': False, 'SCORING_METHOD': u'ID', 'SCORING_METHOD_BETA': 1.0, 'VOTE_TH': 0.8}, 'COMPETITION_MODE': True, 'DATASETS': ('voc_2007_val',), 'DETECTIONS_PER_IM': 100, 'FORCE_JSON_DATASET_EVAL': False, 'GENERATE_PROPOSALS_ON_GPU': False, 'KPS_AUG': {'AREA_TH': 32400, 'ASPECT_RATIOS': (), 'ASPECT_RATIO_H_FLIP': False, 'ENABLED': False, 'HEUR': u'HM_AVG', 'H_FLIP': False, 'MAX_SIZE': 4000, 'SCALES': (), 'SCALE_H_FLIP': False, 'SCALE_SIZE_DEP': False}, 'MASK_AUG': {'AREA_TH': 32400, 'ASPECT_RATIOS': (), 'ASPECT_RATIO_H_FLIP': False, 'ENABLED': False, 'HEUR': u'SOFT_AVG', 'H_FLIP': False, 'MAX_SIZE': 4000, 'SCALES': (), 'SCALE_H_FLIP': False, 'SCALE_SIZE_DEP': False}, 'MAX_SIZE': 1333, 'NMS': 0.5, 'PRECOMPUTED_PROPOSALS': False, 'PROPOSAL_FILES': (), 'PROPOSAL_LIMIT': 2000, 'RPN_MIN_SIZE': 0, 'RPN_NMS_THRESH': 0.7, 'RPN_POST_NMS_TOP_N': 
2000, 'RPN_PRE_NMS_TOP_N': 10000, 'SCALE': 800, 'SCORE_THRESH': 0.05, 'SOFT_NMS': {'ENABLED': False, 'METHOD': u'linear', 'SIGMA': 0.5}, 'WEIGHTS': u''}, 'TRAIN': {'ASPECT_GROUPING': True, 'AUTO_RESUME': True, 'BATCH_SIZE_PER_IM': 64, 'BBOX_THRESH': 0.5, 'BG_THRESH_HI': 0.5, 'BG_THRESH_LO': 0.0, 'COPY_WEIGHTS': False, 'CROWD_FILTER_THRESH': 0.7, 'DATASETS': ('voc_2007_train',), 'FG_FRACTION': 0.25, 'FG_THRESH': 0.5, 'FREEZE_AT': 2, 'FREEZE_CONV_BODY': False, 'GENERATE_PROPOSALS_ON_GPU': False, 'GT_MIN_AREA': -1, 'IMS_PER_BATCH': 2, 'MAX_SIZE': 1333, 'PROPOSAL_FILES': (), 'RPN_BATCH_SIZE_PER_IM': 256, 'RPN_FG_FRACTION': 0.5, 'RPN_MIN_SIZE': 0, 'RPN_NEGATIVE_OVERLAP': 0.3, 'RPN_NMS_THRESH': 0.7, 'RPN_POSITIVE_OVERLAP': 0.7, 'RPN_POST_NMS_TOP_N': 2000, 'RPN_PRE_NMS_TOP_N': 12000, 'RPN_STRADDLE_THRESH': -1, 'SCALES': (800,), 'SNAPSHOT_ITERS': 80000, 'USE_FLIPPED': True, 'WEIGHTS': '/home/clq/code/ObjectDetection/detectron/experiments/R-50.pkl'}, 'USE_NCCL': False, 'VIS': False, 'VIS_TH': 0.9} INFO train_net.py: 207: Building model: retinanet WARNING cnn.py: 25: [====DEPRECATE WARNING====]: you are creating an object from CNNModelHelper class which will be deprecated soon. Please use ModelHelper object with brew module. For more information, please refer to caffe2.ai and python/brew.py, python/brew_test.py for more information. WARNING memonger.py: 55: NOTE: Executing memonger to optimize gradient memory [I memonger.cc:236] Remapping 122 using 26 shared blobs. INFO memonger.py: 97: Memonger memory optimization took 0.0551738739014 secs WARNING memonger.py: 55: NOTE: Executing memonger to optimize gradient memory [I memonger.cc:236] Remapping 122 using 26 shared blobs. INFO memonger.py: 97: Memonger memory optimization took 0.0511929988861 secs WARNING memonger.py: 55: NOTE: Executing memonger to optimize gradient memory [I memonger.cc:236] Remapping 122 using 26 shared blobs. 
INFO memonger.py: 97: Memonger memory optimization took 0.0504541397095 secs WARNING memonger.py: 55: NOTE: Executing memonger to optimize gradient memory [I memonger.cc:236] Remapping 122 using 26 shared blobs. INFO memonger.py: 97: Memonger memory optimization took 0.0508370399475 secs WARNING memonger.py: 55: NOTE: Executing memonger to optimize gradient memory [I memonger.cc:236] Remapping 122 using 26 shared blobs. INFO memonger.py: 97: Memonger memory optimization took 0.0502688884735 secs WARNING memonger.py: 55: NOTE: Executing memonger to optimize gradient memory [I memonger.cc:236] Remapping 122 using 26 shared blobs. INFO memonger.py: 97: Memonger memory optimization took 0.0515651702881 secs WARNING memonger.py: 55: NOTE: Executing memonger to optimize gradient memory [I memonger.cc:236] Remapping 122 using 26 shared blobs. INFO memonger.py: 97: Memonger memory optimization took 0.0507099628448 secs WARNING memonger.py: 55: NOTE: Executing memonger to optimize gradient memory [I memonger.cc:236] Remapping 122 using 26 shared blobs. 
INFO memonger.py: 97: Memonger memory optimization took 0.050724029541 secs WARNING workspace.py: 220: Original python traceback for operator 195 in network retinanet_init in exception above (most recent call last): WARNING workspace.py: 225: File "/home/clq/code/ObjectDetection/detectron/tools/train_net.py", line 283, in WARNING workspace.py: 225: File "/home/clq/code/ObjectDetection/detectron/tools/train_net.py", line 121, in main WARNING workspace.py: 225: File "/home/clq/code/ObjectDetection/detectron/tools/train_net.py", line 130, in train_model WARNING workspace.py: 225: File "/home/clq/code/ObjectDetection/detectron/tools/train_net.py", line 208, in create_model WARNING workspace.py: 225: File "/home/clq/code/ObjectDetection/detectron/detectron/modeling/model_builder.py", line 124, in create WARNING workspace.py: 225: File "/home/clq/code/ObjectDetection/detectron/detectron/modeling/model_builder.py", line 100, in retinanet WARNING workspace.py: 225: File "/home/clq/code/ObjectDetection/detectron/detectron/modeling/model_builder.py", line 360, in build_generic_retinanet_model WARNING workspace.py: 225: File "/home/clq/code/ObjectDetection/detectron/detectron/modeling/optimizer.py", line 40, in build_data_parallel_model WARNING workspace.py: 225: File "/home/clq/code/ObjectDetection/detectron/detectron/modeling/optimizer.py", line 63, in _build_forward_graph WARNING workspace.py: 225: File "/home/clq/code/ObjectDetection/detectron/detectron/modeling/model_builder.py", line 348, in _single_gpu_build_func WARNING workspace.py: 225: File "/home/clq/code/ObjectDetection/detectron/detectron/modeling/FPN.py", line 48, in add_fpn_ResNet50_conv5_body WARNING workspace.py: 225: File "/home/clq/code/ObjectDetection/detectron/detectron/modeling/FPN.py", line 104, in add_fpn_onto_conv_body WARNING workspace.py: 225: File "/home/clq/code/ObjectDetection/detectron/detectron/modeling/ResNet.py", line 40, in add_ResNet50_conv5_body WARNING workspace.py: 225: File 
"/home/clq/code/ObjectDetection/detectron/detectron/modeling/ResNet.py", line 99, in add_ResNet_convX_body WARNING workspace.py: 225: File "/home/clq/code/ObjectDetection/detectron/detectron/modeling/ResNet.py", line 252, in basic_bn_stem WARNING workspace.py: 225: File "/home/clq/software/anaconda2/envs/caffe2_py27/lib/python2.7/site-packages/caffe2/python/cnn.py", line 97, in Conv WARNING workspace.py: 225: File "/home/clq/software/anaconda2/envs/caffe2_py27/lib/python2.7/site-packages/caffe2/python/brew.py", line 108, in scope_wrapper WARNING workspace.py: 225: File "/home/clq/software/anaconda2/envs/caffe2_py27/lib/python2.7/site-packages/caffe2/python/helpers/conv.py", line 186, in conv WARNING workspace.py: 225: File "/home/clq/software/anaconda2/envs/caffe2_py27/lib/python2.7/site-packages/caffe2/python/helpers/conv.py", line 88, in _ConvBase WARNING workspace.py: 225: File "/home/clq/software/anaconda2/envs/caffe2_py27/lib/python2.7/site-packages/caffe2/python/model_helper.py", line 216, in create_param WARNING workspace.py: 225: File "/home/clq/software/anaconda2/envs/caffe2_py27/lib/python2.7/site-packages/caffe2/python/modeling/initializers.py", line 30, in create_param Traceback (most recent call last): File "/home/clq/code/ObjectDetection/detectron/tools/train_net.py", line 283, in main() File "/home/clq/code/ObjectDetection/detectron/tools/train_net.py", line 121, in main checkpoints = train_model() File "/home/clq/code/ObjectDetection/detectron/tools/train_net.py", line 130, in train_model model, start_iter, checkpoints, output_dir = create_model() File "/home/clq/code/ObjectDetection/detectron/tools/train_net.py", line 212, in create_model workspace.RunNetOnce(model.param_init_net) File "/home/clq/software/anaconda2/envs/caffe2_py27/lib/python2.7/site-packages/caffe2/python/workspace.py", line 234, in RunNetOnce StringifyProto(net), File "/home/clq/software/anaconda2/envs/caffe2_py27/lib/python2.7/site-packages/caffe2/python/workspace.py", line 213, 
in CallWithExceptionIntercept return func(*args, *kwargs) RuntimeError: CUDA error: invalid device ordinal (exchangeDevice at /pytorch/c10/cuda/impl/CUDAGuardImpl.h:29) frame #0: std::function<std::string ()>::operator()() const + 0x11 (0x7f0f78208931 in /home/clq/software/anaconda2/envs/caffe2_py27/lib/python2.7/site-packages/caffe2/python/../../torch/lib/libc10.so) frame #1: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x2a (0x7f0f78207f8a in /home/clq/software/anaconda2/envs/caffe2_py27/lib/python2.7/site-packages/caffe2/python/../../torch/lib/libc10.so) frame #2: + 0x1290210 (0x7f0f1caee210 in /home/clq/software/anaconda2/envs/caffe2_py27/lib/python2.7/site-packages/caffe2/python/../../torch/lib/libcaffe2_gpu.so) frame #3: + 0x15d543c (0x7f0f1ce3343c in /home/clq/software/anaconda2/envs/caffe2_py27/lib/python2.7/site-packages/caffe2/python/../../torch/lib/libcaffe2_gpu.so) frame #4: caffe2::EventCreateCUDA(caffe2::DeviceOption const&, caffe2::Event) + 0x3f (0x7f0f1ce31b8f in /home/clq/software/anaconda2/envs/caffe2_py27/lib/python2.7/site-packages/caffe2/python/../../torch/lib/libcaffe2_gpu.so) frame #5: + 0x1c3f267 (0x7f0f51686267 in /home/clq/software/anaconda2/envs/caffe2_py27/lib/python2.7/site-packages/caffe2/python/../../torch/lib/libcaffe2.so) frame #6: caffe2::OperatorBase::OperatorBase(caffe2::OperatorDef const&, caffe2::Workspace) + 0x1f5 (0x7f0f5167f345 in /home/clq/software/anaconda2/envs/caffe2_py27/lib/python2.7/site-packages/caffe2/python/../../torch/lib/libcaffe2.so) frame #7: caffe2::FillerOp::FillerOp<caffe2::OperatorDef const&, caffe2::Workspace&>(caffe2::OperatorDef const&, caffe2::Workspace&) + 0x1f (0x7f0f1e2c417f in /home/clq/software/anaconda2/envs/caffe2_py27/lib/python2.7/site-packages/caffe2/python/../../torch/lib/libcaffe2_gpu.so) frame #8: + 0x2a66de0 (0x7f0f1e2c4de0 in /home/clq/software/anaconda2/envs/caffe2_py27/lib/python2.7/site-packages/caffe2/python/../../torch/lib/libcaffe2_gpu.so) frame #9: 
std::_Function_handler<std::unique_ptr<caffe2::OperatorBase, std::default_delete > (caffe2::OperatorDef const&, caffe2::Workspace), std::unique_ptr<caffe2::OperatorBase, std::default_delete > ()(caffe2::OperatorDef const&, caffe2::Workspace)>::_M_invoke(std::_Any_data const&, caffe2::OperatorDef const&, caffe2::Workspace) + 0xf (0x7f0f786aa66f in /home/clq/software/anaconda2/envs/caffe2_py27/lib/python2.7/site-packages/caffe2/python/caffe2_pybind11_state_gpu.so) frame #10: + 0x1c34fdf (0x7f0f5167bfdf in /home/clq/software/anaconda2/envs/caffe2_py27/lib/python2.7/site-packages/caffe2/python/../../torch/lib/libcaffe2.so) frame #11: caffe2::CreateOperator(caffe2::OperatorDef const&, caffe2::Workspace, int) + 0x310 (0x7f0f5167dbb0 in /home/clq/software/anaconda2/envs/caffe2_py27/lib/python2.7/site-packages/caffe2/python/../../torch/lib/libcaffe2.so) frame #12: caffe2::SimpleNet::SimpleNet(std::shared_ptr const&, caffe2::Workspace) + 0x2e2 (0x7f0f51674df2 in /home/clq/software/anaconda2/envs/caffe2_py27/lib/python2.7/site-packages/caffe2/python/../../torch/lib/libcaffe2.so) frame #13: + 0x1c316ae (0x7f0f516786ae in /home/clq/software/anaconda2/envs/caffe2_py27/lib/python2.7/site-packages/caffe2/python/../../torch/lib/libcaffe2.so) frame #14: std::_Function_handler<std::unique_ptr<caffe2::NetBase, std::default_delete > (std::shared_ptr const&, caffe2::Workspace), std::unique_ptr<caffe2::NetBase, std::default_delete > ()(std::shared_ptr const&, caffe2::Workspace)>::_M_invoke(std::_Any_data const&, std::shared_ptr const&, caffe2::Workspace) + 0xf (0x7f0f516536af in /home/clq/software/anaconda2/envs/caffe2_py27/lib/python2.7/site-packages/caffe2/python/../../torch/lib/libcaffe2.so) frame #15: caffe2::CreateNet(std::shared_ptr const&, caffe2::Workspace) + 0x6c0 (0x7f0f51646150 in /home/clq/software/anaconda2/envs/caffe2_py27/lib/python2.7/site-packages/caffe2/python/../../torch/lib/libcaffe2.so) frame #16: caffe2::CreateNet(caffe2::NetDef const&, caffe2::Workspace*) + 0x89 
(0x7f0f51646809 in /home/clq/software/anaconda2/envs/caffe2_py27/lib/python2.7/site-packages/caffe2/python/../../torch/lib/libcaffe2.so) frame #17: caffe2::Workspace::RunNetOnce(caffe2::NetDef const&) + 0x1f (0x7f0f516aeb6f in /home/clq/software/anaconda2/envs/caffe2_py27/lib/python2.7/site-packages/caffe2/python/../../torch/lib/libcaffe2.so) frame #18: + 0x5903f (0x7f0f786a403f in /home/clq/software/anaconda2/envs/caffe2_py27/lib/python2.7/site-packages/caffe2/python/caffe2_pybind11_state_gpu.so) frame #19: + 0x9426e (0x7f0f786df26e in /home/clq/software/anaconda2/envs/caffe2_py27/lib/python2.7/site-packages/caffe2/python/caffe2_pybind11_state_gpu.so)

frame #37: __libc_start_main + 0xe7 (0x7f0f83f57b97 in /lib/x86_64-linux-gnu/libc.so.6)
frame #38: + 0x107f (0x55611a23f07f in /home/clq/software/anaconda2/envs/caffe2_py27/bin/python2.7)

Process finished with exit code 1

### Detailed steps to reproduce

```
(caffe2_py27) clq@clq-Linux-System-Product:~/code/ObjectDetection/detectron$ python2 tools/train_net.py --cfg experiments/retinanet_R-50-FPN_1x.yaml OUTPUT_DIR experiments/result
```

### System information

* Operating system: Ubuntu 18.04
* Compiler version: ?
* CUDA version: 9.0.176
* cuDNN version: 7.0.5
* NVIDIA driver version: 430
* GPU models (for all devices if they are not all the same): single GPU, GTX 1080 Ti
* `PYTHONPATH` environment variable: ?
* `python --version` output: Python 2.7.16 :: Anaconda, Inc.
* Anything else that seems relevant: When I ran `python tools/infer_simple.py --cfg configs/12_2017_baselines/e2e_mask_rcnn_R-101-FPN_2x.yaml --output-dir /tmp/detectron-visualizations --image-ext jpg --wts https://dl.fbaipublicfiles.com/detectron/35861858/12_2017_baselines/e2e_mask_rcnn_R-101-FPN_2x.yaml.02_32_51.SgT4y1cO/output/train/coco_2014_train:coco_2014_valminusminival/generalized_rcnn/model_final.pkl demo`, the command worked and produced Mask R-CNN results, and the GPU was in use during inference. I don't know why train_net.py can't run successfully.

SnowRipple commented 5 years ago

Had a similar issue. I changed `NUM_GPUS` to 1 (instead of 8).
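
For anyone hitting the same error: the logged config shows `'NUM_GPUS': 8` while the system information lists a single GPU, which is consistent with "invalid device ordinal" (a device index beyond the devices actually present). Below is a minimal Python sketch of a pre-flight check, assuming `nvidia-smi` is on the PATH. The helper names `count_visible_gpus`, `clamp_num_gpus`, and `override_opts` are hypothetical and not part of Detectron.

```python
import subprocess


def count_visible_gpus():
    """Count devices listed by `nvidia-smi -L` (one "GPU n: ..." line each).

    Returns 0 if nvidia-smi is missing or fails. Assumption: nvidia-smi
    is installed; it does not take CUDA_VISIBLE_DEVICES into account.
    """
    try:
        out = subprocess.check_output(["nvidia-smi", "-L"])
    except (OSError, subprocess.CalledProcessError):
        return 0
    lines = out.decode("utf-8", "replace").splitlines()
    return sum(1 for line in lines if line.startswith("GPU "))


def clamp_num_gpus(requested, available):
    """Return a NUM_GPUS value that never exceeds the devices present."""
    if available < 1:
        raise RuntimeError("no CUDA device visible")
    return min(requested, available)


def override_opts(num_gpus):
    """Build a trailing KEY VALUE pair in the style train_net.py logs as `opts`."""
    return ["NUM_GPUS", str(num_gpus)]


if __name__ == "__main__":
    available = count_visible_gpus()
    if available:
        # 8 is the value shown in the logged config above.
        print(" ".join(override_opts(clamp_num_gpus(8, available))))
    else:
        print("no GPU reported by nvidia-smi")
```

On a single-GPU machine this should print a `NUM_GPUS 1` pair, which can be appended to the training command the same way `OUTPUT_DIR` is passed, or set directly in the YAML config.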