VDIGPKU / CBNet_caffe

Composite Backbone Network (AAAI20)
Apache License 2.0

RuntimeError: [enforce fail at operator.cc:75] blob != nullptr. op Conv: Encountered a non-existing input blob: gpu_0/old_res3_7_sum #3

Open carryyu opened 5 years ago

carryyu commented 5 years ago

I don't have 8 GPUs, so I changed NUM_GPUS to 2, and it raises this error. How can I fix it?

I use e2e_cascade_rcnn_X-101-64x4d-FPN_1x.yaml. I changed it like this:

    MODEL:
      TYPE: generalized_rcnn
      CONV_BODY: FPN.add_fpn_ResNet101_conv5_body
      NUM_CLASSES: 21
      FASTER_RCNN: True
      CASCADE_ON: True
      CLS_AGNOSTIC_BBOX_REG: True  # default: False
    NUM_GPUS: 2
    SOLVER:
      WEIGHT_DECAY: 0.0001
      LR_POLICY: steps_with_decay
      BASE_LR: 0.01
      GAMMA: 0.1
      MAX_ITER: 180000
      STEPS: [0, 120000, 160000]
    FPN:
      FPN_ON: True
      MULTILEVEL_ROIS: True
      MULTILEVEL_RPN: True
    RESNETS:
      STRIDE_1X1: False  # default True for MSRA; False for C2 or Torch models
      TRANS_FUNC: bottleneck_transformation
      NUM_GROUPS: 64
      WIDTH_PER_GROUP: 4
    FAST_RCNN:
      ROI_BOX_HEAD: fast_rcnn_heads.add_roi_2mlp_head
      ROI_XFORM_METHOD: RoIAlign
      ROI_XFORM_RESOLUTION: 7
      ROI_XFORM_SAMPLING_RATIO: 2
    CASCADE_RCNN:
      ROI_BOX_HEAD: cascade_rcnn_heads.add_roi_2mlp_head
      NUM_STAGE: 3
      TEST_STAGE: 3
      TEST_ENSEMBLE: True
    TRAIN:
      WEIGHTS: https://dl.fbaipublicfiles.com/detectron/ImageNetPretrained/FBResNeXt/X-101-64x4d.pkl
      DATASETS: ('coco_2014_train', 'coco_2014_valminusminival')
      SCALES: (800,)
      MAX_SIZE: 1333
      IMS_PER_BATCH: 1
      BATCH_SIZE_PER_IM: 512
      RPN_PRE_NMS_TOP_N: 2000  # Per FPN level
    TEST:
      DATASETS: ('coco_2014_valminusminival',)
      SCALE: 800
      MAX_SIZE: 1333
      NMS: 0.5
      RPN_PRE_NMS_TOP_N: 1000  # Per FPN level
      RPN_POST_NMS_TOP_N: 1000
    OUTPUT_DIR: .

The error:

    [W workspace.cc:170] Blob gpu_0/old_res3_7_sum not in the workspace.
    WARNING workspace.py: 222: Original python traceback for operator 383 in network generalized_rcnn in exception above (most recent call last):
    WARNING workspace.py: 227: File "/home/lzy/diverse/CBNet/tools/train_net.py", line 133, in <module>
    WARNING workspace.py: 227: File "/home/lzy/diverse/CBNet/tools/train_net.py", line 115, in main
    WARNING workspace.py: 227: File "/home/lzy/diverse/CBNet/detectron/utils/train.py", line 53, in train_model
    WARNING workspace.py: 227: File "/home/lzy/diverse/CBNet/detectron/utils/train.py", line 145, in create_model
    WARNING workspace.py: 227: File "/home/lzy/diverse/CBNet/detectron/modeling/model_builder.py", line 127, in create
    WARNING workspace.py: 227: File "/home/lzy/diverse/CBNet/detectron/modeling/model_builder.py", line 91, in generalized_rcnn
    WARNING workspace.py: 227: File "/home/lzy/diverse/CBNet/detectron/modeling/model_builder.py", line 259, in build_generic_detection_model
    WARNING workspace.py: 227: File "/home/lzy/diverse/CBNet/detectron/modeling/optimizer.py", line 40, in build_data_parallel_model
    WARNING workspace.py: 227: File "/home/lzy/diverse/CBNet/detectron/modeling/optimizer.py", line 63, in _build_forward_graph
    WARNING workspace.py: 227: File "/home/lzy/diverse/CBNet/detectron/modeling/model_builder.py", line 189, in _single_gpu_build_func
    WARNING workspace.py: 227: File "/home/lzy/diverse/CBNet/detectron/modeling/FPN.py", line 64, in add_fpn_ResNet101_conv5_body
    WARNING workspace.py: 227: File "/home/lzy/diverse/CBNet/detectron/modeling/FPN.py", line 112, in add_fpn_onto_conv_body
    WARNING workspace.py: 227: File "/home/lzy/diverse/CBNet/detectron/modeling/ResNet.py", line 48, in add_ResNet101_conv5_body
    WARNING workspace.py: 227: File "/home/lzy/diverse/CBNet/detectron/modeling/ResNet.py", line 145, in add_ResNet_convX_body
    Traceback (most recent call last):
      File "/home/lzy/diverse/CBNet/tools/train_net.py", line 133, in <module>
        main()
      File "/home/lzy/diverse/CBNet/tools/train_net.py", line 115, in main
        checkpoints = detectron.utils.train.train_model()
      File "/home/lzy/diverse/CBNet/detectron/utils/train.py", line 58, in train_model
        setup_model_for_training(model, weights_file, output_dir)
      File "/home/lzy/diverse/CBNet/detectron/utils/train.py", line 179, in setup_model_for_training
        workspace.CreateNet(model.net)
      File "/home/lzy/pytorch/build/caffe2/python/workspace.py", line 181, in CreateNet
        StringifyProto(net), overwrite,
      File "/home/lzy/pytorch/build/caffe2/python/workspace.py", line 215, in CallWithExceptionIntercept
        return func(*args, **kwargs)
    RuntimeError: [enforce fail at operator.cc:75] blob != nullptr. op Conv: Encountered a non-existing input blob: gpu_0/old_res3_7_sum
    frame #0: c10::ThrowEnforceNotMet(char const, int, char const, std::__cxx11::basic_string<char, std::char_traits, std::allocator > const&, void const) + 0x76 (0x7f916475ed36 in /home/lzy/pytorch/build/lib/libc10.so)
    frame #1: caffe2::OperatorBase::OperatorBase(caffe2::OperatorDef const&, caffe2::Workspace) + 0x3ff (0x7f9144b7bd2f in /home/lzy/pytorch/build/lib/libtorch.so)
    frame #2: + 0x3f68805 (0x7f914635b805 in /home/lzy/pytorch/build/lib/libtorch.so)
    frame #3: + 0x3f868eb (0x7f91463798eb in /home/lzy/pytorch/build/lib/libtorch.so)
    frame #4: + 0x3f8841e (0x7f914637b41e in /home/lzy/pytorch/build/lib/libtorch.so)
    frame #5: std::_Function_handler<std::unique_ptr<caffe2::OperatorBase, std::default_delete > (caffe2::OperatorDef const&, caffe2::Workspace), std::unique_ptr<caffe2::OperatorBase, std::default_delete > ()(caffe2::OperatorDef const&, caffe2::Workspace)>::_M_invoke(std::_Any_data const&, caffe2::OperatorDef const&, caffe2::Workspace&&) + 0x23 (0x7f9164bf96a3 in /home/lzy/pytorch/build/caffe2/python/caffe2_pybind11_state_gpu.so)
    frame #6: + 0x2786301 (0x7f9144b79301 in /home/lzy/pytorch/build/lib/libtorch.so)
    frame #7: caffe2::CreateOperator(caffe2::OperatorDef const&, caffe2::Workspace, int) + 0x32a (0x7f9144b7a60a in /home/lzy/pytorch/build/lib/libtorch.so)
    frame #8: caffe2::dag_utils::prepareOperatorNodes(std::shared_ptr const&, caffe2::Workspace) + 0x17f3 (0x7f9144b74b93 in /home/lzy/pytorch/build/lib/libtorch.so)
    frame #9: caffe2::AsyncNetBase::AsyncNetBase(std::shared_ptr const&, caffe2::Workspace) + 0x246 (0x7f9144b8c026 in /home/lzy/pytorch/build/lib/libtorch.so)
    frame #10: caffe2::AsyncSchedulingNet::AsyncSchedulingNet(std::shared_ptr const&, caffe2::Workspace) + 0x9 (0x7f9144bb6989 in /home/lzy/pytorch/build/lib/libtorch.so)
    frame #11: + 0x27c5e2e (0x7f9144bb8e2e in /home/lzy/pytorch/build/lib/libtorch.so)
    frame #12: std::_Function_handler<std::unique_ptr<caffe2::NetBase, std::default_delete > (std::shared_ptr const&, caffe2::Workspace), std::unique_ptr<caffe2::NetBase, std::default_delete > ()(std::shared_ptr const&, caffe2::Workspace)>::_M_invoke(std::_Any_data const&, std::shared_ptr const&, caffe2::Workspace&&) + 0x23 (0x7f9144bb8ce3 in /home/lzy/pytorch/build/lib/libtorch.so)
    frame #13: caffe2::CreateNet(std::shared_ptr const&, caffe2::Workspace) + 0x847 (0x7f9144bc3117 in /home/lzy/pytorch/build/lib/libtorch.so)
    frame #14: caffe2::Workspace::CreateNet(std::shared_ptr const&, bool) + 0x13c (0x7f9144bdf24c in /home/lzy/pytorch/build/lib/libtorch.so)
    frame #15: caffe2::Workspace::CreateNet(caffe2::NetDef const&, bool) + 0x9f (0x7f9144be094f in /home/lzy/pytorch/build/lib/libtorch.so)
    frame #16: + 0x51f70 (0x7f9164beef70 in /home/lzy/pytorch/build/caffe2/python/caffe2_pybind11_state_gpu.so)
    frame #17: + 0x521de (0x7f9164bef1de in /home/lzy/pytorch/build/caffe2/python/caffe2_pybind11_state_gpu.so)
    frame #18: + 0x99160 (0x7f9164c36160 in /home/lzy/pytorch/build/caffe2/python/caffe2_pybind11_state_gpu.so)

    frame #36: __libc_start_main + 0xf0 (0x7f9168059830 in /lib/x86_64-linux-gnu/libc.so.6)
    frame #37: + 0x107f (0x55e423b0507f in /home/lzy/anaconda2/envs/lzy/bin/python)

What's more, I can train the model with the original Detectron.

carryyu commented 5 years ago

Your Detectron version is a bit outdated.

PKUbahuangliuhe commented 5 years ago

I note that you are using X-101 instead of X-152, so the node names need to be changed: res3_7 and res4_35 should be rewritten as res3_3 and res4_22.
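The mismatch comes from the number of residual blocks per stage, which differs between depths. A minimal sketch of how the last-block blob names can be derived, assuming blocks are numbered from 0 within each stage (as the names quoted in this thread suggest) and using the standard ResNet/ResNeXt block counts. The helper `last_block_sum_blob` and the `BLOCKS_PER_STAGE` table are hypothetical names for illustration, not part of CBNet or Detectron; the `old_` prefix is taken from the error message.

```python
# Residual blocks per stage (res2, res3, res4, res5) for the standard
# ResNet/ResNeXt depths. The depth-50 entry is included for reference only;
# the thread itself discusses only the 101 and 152 variants.
BLOCKS_PER_STAGE = {
    50: (3, 4, 6, 3),
    101: (3, 4, 23, 3),
    152: (3, 8, 36, 3),
}


def last_block_sum_blob(depth, stage, prefix="old_"):
    """Return the name of the last block's sum blob in a stage.

    Hypothetical helper: assumes 0-based block numbering inside each stage
    and the '<prefix>res<stage>_<block>_sum' naming seen in the error message.
    """
    if depth not in BLOCKS_PER_STAGE:
        raise ValueError("unsupported depth: {}".format(depth))
    if stage not in (2, 3, 4, 5):
        raise ValueError("stage must be one of 2, 3, 4, 5")
    num_blocks = BLOCKS_PER_STAGE[depth][stage - 2]
    return "{}res{}_{}_sum".format(prefix, stage, num_blocks - 1)


if __name__ == "__main__":
    for depth in (101, 152):
        print(depth, last_block_sum_blob(depth, 3), last_block_sum_blob(depth, 4))
```

With the 152-layer counts this yields the names the code currently looks up (`old_res3_7_sum`, `old_res4_35_sum`), which do not exist in a 101-layer graph, hence the "non-existing input blob" error.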

PKUbahuangliuhe commented 5 years ago

The lr should be scaled linearly, following the original Detectron, if you reduce the number of GPUs.

carryyu commented 5 years ago

Thank you very much. Can you show me how to change the node name? I would be extremely grateful!

PKUbahuangliuhe commented 5 years ago

In detectron/modeling/ResNet.py, change 'old_res3_7_sum' to 'old_res3_3_sum' on line 134 and 'old_res4_35_sum' to 'old_res4_22_sum' on line 158. Set the lr to 0.00125 (since you use two GPUs) and the number of training iterations to 180000*4.

carryyu commented 5 years ago

Thank you very much, it works!

David-19940718 commented 5 years ago

@PKUbahuangliuhe Thanks for your great work! Could you tell me how to set the lr? What should it be if I have one GPU, and what if I have two? I would also like to know why these values work better for training. Looking forward to your reply. Thanks.

PKUbahuangliuhe commented 5 years ago

First, we halve the lr compared to the baseline. You also need to reduce the lr linearly if you change the number of GPUs (following the original Detectron). For example, the Cascade R-CNN-X152 baseline uses 8 GPUs with an lr of 0.01, so if you train Dual-X152 with 2 GPUs, the lr should be set to 0.01/2/(8/2) = 0.00125. Note that the number of training iterations also needs to be increased when the number of GPUs is reduced, because the batch size shrinks.
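The arithmetic above can be sketched as a small helper. This is only an illustration of the rule described in this comment, not code from CBNet or Detectron: the function name `scale_schedule` and its parameters are hypothetical, and scaling the `STEPS` milestones by the same factor as `MAX_ITER` is an assumption (the thread only mentions the lr and the iteration count).

```python
def scale_schedule(base_lr, max_iter, steps, baseline_gpus, num_gpus, halve_lr=True):
    """Scale a solver schedule from baseline_gpus down to num_gpus.

    Hypothetical helper. The lr is divided by (baseline_gpus / num_gpus) and,
    if halve_lr is set, halved once more as described for the dual-backbone
    model. Iterations are multiplied by the same factor. Scaling STEPS the
    same way is an assumption, not something stated in the thread.
    """
    if num_gpus <= 0 or baseline_gpus % num_gpus != 0:
        raise ValueError("baseline_gpus must be a positive multiple of num_gpus")
    factor = baseline_gpus // num_gpus
    lr = base_lr / factor
    if halve_lr:
        lr /= 2
    return {
        "BASE_LR": lr,
        "MAX_ITER": max_iter * factor,
        "STEPS": [s * factor for s in steps],
    }


if __name__ == "__main__":
    # Baseline from the config in this thread: 8 GPUs, lr 0.01, 180000 iterations.
    print(scale_schedule(0.01, 180000, [0, 120000, 160000], baseline_gpus=8, num_gpus=2))
```

For 2 GPUs this reproduces the values given earlier in the thread: an lr of 0.00125 and 180000*4 iterations.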

lironghua318 commented 5 years ago

How about e2e_cascade_rcnn_R-50-FPN_1x.yaml? My GPU has 12 GB of memory and can't run the 101 model.