I'm looking to train R-FCN with a ResNet-101 backbone end to end on the MS-COCO dataset.
To start, I wanted to make sure my setup was correct, so I used tutorial_1gpu_e2e_faster_rcnn_R-50-FPN.yaml to train Faster R-CNN on coco-2014, and this trained and evaluated on the dataset without any issues.
I then modified the yaml file to train R-FCN with ResNet-101, keeping all other parameters the same. The model seems to train fine, but then I get a crash when the train_net.py script begins inference.
As mentioned by @ir413 on the issue above, I tried setting the FASTER_RCNN flag to True, but that causes another crash, which I've detailed below.
I would like to train my model and the RPN subnetwork end to end, and then use the trained RPN to produce proposals at inference time. My understanding is that setting the FASTER_RCNN flag in config.py is the only way to train an RPN and the backbone end to end and then use that trained RPN for region proposals during inference. Further, config.py clearly states that RPN_ON should not be set directly in the config file, but rather inferred from the other flags that set it, namely FASTER_RCNN and RPN_ONLY.
I would really appreciate some pointers to resolve this, since it appears that there is most likely a config flag I'm missing @ir413 @rbgirshick. Thanks in advance!
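For context, here is how I read the flag-inference behavior described in config.py. This is a simplified sketch of my understanding, not Detectron's actual code; the function name `infer_rpn_on` is mine:

```python
def infer_rpn_on(faster_rcnn, rpn_only):
    """Sketch of the rule config.py describes: RPN_ON should not be
    set directly; it is derived from the model-type flags. Training
    an RPN end to end with the backbone requires one of these flags
    to be True."""
    return faster_rcnn or rpn_only

# With both flags False (my first config), no RPN is trained, so
# inference has to fall back to precomputed proposal files.
print(infer_rpn_on(False, False))  # False
print(infer_rpn_on(True, False))   # True
```

If this reading is right, it explains why the two crashes below differ: the first config never builds an RPN, while the second expects one but the R-FCN head wiring doesn't receive its output.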
Expected results
The model trains successfully and evaluates on coco-minival-2014.
Actual results
If FASTER_RCNN is set to False in the config file:
The model trains without any issues but crashes during inference. Here is the stack trace:
Traceback (most recent call last):
File "tools/train_net.py", line 132, in <module>
main()
File "tools/train_net.py", line 117, in main
test_model(checkpoints['final'], args.multi_gpu_testing, args.opts)
File "tools/train_net.py", line 127, in test_model
check_expected_results=True,
File "/localdisk/sairamsu/caffe2-training/detectron/detectron/core/test_engine.py", line 127, in run_inference
all_results = result_getter()
File "/localdisk/sairamsu/caffe2-training/detectron/detectron/core/test_engine.py", line 100, in result_getter
dataset_name, proposal_file = get_inference_dataset(i)
File "/localdisk/sairamsu/caffe2-training/detectron/detectron/core/test_engine.py", line 75, in get_inference_dataset
'If proposals are used, one proposal file must be specified for ' \
AssertionError: If proposals are used, one proposal file must be specified for each dataset
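If I understand the test engine correctly, this assertion fires because with FASTER_RCNN set to False the model has no trained RPN, so inference needs one precomputed proposal file per test dataset. A hypothetical sketch of that check (the function name and signature are mine, not Detectron's exact code in test_engine.py):

```python
def check_proposal_files(datasets, proposal_files):
    """With no trained RPN, each test dataset must come with a
    precomputed proposal file, and the two lists must line up
    one-to-one."""
    assert len(proposal_files) == len(datasets), (
        'If proposals are used, one proposal file must be specified '
        'for each dataset')
    return list(zip(datasets, proposal_files))

# This mirrors the crash: one test dataset, zero proposal files.
try:
    check_proposal_files(('coco_2014_minival',), ())
except AssertionError as e:
    print('AssertionError:', e)
```

So with this configuration the training itself is fine; the crash only says that inference-time region proposals have to come from somewhere, and nothing in my config supplies them.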
If FASTER_RCNN is set to True in the config file:
The model doesn't even begin training; it crashes immediately. Here is the stack trace:
Traceback (most recent call last):
File "tools/train_net.py", line 132, in <module>
main()
File "tools/train_net.py", line 114, in main
checkpoints = detectron.utils.train.train_model()
File "/localdisk/sairamsu/caffe2-training/detectron/detectron/utils/train.py", line 58, in train_model
setup_model_for_training(model, weights_file, output_dir)
File "/localdisk/sairamsu/caffe2-training/detectron/detectron/utils/train.py", line 178, in setup_model_for_training
workspace.CreateNet(model.net)
File "/nfs/pdx/home/sairamsu/miniconda3/envs/sairamsu-caffe2-detectron/lib/python2.7/site-packages/caffe2/python/workspace.py", line 171, in CreateNet
StringifyProto(net), overwrite,
File "/nfs/pdx/home/sairamsu/miniconda3/envs/sairamsu-caffe2-detectron/lib/python2.7/site-packages/caffe2/python/workspace.py", line 197, in CallWithExceptionIntercept
return func(*args, **kwargs)
RuntimeError: [enforce fail at operator.cc:46] blob != nullptr. op PSRoIPool: Encountered a non-existing input blob: gpu_0/rois
frame #0: c10::ThrowEnforceNotMet(char const*, int, char const*, std::string const&, void const*) + 0x59 (0x7f9f7b890d69 in /nfs/pdx/home/sairamsu/miniconda3/envs/sairamsu-caffe2-detectron/lib/python2.7/site-packages/caffe2/python/../../torch/lib/libc10.so)
frame #1: caffe2::OperatorBase::OperatorBase(caffe2::OperatorDef const&, caffe2::Workspace*) + 0x57c (0x7f9faccebe1c in /nfs/pdx/home/sairamsu/miniconda3/envs/sairamsu-caffe2-detectron/lib/python2.7/site-packages/caffe2/python/../../torch/lib/libcaffe2.so)
frame #2: <unknown function> + 0x6747c (0x7f9eb974547c in /nfs/pdx/home/sairamsu/miniconda3/envs/sairamsu-caffe2-detectron/lib/python2.7/site-packages/torch/lib/libcaffe2_detectron_ops_gpu.so)
frame #3: <unknown function> + 0x6789e (0x7f9eb974589e in /nfs/pdx/home/sairamsu/miniconda3/envs/sairamsu-caffe2-detectron/lib/python2.7/site-packages/torch/lib/libcaffe2_detectron_ops_gpu.so)
frame #4: std::_Function_handler<std::unique_ptr<caffe2::OperatorBase, std::default_delete<caffe2::OperatorBase> > (caffe2::OperatorDef const&, caffe2::Workspace*), std::unique_ptr<caffe2::OperatorBase, std::default_delete<caffe2::OperatorBase> > (*)(caffe2::OperatorDef const&, caffe2::Workspace*)>::_M_invoke(std::_Any_data const&, caffe2::OperatorDef const&, caffe2::Workspace*) + 0xf (0x7f9fae8a006f in /nfs/pdx/home/sairamsu/miniconda3/envs/sairamsu-caffe2-detectron/lib/python2.7/site-packages/caffe2/python/caffe2_pybind11_state_gpu.so)
frame #5: <unknown function> + 0x144bb5f (0x7f9facce7b5f in /nfs/pdx/home/sairamsu/miniconda3/envs/sairamsu-caffe2-detectron/lib/python2.7/site-packages/caffe2/python/../../torch/lib/libcaffe2.so)
frame #6: <unknown function> + 0x144eb49 (0x7f9facceab49 in /nfs/pdx/home/sairamsu/miniconda3/envs/sairamsu-caffe2-detectron/lib/python2.7/site-packages/caffe2/python/../../torch/lib/libcaffe2.so)
frame #7: caffe2::CreateOperator(caffe2::OperatorDef const&, caffe2::Workspace*, int) + 0x310 (0x7f9facceaf20 in /nfs/pdx/home/sairamsu/miniconda3/envs/sairamsu-caffe2-detectron/lib/python2.7/site-packages/caffe2/python/../../torch/lib/libcaffe2.so)
frame #8: caffe2::dag_utils::prepareOperatorNodes(std::shared_ptr<caffe2::NetDef const> const&, caffe2::Workspace*) + 0x77f (0x7f9faccdb46f in /nfs/pdx/home/sairamsu/miniconda3/envs/sairamsu-caffe2-detectron/lib/python2.7/site-packages/caffe2/python/../../torch/lib/libcaffe2.so)
frame #9: caffe2::AsyncNetBase::AsyncNetBase(std::shared_ptr<caffe2::NetDef const> const&, caffe2::Workspace*) + 0x28b (0x7f9faccc5d5b in /nfs/pdx/home/sairamsu/miniconda3/envs/sairamsu-caffe2-detectron/lib/python2.7/site-packages/caffe2/python/../../torch/lib/libcaffe2.so)
frame #10: caffe2::AsyncSchedulingNet::AsyncSchedulingNet(std::shared_ptr<caffe2::NetDef const> const&, caffe2::Workspace*) + 0x9 (0x7f9facccadf9 in /nfs/pdx/home/sairamsu/miniconda3/envs/sairamsu-caffe2-detectron/lib/python2.7/site-packages/caffe2/python/../../torch/lib/libcaffe2.so)
frame #11: <unknown function> + 0x14305be (0x7f9facccc5be in /nfs/pdx/home/sairamsu/miniconda3/envs/sairamsu-caffe2-detectron/lib/python2.7/site-packages/caffe2/python/../../torch/lib/libcaffe2.so)
frame #12: <unknown function> + 0x143048f (0x7f9facccc48f in /nfs/pdx/home/sairamsu/miniconda3/envs/sairamsu-caffe2-detectron/lib/python2.7/site-packages/caffe2/python/../../torch/lib/libcaffe2.so)
frame #13: caffe2::CreateNet(std::shared_ptr<caffe2::NetDef const> const&, caffe2::Workspace*) + 0x659 (0x7f9faccc0659 in /nfs/pdx/home/sairamsu/miniconda3/envs/sairamsu-caffe2-detectron/lib/python2.7/site-packages/caffe2/python/../../torch/lib/libcaffe2.so)
frame #14: caffe2::Workspace::CreateNet(std::shared_ptr<caffe2::NetDef const> const&, bool) + 0xe4 (0x7f9facd1cef4 in /nfs/pdx/home/sairamsu/miniconda3/envs/sairamsu-caffe2-detectron/lib/python2.7/site-packages/caffe2/python/../../torch/lib/libcaffe2.so)
frame #15: caffe2::Workspace::CreateNet(caffe2::NetDef const&, bool) + 0x7f (0x7f9facd1d1af in /nfs/pdx/home/sairamsu/miniconda3/envs/sairamsu-caffe2-detectron/lib/python2.7/site-packages/caffe2/python/../../torch/lib/libcaffe2.so)
frame #16: <unknown function> + 0x552f2 (0x7f9fae89a2f2 in /nfs/pdx/home/sairamsu/miniconda3/envs/sairamsu-caffe2-detectron/lib/python2.7/site-packages/caffe2/python/caffe2_pybind11_state_gpu.so)
frame #17: <unknown function> + 0x897e8 (0x7f9fae8ce7e8 in /nfs/pdx/home/sairamsu/miniconda3/envs/sairamsu-caffe2-detectron/lib/python2.7/site-packages/caffe2/python/caffe2_pybind11_state_gpu.so)
<omitting python frames>
frame #35: __libc_start_main + 0xf0 (0x7f9fbf49a830 in /lib/x86_64-linux-gnu/libc.so.6)
It appears that the rois blob isn't found, per this line in the stack trace: RuntimeError: [enforce fail at operator.cc:46] blob != nullptr. op PSRoIPool: Encountered a non-existing input blob: gpu_0/rois
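To sanity-check what CreateNet is enforcing here, I wrote a small generic helper that walks an op list and flags any input blob that no earlier op produced. This is a toy sketch with made-up op dicts, not Caffe2's NetDef format or Detectron code:

```python
def find_missing_inputs(ops, external_inputs):
    """Return (op_type, blob) pairs where an op reads a blob that is
    neither an external input nor the output of an earlier op --
    the condition CreateNet enforces against (here, PSRoIPool
    reading gpu_0/rois that nothing wrote)."""
    available = set(external_inputs)
    missing = []
    for op in ops:
        for blob in op['inputs']:
            if blob not in available:
                missing.append((op['type'], blob))
        available.update(op['outputs'])
    return missing

# Toy net: PSRoIPool consumes gpu_0/rois, but no RPN op produced it.
ops = [
    {'type': 'Conv', 'inputs': ['gpu_0/data'], 'outputs': ['gpu_0/conv5']},
    {'type': 'PSRoIPool',
     'inputs': ['gpu_0/conv5', 'gpu_0/rois'],
     'outputs': ['gpu_0/psroi']},
]
print(find_missing_inputs(ops, ['gpu_0/data']))
# [('PSRoIPool', 'gpu_0/rois')]
```

In other words, with FASTER_RCNN=True the R-FCN head is built expecting rois from an RPN branch, but that branch apparently never writes gpu_0/rois into the training net, which is what I'm hoping a config flag can fix.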
Detailed steps to reproduce
Here is the config file I wrote for training R-FCN end to end:
Here is the command I used
System information
PYTHONPATH environment variable: /nfs/site/home/sairamsu/tf_training/models:/nfs/site/home/sairamsu/:/nfs/site/home/sairamsu/tf_training/models:/nfs/site/home/sairamsu/
python --version output: 2.7.15