facebookresearch / Detectron

FAIR's research platform for object detection research, implementing popular algorithms like Mask R-CNN and RetinaNet.
Apache License 2.0

R-FCN e2e trains on COCO 2014 but inference crashes: problem with enabling proposals #816

Open sairams-intel opened 5 years ago

sairams-intel commented 5 years ago

I would like to train my model and the RPN subnetwork end to end, and then use the trained RPN to produce the proposals at inference time. My understanding is that setting the FASTER_RCNN flag in the config file is the only way to train an RPN and the backbone end to end and then use that trained RPN for region proposals during inference. Furthermore, config.py clearly states that RPN_ON should not be set directly in the config file, but rather inferred from the other flags that set it, namely FASTER_RCNN and RPN_ONLY.
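
For reference, here is my rough paraphrase (from memory, not verbatim) of the flag inference I believe happens in assert_and_infer_cfg() in detectron/core/config.py; the point being that RPN_ON and PRECOMPUTED_PROPOSALS are derived from FASTER_RCNN / RPN_ONLY rather than set by hand:

def infer_rpn_flags(cfg):
    # An in-network RPN is trained whenever the model is RPN-only or a
    # Faster R-CNN style end-to-end model.
    if cfg.MODEL.RPN_ONLY or cfg.MODEL.FASTER_RCNN:
        cfg.RPN.RPN_ON = True
    # With an in-network RPN (or RetinaNet) there should be no need for
    # precomputed proposal files at test time.
    if cfg.RPN.RPN_ON or cfg.RETINANET.RETINANET_ON:
        cfg.TEST.PRECOMPUTED_PROPOSALS = False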

I would really appreciate some pointers to resolve this, since it appears that there is most likely a config flag I'm missing. @ir413 @rbgirshick Thanks in advance!

Expected results

The model trains successfully and evaluates on coco_2014_minival.

Actual results

If FASTER_RCNN is set to False in the config file: The model trains without any issues but crashes during inference. Here is the stack trace:

Traceback (most recent call last):
  File "tools/train_net.py", line 132, in <module>
    main()
  File "tools/train_net.py", line 117, in main
    test_model(checkpoints['final'], args.multi_gpu_testing, args.opts)
  File "tools/train_net.py", line 127, in test_model
    check_expected_results=True,
  File "/localdisk/sairamsu/caffe2-training/detectron/detectron/core/test_engine.py", line 127, in run_inference
    all_results = result_getter()
  File "/localdisk/sairamsu/caffe2-training/detectron/detectron/core/test_engine.py", line 100, in result_getter
    dataset_name, proposal_file = get_inference_dataset(i)
  File "/localdisk/sairamsu/caffe2-training/detectron/detectron/core/test_engine.py", line 75, in get_inference_dataset
    'If proposals are used, one proposal file must be specified for ' \
AssertionError: If proposals are used, one proposal file must be specified for each dataset
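
This assertion seems to come from get_inference_dataset in detectron/core/test_engine.py. My paraphrase of the check (not verbatim, just how I read it) is roughly:

def get_inference_dataset_check(cfg, index):
    # Sketch of the failing check: when the model has no in-network RPN, the
    # test engine falls back to precomputed proposals and requires one
    # proposal file per TEST dataset.
    dataset_name = cfg.TEST.DATASETS[index]
    if cfg.TEST.PRECOMPUTED_PROPOSALS:
        assert len(cfg.TEST.PROPOSAL_FILES) == len(cfg.TEST.DATASETS), \
            'If proposals are used, one proposal file must be specified for ' \
            'each dataset'
        proposal_file = cfg.TEST.PROPOSAL_FILES[index]
    else:
        proposal_file = None
    return dataset_name, proposal_file

So with FASTER_RCNN left at False, the only way past this appears to be supplying a TEST.PROPOSAL_FILES entry for each test dataset, which defeats the purpose, since I want the trained RPN to generate the proposals.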

If FASTER_RCNN is set to True in the config file: the model doesn't even start training; it crashes while creating the net. Here is the stack trace:

Traceback (most recent call last):
  File "tools/train_net.py", line 132, in <module>
    main()
  File "tools/train_net.py", line 114, in main
    checkpoints = detectron.utils.train.train_model()
  File "/localdisk/sairamsu/caffe2-training/detectron/detectron/utils/train.py", line 58, in train_model
    setup_model_for_training(model, weights_file, output_dir)
  File "/localdisk/sairamsu/caffe2-training/detectron/detectron/utils/train.py", line 178, in setup_model_for_training
    workspace.CreateNet(model.net)
  File "/nfs/pdx/home/sairamsu/miniconda3/envs/sairamsu-caffe2-detectron/lib/python2.7/site-packages/caffe2/python/workspace.py", line 171, in CreateNet
    StringifyProto(net), overwrite,
  File "/nfs/pdx/home/sairamsu/miniconda3/envs/sairamsu-caffe2-detectron/lib/python2.7/site-packages/caffe2/python/workspace.py", line 197, in CallWithExceptionIntercept
    return func(*args, **kwargs)
RuntimeError: [enforce fail at operator.cc:46] blob != nullptr. op PSRoIPool: Encountered a non-existing input blob: gpu_0/rois
frame #0: c10::ThrowEnforceNotMet(char const*, int, char const*, std::string const&, void const*) + 0x59 (0x7f9f7b890d69 in /nfs/pdx/home/sairamsu/miniconda3/envs/sairamsu-caffe2-detectron/lib/python2.7/site-packages/caffe2/python/../../torch/lib/libc10.so)
frame #1: caffe2::OperatorBase::OperatorBase(caffe2::OperatorDef const&, caffe2::Workspace*) + 0x57c (0x7f9faccebe1c in /nfs/pdx/home/sairamsu/miniconda3/envs/sairamsu-caffe2-detectron/lib/python2.7/site-packages/caffe2/python/../../torch/lib/libcaffe2.so)
frame #2: <unknown function> + 0x6747c (0x7f9eb974547c in /nfs/pdx/home/sairamsu/miniconda3/envs/sairamsu-caffe2-detectron/lib/python2.7/site-packages/torch/lib/libcaffe2_detectron_ops_gpu.so)
frame #3: <unknown function> + 0x6789e (0x7f9eb974589e in /nfs/pdx/home/sairamsu/miniconda3/envs/sairamsu-caffe2-detectron/lib/python2.7/site-packages/torch/lib/libcaffe2_detectron_ops_gpu.so)
frame #4: std::_Function_handler<std::unique_ptr<caffe2::OperatorBase, std::default_delete<caffe2::OperatorBase> > (caffe2::OperatorDef const&, caffe2::Workspace*), std::unique_ptr<caffe2::OperatorBase, std::default_delete<caffe2::OperatorBase> > (*)(caffe2::OperatorDef const&, caffe2::Workspace*)>::_M_invoke(std::_Any_data const&, caffe2::OperatorDef const&, caffe2::Workspace*) + 0xf (0x7f9fae8a006f in /nfs/pdx/home/sairamsu/miniconda3/envs/sairamsu-caffe2-detectron/lib/python2.7/site-packages/caffe2/python/caffe2_pybind11_state_gpu.so)
frame #5: <unknown function> + 0x144bb5f (0x7f9facce7b5f in /nfs/pdx/home/sairamsu/miniconda3/envs/sairamsu-caffe2-detectron/lib/python2.7/site-packages/caffe2/python/../../torch/lib/libcaffe2.so)
frame #6: <unknown function> + 0x144eb49 (0x7f9facceab49 in /nfs/pdx/home/sairamsu/miniconda3/envs/sairamsu-caffe2-detectron/lib/python2.7/site-packages/caffe2/python/../../torch/lib/libcaffe2.so)
frame #7: caffe2::CreateOperator(caffe2::OperatorDef const&, caffe2::Workspace*, int) + 0x310 (0x7f9facceaf20 in /nfs/pdx/home/sairamsu/miniconda3/envs/sairamsu-caffe2-detectron/lib/python2.7/site-packages/caffe2/python/../../torch/lib/libcaffe2.so)
frame #8: caffe2::dag_utils::prepareOperatorNodes(std::shared_ptr<caffe2::NetDef const> const&, caffe2::Workspace*) + 0x77f (0x7f9faccdb46f in /nfs/pdx/home/sairamsu/miniconda3/envs/sairamsu-caffe2-detectron/lib/python2.7/site-packages/caffe2/python/../../torch/lib/libcaffe2.so)
frame #9: caffe2::AsyncNetBase::AsyncNetBase(std::shared_ptr<caffe2::NetDef const> const&, caffe2::Workspace*) + 0x28b (0x7f9faccc5d5b in /nfs/pdx/home/sairamsu/miniconda3/envs/sairamsu-caffe2-detectron/lib/python2.7/site-packages/caffe2/python/../../torch/lib/libcaffe2.so)
frame #10: caffe2::AsyncSchedulingNet::AsyncSchedulingNet(std::shared_ptr<caffe2::NetDef const> const&, caffe2::Workspace*) + 0x9 (0x7f9facccadf9 in /nfs/pdx/home/sairamsu/miniconda3/envs/sairamsu-caffe2-detectron/lib/python2.7/site-packages/caffe2/python/../../torch/lib/libcaffe2.so)
frame #11: <unknown function> + 0x14305be (0x7f9facccc5be in /nfs/pdx/home/sairamsu/miniconda3/envs/sairamsu-caffe2-detectron/lib/python2.7/site-packages/caffe2/python/../../torch/lib/libcaffe2.so)
frame #12: <unknown function> + 0x143048f (0x7f9facccc48f in /nfs/pdx/home/sairamsu/miniconda3/envs/sairamsu-caffe2-detectron/lib/python2.7/site-packages/caffe2/python/../../torch/lib/libcaffe2.so)
frame #13: caffe2::CreateNet(std::shared_ptr<caffe2::NetDef const> const&, caffe2::Workspace*) + 0x659 (0x7f9faccc0659 in /nfs/pdx/home/sairamsu/miniconda3/envs/sairamsu-caffe2-detectron/lib/python2.7/site-packages/caffe2/python/../../torch/lib/libcaffe2.so)
frame #14: caffe2::Workspace::CreateNet(std::shared_ptr<caffe2::NetDef const> const&, bool) + 0xe4 (0x7f9facd1cef4 in /nfs/pdx/home/sairamsu/miniconda3/envs/sairamsu-caffe2-detectron/lib/python2.7/site-packages/caffe2/python/../../torch/lib/libcaffe2.so)
frame #15: caffe2::Workspace::CreateNet(caffe2::NetDef const&, bool) + 0x7f (0x7f9facd1d1af in /nfs/pdx/home/sairamsu/miniconda3/envs/sairamsu-caffe2-detectron/lib/python2.7/site-packages/caffe2/python/../../torch/lib/libcaffe2.so)
frame #16: <unknown function> + 0x552f2 (0x7f9fae89a2f2 in /nfs/pdx/home/sairamsu/miniconda3/envs/sairamsu-caffe2-detectron/lib/python2.7/site-packages/caffe2/python/caffe2_pybind11_state_gpu.so)
frame #17: <unknown function> + 0x897e8 (0x7f9fae8ce7e8 in /nfs/pdx/home/sairamsu/miniconda3/envs/sairamsu-caffe2-detectron/lib/python2.7/site-packages/caffe2/python/caffe2_pybind11_state_gpu.so)
<omitting python frames>
frame #35: __libc_start_main + 0xf0 (0x7f9fbf49a830 in /lib/x86_64-linux-gnu/libc.so.6)

It appears that the rois blob isn't found, per this line in the stack trace: RuntimeError: [enforce fail at operator.cc:46] blob != nullptr. op PSRoIPool: Encountered a non-existing input blob: gpu_0/rois
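
A tiny helper like the one below (hypothetical, not part of Detectron) could be dropped in right before the workspace.CreateNet(model.net) call in setup_model_for_training (detectron/utils/train.py) to check whether any op in the training net actually produces the rois blob that PSRoIPool consumes:

def find_producers(net, blob_name='gpu_0/rois'):
    # Walk the Caffe2 NetDef and collect the types of all ops that list
    # blob_name among their outputs.
    producers = []
    for op in net.Proto().op:
        if blob_name in op.output:
            producers.append(op.type)
    return producers

# e.g. print(find_producers(model.net)) just before workspace.CreateNet(model.net);
# an empty list would match the enforce failure above, i.e. nothing in the net
# creates gpu_0/rois before PSRoIPool tries to read it.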

Detailed steps to reproduce

Here is the config file I wrote for training R-FCN end to end:

MODEL:
  TYPE: rfcn
  CONV_BODY: ResNet.add_ResNet101_conv5_body
  NUM_CLASSES: 81
  #FASTER_RCNN: True # Code crashes when set to True
NUM_GPUS: 1
SOLVER:
  WEIGHT_DECAY: 0.0001
  LR_POLICY: steps_with_decay
  BASE_LR: 0.0025
  GAMMA: 0.1
  MAX_ITER: 120000
  STEPS: [0, 90000, 100000]
RFCN:
  PS_GRID_SIZE: 3
TRAIN:
  WEIGHTS: https://dl.fbaipublicfiles.com/detectron/ImageNetPretrained/MSRA/R-101.pkl
  DATASETS: ('coco_2014_train', 'coco_2014_valminusminival')
  SCALES: (500,)
  MAX_SIZE: 833
  BATCH_SIZE_PER_IM: 256
  RPN_PRE_NMS_TOP_N: 2000
TEST:
  DATASETS: ('coco_2014_minival',)
  SCALE: 500
  MAX_SIZE: 833
  NMS: 0.5
  RPN_PRE_NMS_TOP_N: 1000  # Per FPN level
  RPN_POST_NMS_TOP_N: 1000

OUTPUT_DIR: /localdisk/sairamsu/caffe2-training/detectron/RFCN

Here is the command I used:

python tools/train_net.py --cfg configs/getting_started/e2e_rfcn_R-101_1x.yaml OUTPUT_DIR RFCN/
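
I believe the same flag can also be toggled through the opts mechanism that train_net.py already uses for OUTPUT_DIR above (key/value pairs after the config path are merged into the config), e.g.:

python tools/train_net.py --cfg configs/getting_started/e2e_rfcn_R-101_1x.yaml OUTPUT_DIR RFCN/ MODEL.FASTER_RCNN True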

System information

shady-cs15 commented 5 years ago

I've also been stuck on the same issue for a while. @ir413 @rbgirshick, any suggestions?

ravising-h commented 4 years ago

Has this issue been resolved?