The first step is to understand where execution time is spent. To do that, use `MODEL.EXECUTION_TYPE prof_dag`:
```
python2 tools/test_net.py --cfg configs/12_2017_baselines/rpn_R-50-C4_1x.yaml TEST.WEIGHTS https://s3-us-west-2.amazonaws.com/detectron/35998355/12_2017_baselines/rpn_R-50-C4_1x.yaml.08_00_43.njH5oD9L/output/train/coco_2014_train:coco_2014_valminusminival/rpn/model_final.pkl TEST.DATASETS "('coco_2014_minival',)" MODEL.EXECUTION_TYPE prof_dag
```
Output after exiting:
```
I0403 07:36:22.714701 547318 prof_dag_net.cc:188] Measured operators over 84 net runs.
I0403 07:36:22.714779 547318 prof_dag_net.cc:205] Mean time in operator per run (stddev):
I0403 07:36:22.714784 547318 prof_dag_net.cc:209] 12.9896 ms/run ( 3.27039 ms/run) Op count per run: 43 AffineChannel
I0403 07:36:22.714797 547318 prof_dag_net.cc:209] 40.543 ms/run ( 9.98674 ms/run) Op count per run: 46 Conv
I0403 07:36:22.714802 547318 prof_dag_net.cc:209] 0.467649 ms/run ( 0.175602 ms/run) Op count per run: 1 MaxPool
I0403 07:36:22.714807 547318 prof_dag_net.cc:209] 238.597 ms/run ( 163.352 ms/run) Op count per run: 1 Python
I0403 07:36:22.714812 547318 prof_dag_net.cc:209] 8.17472 ms/run ( 1.04102 ms/run) Op count per run: 41 Relu
I0403 07:36:22.714818 547318 prof_dag_net.cc:209] 0.0666656 ms/run ( 0.143383 ms/run) Op count per run: 1 Sigmoid
I0403 07:36:22.714823 547318 prof_dag_net.cc:209] 0.00641894 ms/run (0.00590622 ms/run) Op count per run: 1 StopGradient
I0403 07:36:22.714828 547318 prof_dag_net.cc:209] 5.41326 ms/run ( 0.798915 ms/run) Op count per run: 13 Sum
```
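To read the profile at a glance, it can help to aggregate the per-op lines into totals and percentages. A minimal sketch (not part of Detectron; the regex only assumes the prof_dag log format shown above, and `summarize` is a hypothetical helper name):

```python
import re

# Matches prof_dag lines like:
#   "12.9896 ms/run ( 3.27039 ms/run) Op count per run: 43 AffineChannel"
LINE_RE = re.compile(
    r"([\d.eE+-]+) ms/run \(\s*[\d.eE+-]+ ms/run\)\s+"
    r"Op count per run:\s+(\d+)\s+(\w+)"
)

def summarize(log_text):
    """Print mean ms/run per op type and its share of the total."""
    rows = []
    for line in log_text.splitlines():
        m = LINE_RE.search(line)
        if m:
            mean_ms, count, op = m.groups()
            rows.append((op, float(mean_ms), int(count)))
    total = sum(mean for _, mean, _ in rows)
    for op, mean, count in sorted(rows, key=lambda r: -r[1]):
        print("%-14s %9.3f ms/run  %5.1f%%  (%d ops)"
              % (op, mean, 100.0 * mean / total, count))
    print("total          %9.3f ms/run" % total)
```

For the C4 profile above, this shows the single `Python` op accounting for roughly three quarters of the mean per-image time.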
If you compare this profile between the C4 and FPN models, you'll see that the FPN model does in fact spend more time executing the `Conv` op, but the C4 model spends significantly more time executing a `Python` op. In this case, based on my background knowledge, I can hypothesize that the difference in perf is due to `nms` in the Python op implemented in `lib.ops.GenerateProposalsOp`. The issue is that the `nms` function has O(n^2) runtime, where n is the number of proposals. The FPN version runs `nms` separately for each pyramid level, with a relatively small number of proposals per level (at most 2000 by default). The C4 version runs `nms` on a relatively large number of proposals that all come from a single level (12000 by default). The quadratic runtime behavior of `nms` makes a big difference here.
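As a rough back-of-envelope check of the O(n^2) argument, assuming around 5 FPN pyramid levels (my assumption) and the default proposal counts mentioned above:

```python
# Illustrative quadratic cost model for nms (ignores constant factors).
def nms_cost(n):
    return n ** 2

c4 = nms_cost(12000)        # C4: one nms pass over ~12000 proposals
fpn = 5 * nms_cost(2000)    # FPN: assumed ~5 levels, <=2000 proposals each
print("C4 / FPN nms cost ratio: %.1fx" % (c4 / float(fpn)))  # ~7.2x
```

That is consistent with the much larger `Python` op time in the C4 profile, even though FPN runs more `Conv` ops.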
If you set `TEST.RPN_PRE_NMS_TOP_N` to a smaller value, such as 10000 or even 5000, then you'll see faster runtimes (but possibly with lower proposal AR). E.g.:
```
python2 tools/test_net.py --cfg configs/12_2017_baselines/rpn_R-50-C4_1x.yaml TEST.WEIGHTS https://s3-us-west-2.amazonaws.com/detectron/35998355/12_2017_baselines/rpn_R-50-C4_1x.yaml.08_00_43.njH5oD9L/output/train/coco_2014_train:coco_2014_valminusminival/rpn/model_final.pkl TEST.DATASETS "('coco_2014_minival',)" MODEL.EXECUTION_TYPE prof_dag TEST.RPN_PRE_NMS_TOP_N 10000
```
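Under the same rough quadratic model, the reduction in just the nms portion scales with the ratio of squares (a rough estimate only; the end-to-end speedup will be smaller because the other ops are unchanged):

```python
# Relative nms work vs. the default of 12000 proposals (rough estimate only).
for top_n in (12000, 10000, 5000):
    print("TOP_N=%5d -> ~%.1fx less nms work" % (top_n, (12000.0 / top_n) ** 2))
```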
As shown in the table, for RPN, the FPN model's inference time is faster than Faster R-CNN's. From my point of view, FPN adds more convolution operations, so this seems strange. Can anyone explain it?