facebookresearch / maskrcnn-benchmark

Fast, modular reference implementation of Instance Segmentation and Object Detection algorithms in PyTorch.
MIT License

Optimization for filter_results #350

Open Computational-Camera opened 5 years ago

Computational-Camera commented 5 years ago

It seems that the filter_results function in the final inference stage (roi_box_head) consumes around 40% of the total inference time. Is it possible to accelerate it?
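
For reference, a hedged sketch of how such a breakdown could be measured with the autograd profiler; the issue does not say how the 40% figure was obtained, and the module below is only a stand-in for the detector:

```python
import torch
from torch.autograd import profiler

# Stand-in for the detector's forward pass and post-processing; replace with
# the real model and inputs when profiling maskrcnn-benchmark itself.
model = torch.nn.Linear(1024, 81).cuda().eval()
features = torch.randn(1000, 1024, device="cuda")

with torch.no_grad(), profiler.profile(use_cuda=True) as prof:
    scores = model(features)
    inds = (scores > 0.05).nonzero()   # stand-in for the filtering step

print(prof.key_averages().table(sort_by="cuda_time_total"))
```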

fmassa commented 5 years ago

@Computational-Camera I'm a bit surprised that it takes 40% of the computation time. I think what's happening is that there is a sync point there, because we move data from the GPU to the CPU in https://github.com/facebookresearch/maskrcnn-benchmark/blob/d28845e112de36781b2b5f7217a34b2b62de8d2f/maskrcnn_benchmark/modeling/roi_heads/box_head/inference.py#L131, and this wrongly makes filter_results look like it takes longer than it should.

Could you run your code with CUDA_LAUNCH_BLOCKING=1 python tools/my_script.py to re-check whether the amount of time spent in filter_results is really that large?
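
A minimal sketch of the same check done from inside a script on dummy tensors (not the project's own tooling); the variable has to be set before torch initializes CUDA:

```python
# With blocking launches, each CUDA call returns only after its kernel has
# finished, so per-op wall-clock timings become meaningful.
import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import time
import torch

scores = torch.rand(1000, 81, device="cuda")

start = time.perf_counter()
inds = (scores > 0.05).nonzero()
print(f"nonzero: {(time.perf_counter() - start) * 1e3:.2f} ms")
```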

Computational-Camera commented 5 years ago

@fmassa In fact, I found that the loop at maskrcnn-benchmark/maskrcnn_benchmark/modeling/roi_heads/box_head/inference.py line 109 is what causes the problem. The computation time drops dramatically if I reduce the number of classes. More specifically, inds = inds_all[:, j].nonzero().squeeze(1) accumulates to about 10 ms over the 81 iterations.
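
For context, the shape of that loop as a simplified stand-in on dummy tensors (not the exact source; the real filter_results also does per-class NMS and bookkeeping):

```python
import torch

num_classes = 81            # COCO: 80 classes + background
score_thresh = 0.05

scores = torch.rand(1000, num_classes, device="cuda")
inds_all = scores > score_thresh

# Each .nonzero() must materialize concrete indices, so every one of the
# 81 iterations waits for the GPU before Python can continue.
for j in range(1, num_classes):
    inds = inds_all[:, j].nonzero().squeeze(1)
    scores_j = scores[inds, j]
    # ... per-class boxes, NMS, etc. follow in the real code
```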

fmassa commented 5 years ago

Thanks for the pointer. Question: are you training on COCO?

One way of speeding up evaluation is to increase the score_thresh used in https://github.com/facebookresearch/maskrcnn-benchmark/blob/d28845e112de36781b2b5f7217a34b2b62de8d2f/maskrcnn_benchmark/modeling/roi_heads/box_head/inference.py#L30 by changing its default in https://github.com/facebookresearch/maskrcnn-benchmark/blob/d28845e112de36781b2b5f7217a34b2b62de8d2f/maskrcnn_benchmark/config/defaults.py#L167
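
A rough sketch of how that could be done through the config, assuming the repo's usual yacs-style cfg and the MODEL.ROI_HEADS.SCORE_THRESH key at the linked line; the config file path and the 0.5 value are only illustrative:

```python
from maskrcnn_benchmark.config import cfg

cfg.merge_from_file("configs/e2e_mask_rcnn_R_50_FPN_1x.yaml")
# Raise the per-class score threshold so fewer low-confidence detections
# reach the per-class loop in filter_results.
cfg.merge_from_list(["MODEL.ROI_HEADS.SCORE_THRESH", 0.5])
cfg.freeze()
```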

Computational-Camera commented 5 years ago

Unfortunately, modifying the threshold does not help. It seems that the .nonzero().squeeze(1) operation is not well optimized.

fmassa commented 5 years ago

I still think you are seeing a synchronization effect that hides the real bottleneck: nonzero has a sync point, so the time spent in kernels queued before it gets attributed to the nonzero call.

Did you try re-running the timing checks with CUDA_LAUNCH_BLOCKING=1?
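
To illustrate, a standalone sketch (unrelated to the repo's code): without explicit synchronization, the cost of the preceding kernel shows up under nonzero:

```python
import time
import torch

x = torch.randn(4000, 4000, device="cuda")

torch.cuda.synchronize()
t0 = time.perf_counter()
y = x @ x                        # queued asynchronously; returns almost immediately
t1 = time.perf_counter()
inds = (y > 0).nonzero()         # must wait for the matmul to finish first
t2 = time.perf_counter()

print(f"matmul (apparent):  {(t1 - t0) * 1e3:.2f} ms")
print(f"nonzero (apparent): {(t2 - t1) * 1e3:.2f} ms")   # includes the matmul's real cost
```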

Computational-Camera commented 5 years ago

In my case, there is only a single GPU.

fmassa commented 5 years ago

Even with a single GPU, CUDA launches are asynchronous by nature, so you would still see the timings change by adding CUDA_LAUNCH_BLOCKING=1.
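
A tiny, illustrative sketch of what that asynchrony looks like on a single GPU:

```python
import time
import torch

x = torch.randn(8000, 8000, device="cuda")   # one GPU is enough

t0 = time.perf_counter()
y = x @ x                                    # the launch is queued and returns right away
launch_ms = (time.perf_counter() - t0) * 1e3

torch.cuda.synchronize()                     # block until the kernel actually finishes
total_ms = (time.perf_counter() - t0) * 1e3

print(f"time to launch:   {launch_ms:.3f} ms")
print(f"time to complete: {total_ms:.2f} ms")
```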