Computational-Camera opened this issue 5 years ago
@Computational-Camera I'm a bit surprised that it takes 40% of the computation time. I think what's happening is that there is a sync point there, because we move data from the GPU to the CPU in https://github.com/facebookresearch/maskrcnn-benchmark/blob/d28845e112de36781b2b5f7217a34b2b62de8d2f/maskrcnn_benchmark/modeling/roi_heads/box_head/inference.py#L131 , and this wrongly makes it look like the function takes longer than it should.
Could you run your code with `CUDA_LAUNCH_BLOCKING=1 python tools/my_script.py` to re-check whether the amount of time spent in `filter_results` is indeed that large?
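Following up on that suggestion: an alternative to `CUDA_LAUNCH_BLOCKING=1` is to bracket the region being measured with explicit synchronization. A minimal sketch (the tensor shape and the 0.05 threshold are arbitrary stand-ins, and it falls back to CPU when no GPU is present):

```python
import time
import torch

def timed(fn, iters=50):
    """Average wall-clock time of fn(), synchronizing the GPU so that
    asynchronous kernel launches are not attributed to the wrong op."""
    fn()  # warm-up
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters

# Hypothetical stand-in for the per-class score mask inside filter_results.
scores = torch.rand(1000, 81)
avg_s = timed(lambda: (scores > 0.05).nonzero())
print(f"avg nonzero time: {avg_s * 1e3:.3f} ms")
```

Timing a single op this way tells you whether `filter_results` itself is slow, or whether it is merely the first sync point where earlier async work gets billed.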
@fmassa In fact, I found that the loop at maskrcnn-benchmark/maskrcnn_benchmark/modeling/roi_heads/box_head/inference.py line 109 is what causes the problem. The computation time can be reduced dramatically if I reduce the number of classes. More specifically, `inds = inds_all[:, j].nonzero().squeeze(1)` takes about 10 ms after 81 iterations.
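For context, here is a CPU-only sketch of the loop shape in question, plus one possible batched rework (the shapes and the 0.05 threshold are hypothetical stand-ins, and `nonzero(as_tuple=True)` requires a reasonably recent PyTorch):

```python
import torch

# Hypothetical shapes mirroring filter_results: N candidate boxes,
# 81 COCO classes (class 0 = background).
N, num_classes = 1000, 81
scores = torch.rand(N, num_classes)
inds_all = scores > 0.05

# Per-class loop as in inference.py: one nonzero() per class, each of
# which forces a device-to-host sync when the tensors live on the GPU.
per_class = []
for j in range(1, num_classes):
    inds = inds_all[:, j].nonzero().squeeze(1)
    per_class.append(inds)

# One possible rework: a single nonzero() over the whole mask, then a
# per-class split done on the CPU, so there is one sync instead of ~80.
box_inds, cls_inds = inds_all[:, 1:].nonzero(as_tuple=True)
box_inds_cpu, cls_inds_cpu = box_inds.cpu(), cls_inds.cpu()
grouped = [box_inds_cpu[cls_inds_cpu == j] for j in range(num_classes - 1)]
```

The two variants produce the same per-class index lists; the difference is only in how many times the host has to wait for the device.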
Thanks for the pointer. Question: are you training on COCO?
One way of speeding up the evaluation is to increase the `score_thresh`
https://github.com/facebookresearch/maskrcnn-benchmark/blob/d28845e112de36781b2b5f7217a34b2b62de8d2f/maskrcnn_benchmark/modeling/roi_heads/box_head/inference.py#L30
by changing it in
https://github.com/facebookresearch/maskrcnn-benchmark/blob/d28845e112de36781b2b5f7217a34b2b62de8d2f/maskrcnn_benchmark/config/defaults.py#L167
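For reference, with the yacs-style command-line overrides that maskrcnn-benchmark supports, the threshold can also be raised without editing defaults.py (the config file below is just an example from the repo):

```shell
# Raise the score threshold at evaluation time; a higher value leaves
# far fewer low-confidence detections for filter_results to loop over.
python tools/test_net.py \
    --config-file configs/e2e_faster_rcnn_R_50_FPN_1x.yaml \
    MODEL.ROI_HEADS.SCORE_THRESH 0.5
```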
Unfortunately, modifying the threshold does not help. It seems that the `.nonzero().squeeze(1)` operation is not optimized.
I still think you are seeing a synchronization artifact that hides the real bottleneck, because `nonzero` has a sync point.
Did you try re-running the timings with `CUDA_LAUNCH_BLOCKING=1`?
In my case, there is only a single GPU.
Even with a single GPU, CUDA kernel launches are asynchronous by nature, so you would still see the timings change after adding `CUDA_LAUNCH_BLOCKING=1`.
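A minimal illustration of that point (the matrix size is arbitrary, and on a CPU-only machine the two numbers are essentially identical): without synchronization a wall-clock timer only sees the kernel launch, and the real cost surfaces later at the next sync point, such as a `nonzero` call.

```python
import time
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.rand(2048, 2048, device=device)

# On CUDA this first measurement captures mostly the kernel *launch*,
# because control returns to Python before the matmul has finished.
t0 = time.perf_counter()
y = x @ x
launch_s = time.perf_counter() - t0

# After synchronizing, the host has waited for the kernel to complete,
# so the measurement now covers the actual execution time.
if device == "cuda":
    torch.cuda.synchronize()
total_s = time.perf_counter() - t0

print(f"unsynchronized: {launch_s * 1e3:.3f} ms, "
      f"synchronized: {total_s * 1e3:.3f} ms")
```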
It seems that the `filter_results` function in the final inference step (`roi_box_head`) consumes around 40% of the whole inference time. Is it possible to accelerate it?