ZFTurbo / Weighted-Boxes-Fusion

Set of methods to ensemble boxes from different object detection models, including implementation of "Weighted boxes fusion (WBF)" method.
MIT License

Add simple confidence averaging strategy #25

Closed i-aki-y closed 3 years ago

i-aki-y commented 3 years ago

Hi, @ZFTurbo. I'm currently reading your paper and this repo, and I think I have identified the cause of the issue discussed in #10. I describe the cause in the following sections.

To fix this issue, I made a PR that proposes another conf_type. Note that this modification can change confidence scores, but not any bbox coordinates.

I'm not an expert, so I may have missed something important. If so, please feel free to point it out. I would appreciate it if you could review this, and I hope this PR will contribute to the community.

Problem

The weighted_boxes_fusion function can return confidence scores larger than 1.0, regardless of the allows_overflow parameter.

Here is a reproducible example.

from ensemble_boxes import *

boxes_list = [[
    [0.1, 0.1, 0.2, 0.2],
    [0.1, 0.1, 0.2, 0.2],

],[
    [0.3, 0.3, 0.4, 0.4],
]]
scores_list = [[1.0, 1.0], [0.5]]
labels_list = [[0, 0], [0]]
weights = [2, 1]

iou_thr = 0.5
skip_box_thr = 0.0001
sigma = 0.1

boxes, scores, labels = weighted_boxes_fusion(boxes_list, scores_list, labels_list, weights=weights, iou_thr=iou_thr, skip_box_thr=skip_box_thr, allows_overflow=True)
print(scores)
# [1.33333337 0.16666667]

boxes, scores, labels = weighted_boxes_fusion(boxes_list, scores_list, labels_list, weights=weights, iou_thr=iou_thr, skip_box_thr=skip_box_thr, allows_overflow=False)
print(scores)
# [1.33333337 0.16666667]

Why?

According to the paper [1], the confidence score is given by eqs. (1) and (5). Combining the two equations, we get the following:

C = \frac{\sum_i^T C_i}{T} \cdot \frac{\min(W, T)}{W}.

Since the current implementation takes the weights into account, we should rewrite the equation. Expanding the min(...) part explicitly, the resulting equation looks like this:

\begin{align*}
C &= \frac{\sum_i^T w_i C_i}{T} \cdot \frac{\min(W, T)}{W} \\
  &= \begin{cases}
        \frac{\sum_i^T w_i C_i}{T}, & W \le T \\
        \frac{\sum_i^T w_i C_i}{W}, & W > T
     \end{cases}
\end{align*}

where w_i is the weight of each box and W denotes the total weight over the models (weights.sum()).

The last equation implies that the result is not bounded by 1.0: when a single model contributes multiple boxes to a cluster, the sum of the box weights can exceed both T and W. In the above case, we get C = (2·1 + 2·1)/3 = 1.33... (W = 3, T = 2). This is why the result can be larger than 1.0.
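The overflow can be checked with plain arithmetic. This sketch (not the library code) plugs the example's cluster into the expanded equation above:

```python
# Plain-arithmetic reproduction of the overflow, following the expanded
# equation above (no library calls; values taken from the example).
weights = [2, 1]             # per-model weights
W = sum(weights)             # total weight over the models: 3
cluster_scores = [1.0, 1.0]  # both boxes land in the same cluster
cluster_weights = [2, 2]     # both boxes come from model 1 (weight 2)
T = len(cluster_scores)      # number of boxes in the cluster: 2

numerator = sum(w * c for w, c in zip(cluster_weights, cluster_scores))  # 4.0
C = numerator / T * min(W, T) / W  # = 4.0 / 2 * 2 / 3
print(C)  # 1.333..., larger than 1.0
```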

A proposal to fix

In this PR, I propose the following simple weighted average.

C = \frac{\sum_i^Tw_iC_i}{\sum_i^T{w_i}} 

Note that the denominator is the sum of the weights over the boxes of a cluster (not over the models, weights.sum()). Clearly, this is bounded in [0, 1] if the input scores are bounded in [0, 1].

Note that using normalized weights (so that weights.sum() == 1) does not fix the issue: with normalized weights, the resulting scores are underestimated by the division by T or W.
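A quick sketch of the normalized-weight variant on the same cluster illustrates the underestimation (plain arithmetic, not the library code):

```python
# Same cluster as before, but with weights normalized so they sum to 1.
weights = [2 / 3, 1 / 3]          # normalized per-model weights
W = sum(weights)                  # = 1.0
cluster_scores = [1.0, 1.0]
cluster_weights = [2 / 3, 2 / 3]  # both boxes from model 1
T = len(cluster_scores)           # 2

numerator = sum(w * c for w, c in zip(cluster_weights, cluster_scores))  # 4/3
C = numerator / T * min(W, T) / W  # = (4/3) / 2 * 1 / 1
print(C)  # 0.666..., well below the 1.0 that a weighted average of the scores gives
```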

Here is an example using the weighted_avg version.

boxes_list = [[
    [0.1, 0.1, 0.2, 0.2],
    [0.1, 0.1, 0.2, 0.2],

],[
    [0.3, 0.3, 0.4, 0.4],
]]
scores_list = [[1.0, 1.0], [0.5]]
labels_list = [[0, 0], [0]]
weights = [2, 1]

iou_thr = 0.5
skip_box_thr = 0.0001
sigma = 0.1

boxes, scores, labels = weighted_boxes_fusion(boxes_list, scores_list, labels_list, weights=weights, iou_thr=iou_thr, skip_box_thr=skip_box_thr, allows_overflow=False, conf_type='weighted_avg')
print(scores)
# [1.  0.5]

## use other scores
scores_list = [[0.8, 1.0], [0.5]]
boxes, scores, labels = weighted_boxes_fusion(boxes_list, scores_list, labels_list, weights=weights, iou_thr=iou_thr, skip_box_thr=skip_box_thr, allows_overflow=False, conf_type='weighted_avg')
print(scores)
# [0.89999998 0.5]

Performance

Sorry, I did not conduct any performance experiments, so I have no idea how much of an impact this fix will have on performance at this point.

[1] https://arxiv.org/abs/1910.13302

ZFTurbo commented 3 years ago

Great analysis. At first glance, your proposal is the right way to handle the weights. I'll look into it, check the performance, and most likely approve the PR.

i-aki-y commented 3 years ago

Thank you for your reply. I have also made a minor fix to the test code.

ZFTurbo commented 3 years ago

I found a problem with your approach. Say you have two models. The first model predicts a box, and the second model doesn't predict a box at the same place: the first model says "there is a box" and the second says "there is no box". You need to account for such cases. In my method, this is accounted for in the final division. But as I can see now, thanks to your great example, my WBF method only works correctly when each cluster contains at most one box from each independent model. If there are more boxes from the same model, the answers become incorrect.

So the right answer for your example is not "[1. 0.5]" but "[0.666666666 0.16666667]". Neither method gives the right answer now. ) I think your method can be fixed by adding virtual boxes with zero confidence score to the clusters, one for each absent model.

i-aki-y commented 3 years ago

Thank you for your review!

OK, I see your point. As you said, my approach does not include the effect of models that did not predict a box. I think this is an interesting point because it seems to be an inherent problem of the multi-model ensemble approach. Your virtual-box idea is also very interesting. If I understood your point correctly, would the result become like this?

C = \frac{\sum_i^Tw_iC_i}{\sum_i^T{w_i} + \sum_i^A w_i} 

where A is the number of "absent" models, i.e., models that predicted no box in the cluster.

In this case, the result of my example becomes "[0.8, 0.166...]". This improves the result, but it does not seem perfect.

We can think of another approach that takes into account the absent models.

C = \frac{\sum_i^Tw_iC_i}{\sum_i^T{w_i}} \frac{\sum_i^S w_i}{W} 

where S is the number of models that appear in the cluster and W is the same as in the discussion above. This version can be interpreted as a hybrid of two kinds of averages: a box-wise average (first term) and a model-wise average (second term).

The second equation gives the result "[0.666..., 0.166...]".

But I know it is difficult to draw conclusions from only one example.
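For concreteness, both proposals can be evaluated by hand on the running example (model 1 has weight 2 and puts two boxes with score 1.0 into cluster 1; model 2 has weight 1 and puts one box with score 0.5 into cluster 2). This is a plain-arithmetic sketch of the two formulas, not the library implementation:

```python
weights = [2, 1]
W = sum(weights)  # total weight over the models: 3

def absent_model_aware_avg(box_w, box_c, absent_w):
    # Proposal 1: treat absent models as virtual boxes with zero confidence.
    return sum(w * c for w, c in zip(box_w, box_c)) / (sum(box_w) + sum(absent_w))

def box_and_model_avg(box_w, box_c, present_w):
    # Proposal 2: box-wise weighted average times a model-presence factor.
    return sum(w * c for w, c in zip(box_w, box_c)) / sum(box_w) * sum(present_w) / W

# Cluster 1: two boxes from model 1; model 2 is absent.
c1_absent = absent_model_aware_avg([2, 2], [1.0, 1.0], absent_w=[1])  # 4/5 = 0.8
c1_hybrid = box_and_model_avg([2, 2], [1.0, 1.0], present_w=[2])      # 1.0 * 2/3 = 0.666...
# Cluster 2: one box from model 2; model 1 is absent.
c2_absent = absent_model_aware_avg([1], [0.5], absent_w=[2])          # 0.5/3 = 0.166...
c2_hybrid = box_and_model_avg([1], [0.5], present_w=[1])              # 0.5 * 1/3 = 0.166...
print(c1_absent, c1_hybrid, c2_absent, c2_hybrid)
```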

I am now planning to implement both of them and revise the PR.

I would also appreciate it if I get any comments or suggestions.

ZFTurbo commented 3 years ago

I think both of your proposals are good. The second is closer to what we have now; the first must be tested for performance. If possible, it would be better to have both as options in the code (to test them further in competitions and benchmarks). )

i-aki-y commented 3 years ago

Thanks again for your comment.

I have now revised the PR.

I introduced two conf_type options (the hybrid version: box_and_model_avg, and the virtual zero-confidence-box version: absent_model_aware_avg), while the default is still the original avg. In the modification, I made each input box store a model index in the prefilter_boxes() function. In the rescale phase, I use the model index to compute the model-wise average and identify absent models. I also revised the test cases so that different conf_type results can be checked.

I show some examples of the different conf_type results below. The results seem to work as expected, but performance is still an important factor.

I would be happy if someone could check this modification's performance in benchmarks or competitions.


boxes_list = [
    [
        [0.10, 0.10, 0.50, 0.50], # cluster 2
        [0.11, 0.11, 0.51, 0.51], # cluster 2
        [0.60, 0.60, 0.80, 0.80], # cluster 1

    ],
    [
        [0.59, 0.59, 0.79, 0.79], # cluster 1
        [0.61, 0.61, 0.81, 0.81], # cluster 1
        [0.80, 0.80, 0.90, 0.90], # cluster 3
    ],
]

scores_list = [[0.9, 0.8, 0.7], [0.85, 0.75, 0.65]]
labels_list = [[1, 1, 1], [1, 1, 0]]
weights = [2, 1]
iou_thr = 0.5
skip_box_thr = 0.0001

boxes, scores, labels = weighted_boxes_fusion(
    boxes_list,
    scores_list,
    labels_list,
    weights=weights,
    iou_thr=iou_thr,
    skip_box_thr=skip_box_thr,
    conf_type='avg'
)
print(scores)

boxes, scores, labels = weighted_boxes_fusion(
    boxes_list,
    scores_list,
    labels_list,
    weights=weights,
    iou_thr=iou_thr,
    skip_box_thr=skip_box_thr,
    conf_type='box_and_model_avg'
)
print(scores)

boxes, scores, labels = weighted_boxes_fusion(
    boxes_list,
    scores_list,
    labels_list,
    weights=weights,
    iou_thr=iou_thr,
    skip_box_thr=skip_box_thr,
    conf_type='absent_model_aware_avg'
)
print(scores)

# avg
# cluster order: [2, 1, 3]
# scores: [1.13333333 1.         0.21666667]

# box_and_model_avg
# cluster order: [1, 2, 3]
# scores: [0.75       0.56666666 0.21666667]

# absent_model_aware_avg
# cluster order: [1, 2, 3]
# scores: [0.75       0.68000001 0.21666667]
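The new conf_type scores above can be re-derived by hand from the formulas discussed earlier. In cluster 1, both models are present (one box from model 1 at 0.7, two boxes from model 2 at 0.85 and 0.75); in cluster 2, only model 1 is present (boxes at 0.9 and 0.8); in cluster 3, only model 2 is present (one box at 0.65). This is a plain-arithmetic sketch, not the library code:

```python
weights = [2, 1]  # model 1 has weight 2, model 2 has weight 1
W = sum(weights)  # 3

def box_and_model_avg(box_w, box_c, present_w):
    # Hybrid: box-wise weighted average times a model-presence factor.
    return sum(w * c for w, c in zip(box_w, box_c)) / sum(box_w) * sum(present_w) / W

def absent_model_aware_avg(box_w, box_c, absent_w):
    # Absent models contribute virtual zero-confidence boxes to the denominator.
    return sum(w * c for w, c in zip(box_w, box_c)) / (sum(box_w) + sum(absent_w))

# Cluster 1: both models present, so the presence factor is 1.
s1 = box_and_model_avg([2, 1, 1], [0.7, 0.85, 0.75], present_w=[2, 1])  # 3.0/4 = 0.75
# Cluster 2: only model 1 present.
s2_hybrid = box_and_model_avg([2, 2], [0.9, 0.8], present_w=[2])        # 0.85 * 2/3 = 0.5666...
s2_absent = absent_model_aware_avg([2, 2], [0.9, 0.8], absent_w=[1])    # 3.4/5 = 0.68
# Cluster 3: only model 2 present.
s3 = box_and_model_avg([1], [0.65], present_w=[1])                      # 0.65 * 1/3 = 0.2166...
print(s1, s2_hybrid, s2_absent, s3)
```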
ZFTurbo commented 3 years ago

I ran example_oid.py benchmark on all 3 methods.

For the default case with a large IoU, all of the methods give almost the same result.

WBF avg
mAP: 0.598214
Overall ensemble time for method: wbf: 276.93 sec
Ensemble [5] Weights: [1, 1, 1, 1, 1] Params: {'run_type': 'wbf', 'skip_box_thr': 1e-07, 'intersection_thr': 0.6, 'conf_type': 'avg', 'limit_boxes': 30000, 'verbose': True} mAP: 0.598214

WBF box_and_model_avg
mAP: 0.597953
Overall ensemble time for method: wbf: 311.21 sec
Ensemble [5] Weights: [1, 1, 1, 1, 1] Params: {'run_type': 'wbf', 'skip_box_thr': 1e-07, 'intersection_thr': 0.6, 'conf_type': 'box_and_model_avg', 'limit_boxes': 30000, 'verbose': True} mAP: 0.597953

WBF absent_model_aware_avg
Overall ensemble time for method: wbf: 308.49 sec
mAP: 0.598144
Ensemble [5] Weights: [1, 1, 1, 1, 1] Params: {'run_type': 'wbf', 'skip_box_thr': 1e-07, 'intersection_thr': 0.6, 'conf_type': 'absent_model_aware_avg', 'limit_boxes': 30000, 'verbose': True} mAP: 0.598144

Then I tried the case where many boxes from the same model go into the same cluster, using a low IoU = 0.3.

WBF avg ['intersection_thr': 0.3]
mAP: 0.494097
Overall ensemble time for method: wbf: 293.55 sec
Ensemble [5] Weights: [1, 1, 1, 1, 1] Params: {'run_type': 'wbf', 'skip_box_thr': 1e-07, 'intersection_thr': 0.3, 'conf_type': 'avg', 'limit_boxes': 30000, 'verbose': True} mAP: 0.494097

WBF box_and_model_avg ['intersection_thr': 0.3]
mAP: 0.503016
Overall ensemble time for method: wbf: 307.19 sec
Ensemble [5] Weights: [1, 1, 1, 1, 1] Params: {'run_type': 'wbf', 'skip_box_thr': 1e-07, 'intersection_thr': 0.3, 'conf_type': 'box_and_model_avg', 'limit_boxes': 30000, 'verbose': True} mAP: 0.503016

WBF absent_model_aware_avg ['intersection_thr': 0.3]
Overall ensemble time for method: wbf: 297.96 sec
mAP: 0.502302
Ensemble [5] Weights: [1, 1, 1, 1, 1] Params: {'run_type': 'wbf', 'skip_box_thr': 1e-07, 'intersection_thr': 0.3, 'conf_type': 'absent_model_aware_avg', 'limit_boxes': 30000, 'verbose': True} mAP: 0.502302

The new methods show much better precision. Very nice )

In some cases the new methods are a little slower, but not by much.

i-aki-y commented 3 years ago

Great. That's good news. I'm glad to get that result. Thanks for all the advice so far.