IDEA-Research / DINO

[ICLR 2023] Official implementation of the paper "DINO: DETR with Improved DeNoising Anchor Boxes for End-to-End Object Detection"
Apache License 2.0

Focal cost in matcher.py #139

Open berceanbogdan opened 1 year ago

berceanbogdan commented 1 year ago

The class cost in matcher.py is computed as follows:

     # Compute the classification cost.
     neg_cost_class = (1 - alpha) * (out_prob ** gamma) * (-(1 - out_prob + 1e-8).log()) # line 79
     pos_cost_class = alpha * ((1 - out_prob) ** gamma) * (-(out_prob + 1e-8).log())      # line 80
     cost_class = pos_cost_class[:, tgt_ids] - neg_cost_class[:, tgt_ids]                # line 81

The definition of focal loss is: FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t), where p_t = p when the ground-truth label is 1 and p_t = 1 - p otherwise.

My question is: why are both neg_cost_class and pos_cost_class computed and then combined? The cost should always be computed on a target class and the predicted probability for that same class, which means Y is always 1. The way I understand it, your code always uses both branches. Did I misinterpret something?

Thank you for the great work! It helped me a lot.

HaoZhang534 commented 1 year ago

We follow previous works in using the focal cost, which contains both a positive and a negative part. We include the negative part because we not only expect a prediction to have a high probability for the positive class, but also expect it to have low probabilities for the negative classes. Note that we use sigmoid to output the probabilities, so each class probability is independent of the others; therefore we need to explicitly push the negative ones down.
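
A small sketch of that sigmoid-versus-softmax point (toy tensors, not code from the repository):

    import torch

    logits = torch.randn(2, 5)        # 2 queries, 5 classes
    sig = torch.sigmoid(logits)       # per-class probabilities, scored independently
    soft = logits.softmax(-1)         # softmax couples the classes together

    print(sig.sum(-1))                # generally != 1: one class being high says nothing about the others
    print(soft.sum(-1))               # always 1: raising one class automatically lowers the rest

With softmax, pushing up the target class automatically pushes down the others; with independent sigmoids it does not, which is the independence being pointed to above.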

berceanbogdan commented 1 year ago

I understand that this is the reason we generally use the negative branch in the focal loss. What I don't understand is how you do that in your implementation, since in line 81 you, rightfully, filter out the negative samples, yet you still add the negative branch. For comparison, this is the original DETR implementation (without focal loss):

    # Compute the classification cost. Contrary to the loss, we don't use the NLL,
    # but approximate it in 1 - proba[target class].
    # The 1 is a constant that doesn't change the matching, it can be omitted.
    cost_class = -out_prob[:, tgt_ids]

They use only the positive branch of the NLL (actually an approximation of the NLL), and that makes sense to me.
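
As a side note, the point in that DETR comment about the constant 1 can be checked directly. A toy sketch (assuming scipy is available; the DETR-style matchers use scipy.optimize.linear_sum_assignment, and the probabilities below are made up):

    import torch
    from scipy.optimize import linear_sum_assignment

    # Toy cost: probability of each target's class under each of 3 queries.
    out_prob = torch.tensor([[0.2, 0.7],
                             [0.9, 0.1],
                             [0.4, 0.5]])

    c_full  = (1 - out_prob).numpy()   # "1 - proba[target class]"
    c_short = (-out_prob).numpy()      # drop the constant 1, as in the DETR snippet

    print(linear_sum_assignment(c_full))
    print(linear_sum_assignment(c_short))  # identical assignment: a constant offset never changes it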

Any thoughts?

HaoZhang534 commented 1 year ago

@berceanbogdan We do not filter out negative examples. cost_class is a matrix that contains both the positive and the negative costs.
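
To make the shapes concrete, here is a toy sketch around the same three cost lines (the sizes are made up, and alpha/gamma are set to the usual focal-loss defaults): the [:, tgt_ids] indexing only chooses which class column each cost is read from; both focal branches are still evaluated at that entry, and every query keeps a cost against every target.

    import torch

    num_queries, num_classes = 4, 6
    out_prob = torch.sigmoid(torch.randn(num_queries, num_classes))  # independent sigmoid probabilities
    tgt_ids = torch.tensor([2, 5])                                   # classes of the two ground-truth boxes
    alpha, gamma = 0.25, 2.0

    neg_cost_class = (1 - alpha) * (out_prob ** gamma) * (-(1 - out_prob + 1e-8).log())
    pos_cost_class = alpha * ((1 - out_prob) ** gamma) * (-(out_prob + 1e-8).log())
    print(pos_cost_class.shape, neg_cost_class.shape)  # both [4, 6]: every query x every class

    cost_class = pos_cost_class[:, tgt_ids] - neg_cost_class[:, tgt_ids]
    print(cost_class.shape)                            # [4, 2]: every query x every target, nothing dropped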

ShijieVVu commented 1 year ago

> I understand that this is the reason we generally use the negative branch in the focal loss. What I don't understand is how you do that in your implementation, since in line 81 you, rightfully, filter out the negative samples, yet you still add the negative branch. For comparison, this is the original DETR implementation (without focal loss):
>
>     # Compute the classification cost. Contrary to the loss, we don't use the NLL,
>     # but approximate it in 1 - proba[target class].
>     # The 1 is a constant that doesn't change the matching, it can be omitted.
>     cost_class = -out_prob[:, tgt_ids]
>
> They use only the positive branch of the NLL (actually an approximation of the NLL), and that makes sense to me.
>
> Any thoughts?

If only the positive cost term is used, then when the target-class probability is close to 1 its gradient is close to zero, so it has almost no effect on the class matching. If the negated probability is used instead, its gradient always has magnitude 1.
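
A quick numerical check of that claim (a sketch only; the matcher never backpropagates through the cost, this is just about how sharply each cost reacts to p near 1, with alpha and gamma at the usual focal defaults):

    import torch

    alpha, gamma = 0.25, 2.0

    for p_val in (0.5, 0.9, 0.99):
        p = torch.tensor(p_val, requires_grad=True)
        focal_pos = alpha * ((1 - p) ** gamma) * (-(p + 1e-8).log())  # positive focal term only
        (g_focal,) = torch.autograd.grad(focal_pos, p)
        (g_neg,) = torch.autograd.grad(-p, p)                         # DETR-style negated probability
        print(p_val, g_focal.item(), g_neg.item())

The positive focal term flattens out as p approaches 1 (its gradient shrinks toward 0), while -p keeps a constant slope of -1, which is the point being made here.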

ShijieVVu commented 1 year ago

The good thing about the focal cost is that it is very sensitive when p is close to 0 or 1, which means classes are matched better in the later stages.
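
A rough numerical illustration of that sensitivity (same toy alpha and gamma as above; the printed values are approximate):

    import torch

    alpha, gamma = 0.25, 2.0
    p = torch.tensor([0.01, 0.1, 0.5, 0.9, 0.99])   # target-class probabilities

    pos = alpha * ((1 - p) ** gamma) * (-(p + 1e-8).log())
    neg = (1 - alpha) * (p ** gamma) * (-(1 - p + 1e-8).log())
    print(pos - neg)   # roughly 1.13, 0.46, -0.09, -1.40, -3.39

Moving p from 0.9 to 0.99 changes this cost by about -2, whereas the plain -p cost would change by only -0.09, so the focal cost separates confident predictions far more strongly.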