facebookresearch / segment-anything

The repository provides code for running inference with the Segment Anything Model (SAM), links for downloading the trained model checkpoints, and example notebooks that show how to use the model.
Apache License 2.0

What does 'IoU score' mean after decoding? #495

Open ehdmso opened 1 year ago

ehdmso commented 1 year ago

Hi all. I really appreciate the awesome work, Segment Anything!

By the way, I want to know the meaning of the "IoU score" and how it is calculated.

As far as I know, IoU means intersection over union, so its value should range from 0 to 1.

Unfortunately, when I tried an example notebook (segment-anything/notebooks/predictor_example.ipynb),

[screenshots: notebook cells showing predicted masks with scores above 1.0]

I got a score over 1.
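For context, the relevant call from the notebook looks roughly like this (a sketch; the checkpoint path, image, and prompt point are placeholders taken from the example):

```python
import cv2
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

# load SAM and wrap it in the predictor (checkpoint path is a placeholder)
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)

image = cv2.cvtColor(cv2.imread("images/truck.jpg"), cv2.COLOR_BGR2RGB)
predictor.set_image(image)

# a single foreground point prompt, as in the notebook
masks, scores, logits = predictor.predict(
    point_coords=np.array([[500, 375]]),
    point_labels=np.array([1]),
    multimask_output=True,
)
print(scores)  # these are the "IoU scores" in question; one came out above 1
```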

So I wanted to find out what that "score" means, and I searched through all of the released code,

but I could not find it yet :(

My questions are:

  1. Can "IoU score" be over 1?
  2. Can the output of the MLP work as a 'score', such as a confidence score?

Thank you in advance.

heyoeyo commented 1 year ago

Yes, I've seen the IoU score go over 1.0 a few times as well.

Internally, the IoU score is generated by the iou_prediction_head layer, which is implemented as a multi-layer perceptron (MLP). This means there's nothing (internally) stopping the value from going over 1 (or even going negative, potentially?).
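Here's a minimal sketch of what that head boils down to (my paraphrase of the repo's MLP module, simplified, not a verbatim copy):

```python
import torch.nn as nn
import torch.nn.functional as F

class MLP(nn.Module):
    """Simplified stand-in for SAM's iou_prediction_head."""

    def __init__(self, input_dim: int, hidden_dim: int, output_dim: int, num_layers: int):
        super().__init__()
        dims = [input_dim] + [hidden_dim] * (num_layers - 1) + [output_dim]
        self.layers = nn.ModuleList(
            nn.Linear(d_in, d_out) for d_in, d_out in zip(dims[:-1], dims[1:])
        )

    def forward(self, x):
        for i, layer in enumerate(self.layers):
            # ReLU on the hidden layers, but the final layer is left linear,
            # so the output is unbounded (it can exceed 1 or go negative)
            x = F.relu(layer(x)) if i < len(self.layers) - 1 else layer(x)
        return x
```

The key detail is that last plain nn.Linear: with no squashing function after it, an occasional score slightly above 1 is entirely possible.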

In the paper, they mention (on page 5, under the 'Resolving ambiguity' section) that:

To rank masks, the model predicts a confidence score (i.e., estimated IoU) for each mask

So it seems the value will tend to be between 0 and 1 only because it's trained to predict the IoU of its own outputs (i.e. how close it thinks its own mask predictions are to a hypothetical ground truth), and since the IoUs it trained on would have always been in the 0 to 1 range, it tends to reproduce that range.
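(For reference, the "real" IoU computed between two masks is bounded to [0, 1] by construction; a minimal sketch, assuming boolean mask arrays:)

```python
import numpy as np

def mask_iou(pred_mask: np.ndarray, gt_mask: np.ndarray) -> float:
    """Intersection-over-union of two boolean masks; always in [0, 1]."""
    intersection = np.logical_and(pred_mask, gt_mask).sum()
    union = np.logical_or(pred_mask, gt_mask).sum()
    return float(intersection) / float(union) if union > 0 else 0.0
```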

So to answer your questions:

  1. Yes, it can go over 1 (though in my experience it usually doesn't, and when it does, it's just barely over 1)
  2. Yes, the original paper even refers to the IoU prediction as a 'confidence score'

ehdmso commented 1 year ago

Thanks for your kind reply :) I really appreciate it.

I completely agree with your ideas.

By the way, I still have a question about the meaning of "IoU" score and confidence score.

As far as I understand, the definition of an IoU score and the definition of a confidence score are different. (Even in YOLO, where the confidence score is calculated using the IoU, the two are not the same thing.)

I can understand that the value refers to the model's own predicted quality of the mask.

So what I am curious about is: 1) Does the value produced by the MLP layers (nn.Linear's plus activations) actually represent an "intersection over union"? 2) Why is the output of the MLP (i.e., the predicted quality of the mask) named "intersection over union"?

I would be grateful if you could shed some light on these points. Thank you in advance for your help!

ehdmso commented 1 year ago

BTW, I have figured out a clue to the score range issue on my own here

heyoeyo commented 1 year ago

Oh good catch on the disabled sigmoid on the MLP layer! Interesting that they don't use that to limit the output range (maybe the sigmoid makes the model harder to train? I'm not sure)
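For what it's worth, here's what enabling that sigmoid would do to some hypothetical raw outputs (values made up for illustration):

```python
import torch

raw = torch.tensor([1.03, 0.97, -0.02])  # hypothetical raw MLP outputs
print(torch.sigmoid(raw))  # tensor([0.7370, 0.7251, 0.4950]) -- bounded to (0, 1)
```

One plausible downside is that the sigmoid saturates, so gradients vanish for predictions pushed toward 0 or 1, which could indeed make training harder (just a guess though).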

As for IoU vs confidence score, I think an important point is that while they refer to an 'IoU score' in many places, the code (as well as parts of the paper) refers to it more specifically as an 'IoU prediction' (for example, the last MLP layer is named iou_prediction_head, and the output of the decoder uses the term iou_pred rather than 'iou_score').

So it's not that the model is calculating the intersection-over-union value directly; instead, it's trying to predict what the value would be if it could calculate it. Ideally, it would calculate it directly, but that requires knowing the ground truth mask, which it can't have in normal use (if you had the ground truth mask to begin with, you wouldn't need the model!).

As for why it's named "intersection over union", instead of just "confidence score" or something similar, I think it comes from how the model is trained. In the paper (on page 17, under the 'Losses' section) they say:

The IoU prediction head is trained with mean-square-error loss between the IoU prediction and the predicted mask’s IoU with the ground truth mask

So it sounds like during training, they had the model generate its mask prediction plus its IoU prediction, then directly calculated the 'real' IoU between the predicted mask and the ground truth mask used for training, and included the error between the two IoU values in the loss function. With that IoU error being part of the loss, the model weights would have been updated to minimize it, meaning the model's output IoU score would tend to match the 'real' IoU score. So even though the model's output IoU is used like a confidence score, it was trained to predict the actual IoU, and that's (I'd guess) why they named it that way.
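As a rough sketch of that training signal (my paraphrase of the paper's description, not the actual training code, which wasn't released; thresholding logits at 0 matches the repo's default mask_threshold):

```python
import torch
import torch.nn.functional as F

def iou_prediction_loss(pred_mask_logits: torch.Tensor,
                        gt_mask: torch.Tensor,
                        iou_pred: torch.Tensor) -> torch.Tensor:
    """MSE between the model's IoU prediction and the 'real' IoU of
    its predicted mask against a boolean ground truth mask."""
    pred_mask = pred_mask_logits > 0  # binarize the predicted mask
    intersection = (pred_mask & gt_mask).sum().float()
    union = (pred_mask | gt_mask).sum().float().clamp(min=1)
    actual_iou = intersection / union  # always in [0, 1]
    return F.mse_loss(iou_pred, actual_iou)
```

Since the targets (actual_iou) always fall in [0, 1], the predictions learn to stay near that range even without a sigmoid forcing them to.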

ehdmso commented 1 year ago

Oh I sincerely appreciate your detailed and kind response. Your explanation makes sense and is logically clear and easy to grasp. Now I understand how the IoU can be a value over 1, given that the sigmoid option is set to False.

Once again, thank you for your helpful response.

venkatesh-thiru commented 1 year ago

Are there other works that describe models predicting the IoU of segmentation masks, or is this some kind of novelty in SAM?

heyoeyo commented 1 year ago

It's not a segmentation model, but the original YOLO object detector seems to do something similar. From the original paper, in the 'Unified Detection' section, they mention:

Each bounding box consists of 5 predictions: x, y, w, h, and confidence... the confidence prediction represents the IOU between the predicted box and any ground truth box

I'm not familiar enough with ML to know if there's a name for this type of thing, but it does seem quite common for models to output some kind of confidence value alongside their other outputs, especially when there are multiple outputs to choose from (e.g. multiple masks for SAM, multiple bounding boxes for YOLO).
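For example, in SAM's case the typical use of the IoU prediction is just to rank the multiple mask outputs and keep the best one (a small sketch, reusing the masks/scores arrays from a predictor.predict call like the one earlier in this thread):

```python
import numpy as np

best_idx = int(np.argmax(scores))  # index of the highest predicted IoU
best_mask = masks[best_idx]
print(f"kept mask {best_idx} with predicted IoU {scores[best_idx]:.3f}")
```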