THU-MIG / yolov10

YOLOv10: Real-Time End-to-End Object Detection [NeurIPS 2024]
https://arxiv.org/abs/2405.14458
GNU Affero General Public License v3.0

Question about `v10postprocess` #382

Open likelikeslike opened 3 months ago

likelikeslike commented 3 months ago

I am a little bit confused about the v10postprocess function.

https://github.com/THU-MIG/yolov10/blob/cd2f79c70299c9041fb6d19617ef1296f47575b1/ultralytics/utils/ops.py#L851-L864

In the code above, why are the scores re-selected in line 860? The flatten operation also confuses me.

If my understanding is correct, lines 854-858 select max_det predictions based on the maximum class score of each prediction, so the selected predictions are all distinct. However, line 860 selects from scores.flatten(1), which may select the same prediction multiple times.

For example:

scores = torch.tensor(
    [
        [
            [0.8, 0.2, 0, 0, 0],
            [0.5, 0.25, 0.1, 0.05, 0.05],
            [0.19, 0.18, 0.15, 0.1, 0.1],
            [0.12, 0.08, 0.08, 0.08, 0.05],
        ]
    ]
)
max_scores, index = torch.topk(scores.amax(dim=-1), 3, dim=-1)
scores_1 = torch.gather(scores, dim=1, index=index.unsqueeze(-1).repeat(1, 1, scores.shape[-1]))
scores_2, index_ = torch.topk(scores_1.flatten(1), 3, dim=-1)

>>> scores_1
tensor([[[0.8000, 0.2000, 0.0000, 0.0000, 0.0000],
         [0.5000, 0.2500, 0.1000, 0.0500, 0.0500],
         [0.1900, 0.1800, 0.1500, 0.1000, 0.1000]]])

>>> scores_2
tensor([[0.8000, 0.5000, 0.2500]])

In this example, suppose max_det = 3. scores_1 keeps the first three predictions, but scores_2 selects the second prediction twice (0.5 and 0.25 both come from it), so the third prediction is dropped. After v10postprocess this results in two identical bounding boxes with different classes. Is this the expected behavior?

Please correct me if my understanding is wrong, and thanks for your explanation!
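
To make the duplication concrete, the sketch below decodes the flat top-k positions back into a prediction index and a class id. It assumes the flat index is decoded with `// nc` and `% nc`; the names `flat_index`, `rows` and `box_index` are illustrative and not taken from the repository.

```python
import torch

nc = 5       # number of classes in the toy example
max_det = 3

scores = torch.tensor([[
    [0.8, 0.2, 0.0, 0.0, 0.0],
    [0.5, 0.25, 0.1, 0.05, 0.05],
    [0.19, 0.18, 0.15, 0.1, 0.1],
    [0.12, 0.08, 0.08, 0.08, 0.05],
]])

# Stage 1: keep the max_det predictions with the highest best-class score.
_, index = torch.topk(scores.amax(dim=-1), max_det, dim=-1)
scores_1 = torch.gather(scores, dim=1, index=index.unsqueeze(-1).repeat(1, 1, nc))

# Stage 2: top-k over every (prediction, class) pair of the kept predictions.
scores_2, flat_index = torch.topk(scores_1.flatten(1), max_det, dim=-1)

# Decode each flat position into a row of scores_1 and a class id.
rows = flat_index // nc
labels = flat_index % nc
# Map rows of scores_1 back to prediction indices in the original tensor.
box_index = torch.gather(index, dim=1, index=rows)

print(box_index.tolist())  # [[0, 1, 1]] -> prediction 1 is selected twice
print(labels.tolist())     # [[0, 0, 1]] -> with two different classes
```

Prediction 2 (best score 0.19) never reaches the output, because 0.25 from prediction 1's second class outranks it.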

kristi700 commented 3 months ago

I have stumbled upon the same "error" in my code as well.

scores, index = torch.max(scores, dim=-1) seems to be the correct way because, with the current implementation, some boxes end up with false scores and labels.

import torch


def v10postprocess(preds, max_det, nc=80):
    # preds: (batch, num_preds, 4 + nc), box coordinates followed by class scores
    assert 4 + nc == preds.shape[-1]
    boxes, scores = preds.split([4, nc], dim=-1)
    # Keep the max_det predictions with the highest best-class score
    max_scores = scores.amax(dim=-1)
    max_scores, index = torch.topk(max_scores, max_det, dim=-1)
    index = index.unsqueeze(-1)
    boxes = torch.gather(boxes, dim=1, index=index.repeat(1, 1, boxes.shape[-1]))
    scores = torch.gather(scores, dim=1, index=index.repeat(1, 1, scores.shape[-1]))

    # One (score, label) pair per kept box. index is already the class id here,
    # so no % nc or // nc decoding is needed.
    scores, index = torch.max(scores, dim=-1)
    labels = index
    return boxes, scores, labels

Any answer from the authors would be appreciated!
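
As a quick check of the one-label-per-box idea, here is a minimal self-contained sketch run on the toy scores from the question. The helper name `single_label_postprocess`, the dummy box values, and the extra returned `index` are assumptions made for illustration; this is not the repository's implementation.

```python
import torch


def single_label_postprocess(preds, max_det, nc):
    """Hypothetical helper: keep up to max_det boxes, one best class per box."""
    boxes, scores = preds.split([4, nc], dim=-1)
    best_scores, labels = scores.max(dim=-1)      # best class of every prediction
    k = min(max_det, preds.shape[1])              # topk fails if k > num_preds
    top_scores, index = torch.topk(best_scores, k, dim=-1)
    boxes = torch.gather(boxes, 1, index.unsqueeze(-1).repeat(1, 1, 4))
    labels = torch.gather(labels, 1, index)
    return boxes, top_scores, labels, index


scores = torch.tensor([[
    [0.8, 0.2, 0.0, 0.0, 0.0],
    [0.5, 0.25, 0.1, 0.05, 0.05],
    [0.19, 0.18, 0.15, 0.1, 0.1],
    [0.12, 0.08, 0.08, 0.08, 0.05],
]])
# Dummy boxes: prediction i gets coordinates (i, i, i, i) so it is easy to trace.
dummy_boxes = torch.arange(4, dtype=torch.float32).view(1, 4, 1).repeat(1, 1, 4)
preds = torch.cat([dummy_boxes, scores], dim=-1)

boxes, top_scores, labels, index = single_label_postprocess(preds, max_det=3, nc=5)
print(index.tolist())   # [[0, 1, 2]] -> three distinct predictions
print(labels.tolist())  # [[0, 0, 0]]
```

The trade-off is visible here: this variant returns prediction 2 with score 0.19 instead of a second entry for prediction 1 with score 0.25.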

Love-syntacticSugar commented 1 month ago

I also noticed this issue, but I don’t think it was introduced by the authors of v10, since the same thing is already done in the non_max_suppression function in ops.py from v8. Specifically, the code is as follows:

if multi_label:
    i, j = torch.where(cls > conf_thres)
    x = torch.cat((box[i], x[i, 4 + j, None], j[:, None].float(), mask[i]), 1)

And:

if rotated:
    boxes = torch.cat((x[:, :2] + c, x[:, 2:4], x[:, -1:]), dim=-1)  # xywhr
    i = nms_rotated(boxes, scores, iou_thres)
else:
    boxes = x[:, :4] + c  # boxes (offset by class)
    i = torchvision.ops.nms(boxes, scores, iou_thres)  # NMS

Doing it this way can produce duplicate predicted boxes. I have some understanding of why this is done, but I won't go into details here.

likelikeslike commented 1 month ago

Thank you for the explanation!

I believe the code you referred to is about multi-label detection, i.e., one object (bounding box) can have multiple labels (e.g., a car can be labeled as both "vehicle" and "car"). That makes sense, as a single bounding box can then have multiple predictions. However, I didn’t notice any specific multi-label configuration in v10postprocess.

Regarding the rotated branch, I haven't looked at this part of YOLOv8 yet, so I don't have any thoughts on it at the moment.

Love-syntacticSugar commented 1 month ago

Yes, in v8, the multi_label branch is used to perform Non-Maximum Suppression (NMS) separately for each category. Although v10 does not explicitly mention multi_label, the issue you raised is essentially about doing the same thing. I think if you look at the source code of v8, you will understand it~ (Wish you all the best!)

likelikeslike commented 1 month ago

Actually, I don't quite understand what you mean by "multi_label branch is used to perform NMS separately for each category."

As I understand it, when YOLOv8’s multi_label option is enabled, every class whose score exceeds conf_thres is kept for a box; when it is disabled, only the highest-scoring class is kept, provided its score exceeds conf_thres. I can see the reason behind this, and for single-label object detection this option would typically be turned off.

Here is the code in v8: https://github.com/ultralytics/ultralytics/blob/5dcaa0aa06ad580d434a7adfe145fab13354ab5d/ultralytics/utils/ops.py#L266-L271
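
The difference between the two modes can be sketched on a toy score matrix. This is a simplified illustration of the candidate-selection step only (no boxes, masks or NMS); the score values and the names `multi` and `single` are made up for the example.

```python
import torch

conf_thres = 0.2
# Class scores for three candidate boxes over three classes (made-up values).
cls = torch.tensor([
    [0.9, 0.3, 0.1],
    [0.1, 0.6, 0.5],
    [0.1, 0.15, 0.05],
])

# multi_label enabled: one candidate per (box, class) pair above the threshold.
box_idx, cls_idx = torch.where(cls > conf_thres)
multi = list(zip(box_idx.tolist(), cls_idx.tolist()))

# multi_label disabled: only the best class of each box, if above the threshold.
conf, best = cls.max(dim=1)
keep = conf > conf_thres
single = list(zip(torch.arange(cls.shape[0])[keep].tolist(), best[keep].tolist()))

print(multi)   # [(0, 0), (0, 1), (1, 1), (1, 2)]
print(single)  # [(0, 0), (1, 1)]
```

With the option enabled, boxes 0 and 1 each appear twice with different classes; with it disabled, every box appears at most once.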

However, this still does not explain why YOLOv10 would apply multi-label-style box selection by default, especially since the official paper states that YOLOv10 performs single-label object detection on the MS COCO dataset. It doesn't make sense for multi-label to be the default in this case. At the very least, it should provide a parameter as YOLOv8 does.

Love-syntacticSugar commented 1 month ago

1. Fully understanding "multi_label branch is used to perform NMS separately for each category" requires a grasp of the following operations, particularly the addition of c to the box coordinates: https://github.com/ultralytics/ultralytics/blob/5dcaa0aa06ad580d434a7adfe145fab13354ab5d/ultralytics/utils/ops.py#L284C1-L293C44

2. As I haven't had the opportunity to read the original YOLOv10 paper, I can't confirm that "the official paper states that YOLOv10 performs single-label object detection on the MS COCO dataset". If that is correct, then this does seem to be a contradiction.
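
The class-offset idea in point 1 can be illustrated without any dependencies. The sketch below is a plain-Python greedy NMS written for this example, not the torchvision implementation, and `MAX_WH = 7680` is an assumed offset scale; all it needs is to be larger than any image dimension.

```python
def iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


def nms(boxes, scores, iou_thres):
    """Greedy NMS: returns indices of kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[k]) <= iou_thres for k in keep):
            keep.append(i)
    return keep


MAX_WH = 7680  # assumed offset scale, larger than any image side
boxes = [(10, 10, 50, 50), (10, 10, 50, 50), (12, 12, 52, 52)]
scores = [0.9, 0.6, 0.5]
classes = [0, 1, 0]

# Class-agnostic: identical boxes suppress each other regardless of class.
agnostic = nms(boxes, scores, 0.5)

# Class-aware: shift each box by class_id * MAX_WH so that boxes of
# different classes can never overlap, then run one NMS pass over everything.
shifted = [tuple(v + c * MAX_WH for v in b) for b, c in zip(boxes, classes)]
per_class = nms(shifted, scores, 0.5)

print(agnostic)   # [0]
print(per_class)  # [0, 1]
```

With the offset, boxes 0 and 1 both survive even though their coordinates are identical, because they carry different classes. That is how a multi-label candidate list can end up as duplicate boxes in the final output.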

Love-syntacticSugar commented 1 month ago

To be honest, whether it's v10 or v8, I believe there is an issue with keeping duplicate boxes.

For example: in an image, a 'football field' and a 'playground' may overlap, with the football field enclosed by the playground and both sharing the same center point (this is common in the DOTA dataset). In this case, there's a chance that one anchor predicts both the football field and the playground at the same time. In reality, however, the playground is definitely larger than the football field, so I think having one box predict multiple objects is problematic (this might be the reason why v8 has a lower AP when predicting football fields in the DOTA dataset).

Fortunately, v8 later uses ProbIoU for further filtering, so the probability of one box predicting two objects is relatively low. v10 is different, though: this issue occurs more frequently there.

These are just some of my personal thoughts, and I may be wrong. Please bear with me. I will also try some improvements in the future to see if a better alternative can be found.