Issues with BoxMatcher and loss

WongKinYiu / YOLO

An MIT rewrite of YOLOv9

MIT License

574 stars, 63 forks

I saw the note in the readme about slower convergence and thought I'd try to help. These are the potential issues I've seen, though there may be others as well.

BoxMatcher:

When align_cls is normalized, the target_matrix has not been adjusted for the top-k mask or for which GT actually won the duplicate-resolution step. The same applies to iou_mat. As a result, some assignments get align_cls values that differ from other YOLO implementations.
Duplicate assignments in YOLO MIT are filtered based on the full cost matrix. Duplicate resolution in other single-shot detector variants appears to use the (C)IoU cost only. I'm not sure how this affects training.
While the no-box case was fixed in #88, there might still be a rare issue when target boxes exist but all of them are too small to overlap with any anchor. This is probably only a concern with custom datasets or loaders.
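
To make the first two points concrete, here is a minimal pure-Python sketch, not the YOLO MIT code. The function names, the toy matrices, and the normalization formula (metric times the GT's best kept IoU, divided by the GT's best kept metric) are assumptions modeled on how other task-aligned assigners appear to work. It resolves duplicates by IoU only, then normalizes using only the entries that survived the mask.

```python
def resolve_duplicates_by_iou(mask, iou):
    """Give each anchor claimed by several GTs to the GT with the highest IoU."""
    resolved = [row[:] for row in mask]
    for a in range(len(iou[0])):
        owners = [g for g in range(len(iou)) if mask[g][a]]
        if len(owners) > 1:
            best = max(owners, key=lambda g: iou[g][a])
            for g in owners:
                resolved[g][a] = int(g == best)
    return resolved


def normalized_align(metric, iou, mask, eps=1e-7):
    """Per-anchor alignment score, normalized using only entries kept by mask."""
    num_anchors = len(iou[0])
    scores = [0.0] * num_anchors
    for g in range(len(iou)):
        kept_metric = [metric[g][a] * mask[g][a] for a in range(num_anchors)]
        kept_iou = [iou[g][a] * mask[g][a] for a in range(num_anchors)]
        max_metric, max_iou = max(kept_metric), max(kept_iou)
        for a in range(num_anchors):
            score = kept_metric[a] * max_iou / (max_metric + eps)
            scores[a] = max(scores[a], score)
    return scores


# Toy case (made-up numbers): 2 GT boxes (rows) x 3 anchors (columns).
# Anchor 1 is in the top-k set of both GTs.
iou = [[0.8, 0.6, 0.0], [0.0, 0.7, 0.5]]
metric = [[0.4, 0.5, 0.0], [0.0, 0.3, 0.2]]
topk_mask = [[1, 1, 0], [0, 1, 1]]

# IoU-only resolution hands anchor 1 to GT 1 (0.7 > 0.6), although GT 0 has
# the higher full metric there (0.5 > 0.3).
final_mask = resolve_duplicates_by_iou(topk_mask, iou)
masked_scores = normalized_align(metric, iou, final_mask)

# For comparison: with no masking at all, the entry GT 0 lost still sets
# GT 0's row maximum and changes the score of anchor 0.
no_mask = [[1] * len(iou[0]) for _ in iou]
unmasked_scores = normalized_align(metric, iou, no_mask)
```

In this toy case anchor 0 scores about 0.8 when the matrices are masked first and about 0.64 when they are not, which is the kind of mismatch described above. A GT whose mask row is entirely zero (for example, a box too small to reach any anchor) contributes a score of 0 everywhere here because of eps in the denominator.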

Loss:

The CIoU loss is missing no_grad around the weight (alpha) of the aspect-ratio penalty term. Adding it may improve loss stability. See https://github.com/pytorch/vision/blob/945bdad7523806b15d3740ce6ace2fced9ef9d3b/torchvision/ops/ciou_loss.py#L62
The eps value of 1e-9 can cause instability when training with float16 or bfloat16 (in float16, for instance, 1e-9 rounds to zero). 1e-7 appears more stable.
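
Both loss points can be illustrated with a small stdlib-only sketch. It is a scalar CIoU for two (x1, y1, x2, y2) boxes, written for illustration and not taken from YOLO MIT or torchvision; the names as_float16 and ciou_loss are hypothetical. Plain Python has no autograd, so the place where no_grad would go in a tensor version is only marked by a comment. The float16 round trip uses the struct module's half-precision format and says nothing about bfloat16.

```python
import math
import struct


def as_float16(x):
    """Round a Python float to IEEE half precision and back."""
    return struct.unpack("e", struct.pack("e", x))[0]


def ciou_loss(pred, gt, eps=1e-7):
    """Scalar CIoU loss for two (x1, y1, x2, y2) boxes with positive size."""
    px1, py1, px2, py2 = pred
    gx1, gy1, gx2, gy2 = gt

    # Plain IoU.
    inter_w = max(0.0, min(px2, gx2) - max(px1, gx1))
    inter_h = max(0.0, min(py2, gy2) - max(py1, gy1))
    inter = inter_w * inter_h
    union = (px2 - px1) * (py2 - py1) + (gx2 - gx1) * (gy2 - gy1) - inter
    iou = inter / (union + eps)

    # Squared centre distance over squared diagonal of the enclosing box.
    enclose_w = max(px2, gx2) - min(px1, gx1)
    enclose_h = max(py2, gy2) - min(py1, gy1)
    diag2 = enclose_w ** 2 + enclose_h ** 2 + eps
    dist2 = ((px1 + px2 - gx1 - gx2) ** 2 + (py1 + py2 - gy1 - gy2) ** 2) / 4

    # Aspect-ratio consistency term.
    v = (4 / math.pi ** 2) * (
        math.atan((gx2 - gx1) / (gy2 - gy1)) - math.atan((px2 - px1) / (py2 - py1))
    ) ** 2

    # In a tensor version, this weight is what would be computed under
    # no_grad so that it acts as a constant during backpropagation.
    alpha = v / (1 - iou + v + eps)

    return 1 - iou + dist2 / diag2 + alpha * v


same_box_loss = ciou_loss((0, 0, 2, 2), (0, 0, 2, 2))
disjoint_loss = ciou_loss((0, 0, 1, 1), (2, 2, 3, 3))

# An eps of 1e-9 does not survive a cast to float16; 1e-7 does.
eps_small_fp16 = as_float16(1e-9)
eps_large_fp16 = as_float16(1e-7)
```

With identical boxes, 1 - iou and v are both zero, so eps is the only thing keeping the denominator of alpha away from zero. If eps itself becomes zero after a cast to half precision, that guard is gone.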

Hi,

Thanks for raising this issue! I'm also working on figuring out why the convergence speed isn't as good as the stable version. Yesterday, I made some changes in commit fd5413f77d03f91b48eebba7dc1b98582bee93ad. If you have time, feel free to take a look.

Here’s a summary of the changes:

Adjusted EPS from 1e-9 to 1e-7
Switched to using the assignment matrix instead of the weighted matrix
Modified the filtering logic to prioritize checking for duplicate bounding boxes

I believe some of these changes align with what you mentioned in this issue. Moving forward, I'll add no_grad to CIoU and implement logic to filter out extremely small bounding boxes. I’ll also retrain the model to check if the convergence issue still persists.
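
As a rough illustration of the small-box filter, here is one possible reading, not the planned implementation. The helper name drop_tiny_boxes and the min_side threshold of 2 pixels are assumptions made for the example.

```python
def drop_tiny_boxes(boxes, min_side=2.0):
    """Keep (x1, y1, x2, y2) boxes whose width and height are both >= min_side."""
    return [
        box
        for box in boxes
        if (box[2] - box[0]) >= min_side and (box[3] - box[1]) >= min_side
    ]


# Made-up targets: the second box is half a pixel wide and gets dropped.
targets = [(10.0, 10.0, 50.0, 40.0), (5.0, 5.0, 5.5, 9.0), (0.0, 0.0, 2.0, 2.0)]
kept = drop_tiny_boxes(targets)
```

If every target in an image is dropped, the result is an empty list, so the matcher would have to fall back to the no-box path from #88.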

The changes so far only impacted the loss by about 0.001, so I suspect the issue may still exist. Let’s continue troubleshooting together!

Best regards, Henry Tsui

Issues with BoxMatcher and loss #103