ZhangGongjie / SAM-DETR

[CVPR'2022] SAM-DETR & SAM-DETR++: Official PyTorch Implementation

Generalized Box IoU consistently reporting degenerate boxes #1

Closed amorehead closed 2 years ago

amorehead commented 2 years ago

Hello. First off, thank you for making your work's code available here on GitHub. It is well organized and maintained.

My question: I have been applying your SAM-DETR model to my custom object detection dataset for training and validation, and I consistently see the generalized_box_iou utility function raise AssertionErrors saying the model's predicted bounding boxes are degenerate, i.e., the (lx, ly) coordinates are greater than the (rx, ry) coordinates. The check itself makes sense to me; I am just not sure how to resolve the failure. I have also added a check that len(boxes1) > 0 to make sure at least one box was predicted in a batch of images.

https://github.com/ZhangGongjie/SAM-DETR/blob/aaf193689a588a0b069b3cac83a5b429f745f2e2/util/box_ops.py#L44

Would you have any ideas why the model would be predicting degenerate bounding box coordinates from time to time, ending training prematurely?
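
For context, the failing assertion is the standard degenerate-box check at the top of generalized_box_iou, which expects boxes in [x1, y1, x2, y2] format. Below is a minimal sketch of that check together with the length guard mentioned above (paraphrased, not the repository's exact code; the helper name is made up):

import torch

def assert_valid_boxes(boxes: torch.Tensor, name: str = "boxes") -> None:
    # Mirrors the degenerate-box check inside generalized_box_iou:
    # every (rx, ry) must be >= the corresponding (lx, ly).
    if len(boxes) == 0:
        # Extra guard, as described above: no boxes in this batch.
        return
    bad = ~(boxes[:, 2:] >= boxes[:, :2]).all(dim=-1)
    if bad.any():
        raise AssertionError(
            f"{name}: {int(bad.sum())} degenerate box(es), e.g. {boxes[bad][0].tolist()}"
        )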

ZhangGongjie commented 2 years ago

Thanks for your interest in our work!

The generalized_box_iou utility is the standard function used in DETR and Deformable DETR without any modification, so I don't know what's wrong either. May I know whether the assertion fails for boxes1 or boxes2? Also, how often does it happen? Does it happen at the beginning stage of training, or randomly?

FYI, our code supports batches that contain images with zero target boxes.

amorehead commented 2 years ago

@ZhangGongjie, thanks for your quick reply.

The assert is happening for boxes1 in my case, meaning that (I believe) the model's predicted bounding boxes are invalid.

In generalized_box_iou():

assert (boxes1[:, 2:] >= boxes1[:, :2]).all()
AssertionError

I am currently collaborating with others to train the model on our custom object detection dataset and am awaiting a reply from them about when the issue occurs. When testing the model locally on a small portion of our dataset, it appears that the model only raises this exception for certain images. I will continue investigating whether certain images are, in fact, causing the problem (and if so, how to fix it, e.g., by excluding those images).
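
If specific images do turn out to be the trigger, one possible first pass (an assumption on my part, not something confirmed in this thread) is to scan the ground-truth annotations for degenerate or non-finite boxes, since bad labels in a custom dataset can destabilize training. A rough sketch, assuming each dataset item yields (image, target) with target["boxes"] in [x1, y1, x2, y2] format; adapt the unpacking if the dataset differs:

import torch

def scan_dataset_targets(dataset):
    # Collect indices of samples whose ground-truth boxes look suspicious.
    suspicious = []
    for idx in range(len(dataset)):
        _, target = dataset[idx]
        boxes = target["boxes"]
        if len(boxes) == 0:
            continue
        degenerate = ~(boxes[:, 2:] > boxes[:, :2]).all(dim=-1)
        non_finite = ~torch.isfinite(boxes).all(dim=-1)
        if degenerate.any() or non_finite.any():
            suspicious.append(idx)
    return suspicious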

ZhangGongjie commented 2 years ago

@amorehead Thanks for the reply.

You are right that the model's predicted bounding boxes are invalid, but this should not happen: SAM-DETR predicts a set of (xc, yc, w, h) values for each object query, and xc, yc, w, h are all in the range (0, 1) because they are generated through a sigmoid.

I can't give any specific suggestions based on the information available so far, but I suggest checking whether there are NaN or Inf values in the output.

When you have further information, please do let me know. Thank you! : )
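
To act on the NaN/Inf suggestion above, here is a minimal diagnostic sketch (the function and variable names are placeholders, not part of the repository). Note that NaN compares as False against everything, so a single NaN coordinate in the sigmoid output is enough to make the x2 >= x1 assertion fail once the boxes are converted to corner format:

import torch

def report_nonfinite(pred_boxes: torch.Tensor) -> None:
    # pred_boxes: [batch, num_queries, 4] in (xc, yc, w, h),
    # expected to lie in (0, 1) after the sigmoid.
    n_nan = torch.isnan(pred_boxes).sum().item()
    n_inf = torch.isinf(pred_boxes).sum().item()
    n_oob = ((pred_boxes < 0) | (pred_boxes > 1)).sum().item()
    print(f"NaN: {n_nan}  Inf: {n_inf}  outside (0, 1): {n_oob}")

# Hypothetical usage inside the training loop (the output key follows DETR conventions):
# outputs = model(samples)
# report_nonfinite(outputs["pred_boxes"])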

amorehead commented 2 years ago

@ZhangGongjie,

Based on your experience, what ranges for the class_error and scaled_loss metrics would you consider to be reasonable?

[screenshot of training log showing class_error and scaled_loss values]

I ask because I am seeing fairly large values for the scaled_loss metric (i.e., over 10,000), and I am not sure whether such a large value implies a bug in how I'm using the model or whether it is a normal magnitude when training a SAM-DETR model.

ZhangGongjie commented 2 years ago

scaled_loss should never exceed 10.0, even at the beginning of training. Clearly, training is not converging. :( class_error is simply an indicator of classification correctness and does not participate in back-propagation; an error of ~80% indicates that the network is not learning anything.
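
For readers hitting similar numbers: in DETR-style codebases, the scaled loss logged during training is the weighted sum of the individual loss terms, and class_error is 100 minus the top-1 classification accuracy on matched queries. A simplified sketch of the scaling (illustrative, not the repository's exact code):

from typing import Dict
import torch

def compute_scaled_loss(loss_dict: Dict[str, torch.Tensor],
                        weight_dict: Dict[str, float]) -> torch.Tensor:
    # Weighted sum of the individual loss terms (loss_ce, loss_bbox, loss_giou, ...),
    # mirroring how DETR-style training engines assemble the total loss.
    # Per the comment above, this sum should stay well under ~10 in healthy training.
    return sum(loss_dict[k] * weight_dict[k] for k in loss_dict if k in weight_dict)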

ZhangGongjie commented 2 years ago

Could you please check whether SAM-DETR converges on MS-COCO on your server?