ZhengPeng7 / BiRefNet

[CAAI AIR'24] Bilateral Reference for High-Resolution Dichotomous Image Segmentation
https://www.birefnet.top
MIT License

Great work. When will box-based segmentation inference be released? #36

Closed — YUANMU227 closed this issue 1 week ago

YUANMU227 commented 1 month ago

Box-based guidance is very useful for segmenting specific parts of salient objects. I saw this item in your TODO list; when will this feature be released?

ZhengPeng7 commented 1 month ago

Yeah, I'm trying to find time for it... I may put together a simple version in the next few days.

ZhengPeng7 commented 1 month ago

Hi, @YUANMU227, I made a colab with box guidance for BiRefNet inference; you can try it now. For now, though, the box info has to be put into the `box` variable manually, which is not user-friendly. I'll make a GUI so that the box info can be obtained by drawing, and so that multiple boxes can be processed.
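Roughly, the flow is something like this (a minimal sketch, assuming the BiRefNet weights published on HuggingFace; the colab differs in details):

```python
import torch
from PIL import Image
from torchvision import transforms
from transformers import AutoModelForImageSegmentation

birefnet = AutoModelForImageSegmentation.from_pretrained(
    'ZhengPeng7/BiRefNet', trust_remote_code=True
).eval()

to_tensor = transforms.Compose([
    transforms.Resize((1024, 1024)),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])

def segment_box(image: Image.Image, box: tuple[int, int, int, int]) -> Image.Image:
    """Crop the boxed region, predict its mask, paste it back into a
    full-size canvas at the box position."""
    crop = image.crop(box)
    inp = to_tensor(crop).unsqueeze(0)
    with torch.no_grad():
        pred = birefnet(inp)[-1].sigmoid()[0, 0]  # final-stage prediction
    mask = transforms.functional.to_pil_image((pred * 255).byte()).resize(crop.size)
    full_mask = Image.new('L', image.size, 0)
    full_mask.paste(mask, box[:2])
    return full_mask

# box coordinates are filled in by hand for now
mask = segment_box(Image.open('example.jpg').convert('RGB'), (100, 150, 600, 700))
```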

YUANMU227 commented 1 month ago

Thanks for your work. I manually hooked up an open-set detection model, Grounding DINO, which outputs boxes from input text labels; it accepts almost any common label. This way, boxes can be generated flexibly without drawing them by hand, which lowers the interaction cost. Maybe later I can submit a PR to add this function to your repository and contribute my part.

ZhengPeng7 commented 1 month ago

Sounds like class-agnostic object detection first, then BiRefNet to extract the mask of each proposal? Thanks, that would be really nice!

YUANMU227 commented 1 month ago

The input is a class label, from which the box is detected; BiRefNet then outputs a mask based on the box. Compared with drawing the box by hand, the user only needs to type the class, so the interaction cost is much lower.
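The glue is roughly this (the detector call is a hypothetical placeholder, not a real Grounding DINO API; `segment_box` is the crop-and-predict routine sketched above):

```python
from PIL import Image

def detect_boxes(image: Image.Image, label: str) -> list[tuple[int, int, int, int]]:
    """Placeholder: return (x0, y0, x1, y1) boxes for `label` from an
    open-set detector such as Grounding DINO. Not a real API call."""
    raise NotImplementedError

# one mask per detected instance of the typed class
image = Image.open('example.jpg').convert('RGB')
masks = [segment_box(image, box) for box in detect_boxes(image, 'dog')]
```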

YUANMU227 commented 1 month ago

I used the script you provided to crop the image according to the box, but I found two problems:

  1. If the box fits the object too tightly, foreground and background cannot be distinguished effectively; the outer contour of the object is likely to be treated as background, so only the center of the object is cut out.

  2. The script crops the original image to the box and then resizes the crop to 1024×1024. For small boxes the crop's native resolution is very low, so the network input is blurry and segmentation quality drops.

I think these two problems can be mitigated with two ideas:

  1. Expand the box to include more background content, which helps distinguish foreground from background (see the sketch after this list).

  2. A model trained on low-resolution images might work better in this case. One concrete way: for the latest BiRefNet-general-epoch_244.pth model, crop its training datasets to box-sized patches and train a new model on the processed data; that model may be better suited to box-based segmentation.
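For idea 1, the expansion could be as simple as this (illustrative sketch; the margin ratio is arbitrary):

```python
def expand_box(box, image_size, margin=0.2):
    """Grow (x0, y0, x1, y1) by `margin` of its width/height on each side,
    clamped to the image bounds, so the crop keeps some background context."""
    x0, y0, x1, y1 = box
    w, h = image_size
    dx, dy = (x1 - x0) * margin, (y1 - y0) * margin
    return (max(0, int(x0 - dx)), max(0, int(y0 - dy)),
            min(w, int(x1 + dx)), min(h, int(y1 + dy)))
```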

For the second point, do you have any better ideas? Since I don't have enough A100 resources here, could you provide the model described above, if possible?

ZhengPeng7 commented 1 month ago

Thanks for your efforts. Problem 1: it's really a question of how the target inside the box is defined. I also tried SAM with box prompts, and the same problem exists there, so in my mind it's not a technical problem. Problem 2: yeah, I did it that way since it's the simplest. But padding very small objects instead of only resizing them, i.e., padding plus proportional resizing for all objects, could indeed be better.
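Something like this letterboxing, roughly (an illustrative sketch, not the shipped preprocessing):

```python
from PIL import Image

def letterbox(crop: Image.Image, size: int = 1024) -> Image.Image:
    """Proportionally resize the crop into a size x size canvas with padding,
    never upsampling, so tiny crops are padded rather than stretched."""
    scale = min(size / crop.width, size / crop.height, 1.0)  # cap at 1.0: no upsampling
    resized = crop.resize((max(1, round(crop.width * scale)),
                           max(1, round(crop.height * scale))))
    canvas = Image.new('RGB', (size, size), (0, 0, 0))       # black padding
    canvas.paste(resized, ((size - resized.width) // 2,
                           (size - resized.height) // 2))
    return canvas
```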

Idea-1: sure, that's the easiest way to alleviate the problem. Idea-2: I know the fixed resolution can be a problem, but I don't have sufficient GPU resources for the extra training either, so it probably won't be possible in the near future.

YUANMU227 commented 1 month ago

Thanks for your reply, I'll give it a try.