ZhengPeng7 / BiRefNet

[CAAI AIR'24] Bilateral Reference for High-Resolution Dichotomous Image Segmentation
https://www.birefnet.top
MIT License

Finetuning for semi-transparent glass background removal #42

Closed. mehamednews closed this issue 1 month ago.

mehamednews commented 1 month ago

Hi! First, I'd like to thank you for your amazing work (& the amount of effort you're putting into answering questions). I'd appreciate your insights & suggestions regarding the possibility of fine-tuning one of the pre-trained models (not sure which one would be best here) on images of cars. My main goal is to remove what's behind the windows of cars. I think an example will explain this better: [images: 100307, 100307-mask]

I generated ~50k images (800x600) with their respective masks. Would this be too much?

ZhengPeng7 commented 1 month ago

Do you mean you want to extract the target car with a mask that has a transparency (alpha) channel?

Regarding the dataset, ~50k images are enough for training in my experience, even from scratch. Training from scratch could perhaps be even better (I'm not sure): 1) ~50k is already a large number, provided the mask labels are valid and sound; 2) images were resized to 1024x1024 in my experiments, so there might be a gap with your 800x600 images.
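
For reference, here is a minimal preprocessing sketch for that resizing step, assuming a simple paired directory layout (the paths and file extensions are hypothetical placeholders). Note that the mask should be resized with an interpolation that preserves soft gray values:

```python
# Minimal preprocessing sketch: resize 800x600 image/mask pairs to the
# 1024x1024 input size used in the BiRefNet experiments.
# Directory layout and extensions are hypothetical placeholders.
from pathlib import Path
from PIL import Image

SRC_IMG, SRC_MSK = Path("data/images"), Path("data/masks")
DST_IMG, DST_MSK = Path("data_1024/images"), Path("data_1024/masks")
DST_IMG.mkdir(parents=True, exist_ok=True)
DST_MSK.mkdir(parents=True, exist_ok=True)

for img_path in sorted(SRC_IMG.glob("*.jpg")):
    msk_path = SRC_MSK / (img_path.stem + ".png")
    # Bicubic for the RGB image; bilinear for the mask so soft (gray)
    # transparency values survive instead of snapping to 0/255.
    img = Image.open(img_path).convert("RGB").resize((1024, 1024), Image.BICUBIC)
    msk = Image.open(msk_path).convert("L").resize((1024, 1024), Image.BILINEAR)
    img.save(DST_IMG / img_path.name)
    msk.save(DST_MSK / msk_path.name)
```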

But anyway, in the end only experiments can tell. Let me know if you run into any problems with it.

mehamednews commented 1 month ago

Thanks for the quick answer. Yes, I checked the DIS dataset and noticed it has solid masks (either 100% white or 100% black). What I'm trying to achieve is to get gray values in the areas corresponding to windows, where the environment peeks through (not sure how to explain it :smile:)

current result: [image]

target result: [image]

I'm going to create a dataset with images 1024px wide (still 4:3) and test with it.
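
To illustrate what those gray values buy you: a soft mask acts as a per-pixel alpha channel when compositing the cutout onto a new background, which a binary mask cannot do. A minimal sketch with NumPy and Pillow (file names are hypothetical, and bg.png is assumed to match car.png in size):

```python
# Sketch: a soft mask acts as per-pixel alpha, so gray window pixels let the
# new background show through; a binary {0, 255} mask cannot do this.
# File names are hypothetical; bg.png is assumed to match car.png in size.
import numpy as np
from PIL import Image

car = np.asarray(Image.open("car.png").convert("RGB"), dtype=np.float32)
bg = np.asarray(Image.open("bg.png").convert("RGB"), dtype=np.float32)
alpha = np.asarray(Image.open("car_mask.png").convert("L"), dtype=np.float32) / 255.0
alpha = alpha[..., None]  # (H, W, 1) so it broadcasts over the RGB channels

out = alpha * car + (1.0 - alpha) * bg  # standard alpha compositing
Image.fromarray(out.astype(np.uint8)).save("composited.png")
```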

ZhengPeng7 commented 1 month ago

Thanks for your explanation. I understand your need here; the data is similar to that in matting tasks. For example, see the GT example below, from the portrait matting dataset P3M-10k -- check the pixel values in the hair regions for an easy confirmation.
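
A quick way to run that check is to inspect the gray-level histogram of a GT mask; a small sketch (the mask path is a hypothetical placeholder):

```python
# Quick check that a GT mask is matting-style (soft) rather than binary.
# The mask path is a hypothetical placeholder; any P3M-10k GT works.
import numpy as np
from PIL import Image

gt = np.asarray(Image.open("p3m10k_gt_example.png").convert("L"))
levels = np.unique(gt)
soft = ((levels > 0) & (levels < 255)).sum()
print(f"{levels.size} distinct gray levels, {soft} strictly between 0 and 255")
# A binary DIS5K-style mask would report only the two levels 0 and 255.
```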

One more important thing: although GT labels in datasets like DIS5K are in {0, 1}, the predicted maps are always float numbers in the range (0, 1); some techniques were even proposed to push these values toward more confident ones, i.e., toward 0 or 1 instead of 0.5. My point is that the output behavior depends only on the datasets provided for training. And as I said before, there might be a domain gap between your custom data and DIS5K or the other data we used. Therefore, if you have insufficient GPUs, you can fine-tune the provided general-use weights. If possible, I recommend training from scratch to examine the accuracy.
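
As a rough picture of what that fine-tuning could look like: the sketch below loads pretrained weights and regresses the prediction toward soft targets with plain BCE. The checkpoint filename, the plain BCE loss, the assumption that the model's last output is the final map, and train_loader (a DataLoader yielding 1024x1024 image/soft-mask tensor pairs) are all illustrative assumptions, not the repo's exact training API:

```python
# Rough fine-tuning sketch (PyTorch). Checkpoint filename, the plain BCE
# loss, and the structure of the model output are illustrative assumptions,
# not the repo's exact training code.
import torch
import torch.nn as nn
from models.birefnet import BiRefNet  # model class from this repo

model = BiRefNet().cuda()
state = torch.load("BiRefNet-general.pth", map_location="cuda")  # hypothetical filename
model.load_state_dict(state)
model.train()

# Soft GT in [0, 1] works directly with BCE: the network is simply pushed
# toward gray values for the semi-transparent window pixels.
criterion = nn.BCELoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)  # small LR for fine-tuning

for images, soft_masks in train_loader:  # assumed DataLoader of 1024x1024 pairs
    preds = model(images.cuda())[-1].sigmoid()  # assume the last output is the final map
    loss = criterion(preds, soft_masks.cuda())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```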

[image: p_0b5f94f1, the P3M-10k GT example referenced above]