Open WalBouss opened 1 year ago
Thanks for your interest. Actually, "background" is just another class, and the models can handle it quite well. Even Grounded-SAM performs better when the background class is provided, which is a bit surprising given that Grounding DINO is an object detector and we assign every non-object pixel to the background class.
Depending on the use case, you may want to use "background", or "others" if the area of interest is itself in the foreground or contains clear objects (e.g., in medical use cases). However, if the background can be described more precisely, a specific label works better, e.g., "water and floods" for the WorldFloods dataset.
You could also tune the background prompt by adding objects to the label that are not present in the other classes. E.g., the results for iSAID are better with "background or road or ground or water or ..." instead of just "background". However, we did not include it in MESS because you probably don't want to do such tuning in a real-world setting.
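To make the prompt-tuning idea concrete, here is a minimal sketch of how an expanded background prompt could be folded into per-pixel classification. It assumes a CLIP-style setup where each class label is encoded into a text embedding and pixels are assigned by cosine-similarity argmax; the `encode_text` function below is a hypothetical stand-in (a deterministic pseudo-embedding, not a real text tower), and the class names are just illustrative iSAID-like labels.

```python
import zlib
import numpy as np

def encode_text(prompt: str, dim: int = 64) -> np.ndarray:
    # Hypothetical stand-in for a CLIP-style text encoder: in practice you
    # would encode "a photo of a {label}" with the model's text tower.
    # A stable hash keeps the pseudo-embedding deterministic across runs.
    seed = zlib.crc32(prompt.encode("utf-8"))
    v = np.random.default_rng(seed).normal(size=dim)
    return v / np.linalg.norm(v)

def background_embedding(terms: list[str], dim: int = 64) -> np.ndarray:
    # Expanded background prompt: average the embeddings of its sub-terms,
    # mirroring "background or road or ground or water".
    vecs = np.stack([encode_text(t, dim) for t in terms])
    v = vecs.mean(axis=0)
    return v / np.linalg.norm(v)

def segment(pixel_feats: np.ndarray, class_embs: np.ndarray) -> np.ndarray:
    # Per-pixel argmax over cosine similarity to the class text embeddings.
    sims = pixel_feats @ class_embs.T  # shape (num_pixels, num_classes)
    return sims.argmax(axis=1)         # predicted class index per pixel

# Toy usage: four foreground classes plus an expanded background class.
classes = ["ship", "plane", "vehicle", "harbor"]
embs = [encode_text(c) for c in classes]
embs.append(background_embedding(["background", "road", "ground", "water"]))
class_embs = np.stack(embs)

# Random unit vectors stand in for dense per-pixel image features.
rng = np.random.default_rng(0)
pixels = rng.normal(size=(16, 64))
pixels /= np.linalg.norm(pixels, axis=1, keepdims=True)
labels = segment(pixels, class_embs)
print(labels.shape)
```

The point of the averaging step is that the background embedding gets pulled toward whatever "stuff" classes you list, so non-object pixels have a competitive match without any of those terms being dataset classes themselves.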
Thank you for your fantastic work and effort to evaluate zero-shot open vocabulary segmentation models properly.
I am curious about your approach to handling predictions for the background class. Specifically, I'm interested in how you address predicting the background class, given that you cannot directly input something like "a photo of a background" as a text embedding to probe the model.
Best regards,