UX-Decoder / Segment-Everything-Everywhere-All-At-Once

[NeurIPS 2023] Official implementation of the paper "Segment Everything Everywhere All at Once"
Apache License 2.0

Differences between SEEM Focal-L and the Huggingface Demo model? #33

Open · dchichkov opened this issue 1 year ago

dchichkov commented 1 year ago

The demo outputs a warning:

The current model is run on SEEM Focal-L, for best performance refer to our demo.

And indeed, the performance of the model seems to be worse than that of the SEEM demo on Huggingface. In particular, I've noticed that segmentations from referring text are worse: they "splash" onto neighboring objects. What are the differences between the official demo on Huggingface and the published SEEM Focal-L checkpoint and config?

xiezhang666 commented 1 year ago

I found the same difference between the Focal-L checkpoint and the "best performance, refer to [our demo]" model.

woctezuma commented 1 year ago

Maybe related:

MaureenZOU commented 1 year ago

Our demo uses the DaViT-d5 backbone, which is different from Focal-L.

dchichkov commented 1 year ago

Is there a way to adjust some threshold to reduce the tendency to "splash" onto neighboring objects when segmentation with referring text is used?

When the referring-text mask splashes onto neighboring objects, the usefulness of the model is very limited. In the default segmentation mode, many of the less common objects simply get ignored, so their segments are not available. And if the model is then prompted for a particular object (i.e. we know the object should be there), the resulting segment splashes onto neighboring objects, which again is not useful.
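
To illustrate the kind of threshold I mean, here's a minimal post-processing sketch (my own workaround idea, not anything from the SEEM API): binarize the referring mask's per-pixel probabilities with a stricter cutoff and keep only the largest connected component, so isolated spill onto neighboring objects is dropped. The function name and threshold value are just placeholders.

import numpy as np
from scipy import ndimage

def tighten_mask(prob_map, threshold=0.7):
    # Binarize an HxW probability map with a stricter threshold and keep
    # only the largest connected component, to suppress "splash" onto
    # neighboring objects.
    binary = prob_map >= threshold
    if not binary.any():
        return binary
    labeled, num = ndimage.label(binary)
    # Pixel count per connected component (component labels start at 1).
    sizes = ndimage.sum(binary, labeled, index=range(1, num + 1))
    largest = int(np.argmax(sizes)) + 1
    return labeled == largest

# Dummy probability map standing in for the model's referring-mask output.
prob = np.random.rand(480, 640)
mask = tighten_mask(prob, threshold=0.8)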

And it'd be great to have that checkpoint from the official demo 🐶

MaureenZOU commented 1 year ago

Thanks so much for the comments, could you please provide some example images? Again, for referring segmentation, I highly suggest using X-Decoder instead of SEEM, as SEEM is ONLY trained with COCO.

dchichkov commented 1 year ago

Thank you!

For SEEM/X-Decoder, I see the checkpoint, but I can't find the config in the repository. Should it be something like xdecoder_focall_lang.yaml?
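
Concretely, I'm trying something along these lines (the config path is just my guess, since I couldn't find the file in the repo):

from utils.arguments import load_opt_from_config_files

# Path is a guess at where an X-Decoder Focal-L language config would live.
opt = load_opt_from_config_files(["configs/xdecoder/xdecoder_focall_lang.yaml"])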

Sure, if it helps, here's the image on which I see the segmentation spreading onto nearby pixels: [segmentation screenshot] And the original image is here, with the prompt "forklift": [test image]

LWprogramming commented 11 months ago

> Thanks so much for the comments, could you please provide some example images? Again, for referring segmentation, I highly suggest using X-Decoder instead of SEEM, as SEEM is ONLY trained with COCO.

When I try this, I get this error message:

'GeneralizedXdecoder' object has no attribute 'evaluate_demo'

Following the demo code, here's what I'm doing:

import os

import torch

from modeling.BaseModel import BaseModel
from modeling import build_model
from utils.distributed import init_distributed
from utils.arguments import load_opt_from_config_files
from utils.constants import COCO_PANOPTIC_CLASSES

from demo.seem.tasks import interactive_infer_image

opt = load_opt_from_config_files(["configs/xdecoder/focall_unicl_lang.yaml"]) # xdecoder over seem for referring segmentation: https://github.com/UX-Decoder/Segment-Everything-Everywhere-All-At-Once/issues/33#issuecomment-1544714768
opt = init_distributed(opt)

cur_model = 'Focal-L'
checkpoints_folder = "/path/to/folder"
checkpoint_name = "xdecoder_focall_last.pt"
pretrained_pth = os.path.join(checkpoints_folder, checkpoint_name)

model = BaseModel(opt, build_model(opt)).from_pretrained(pretrained_pth).eval().cuda()
with torch.no_grad():
    model.model.sem_seg_head.predictor.lang_encoder.get_text_embeddings(COCO_PANOPTIC_CLASSES + ["background"], is_eval=True)

audio = None
@torch.no_grad()
def inference(image, task, *args, **kwargs):
    with torch.autocast(device_type='cuda', dtype=torch.float16):
        return interactive_infer_image(model, audio, image, task, *args, **kwargs)

# main_image, clothing_image and clothing_image_mask_3d are loaded elsewhere.
result_image = interactive_infer_image(
    model=model,  # your trained model object
    image={'image': main_image},
    tasks=["Example"],
    refimg={"image": clothing_image, "mask": clothing_image_mask_3d},
    # crashes with: 'GeneralizedXdecoder' object has no attribute 'evaluate_demo'
)