dchichkov opened this issue 1 year ago
I found the same difference between the Focal-L checkpoint and the "best performance, refer to [our demo]" results.
Our demo uses a DaViT-d5 backbone, which is different from Focal-L.
Is there a way to adjust a threshold to reduce the tendency of the mask to "splash" onto neighboring objects when segmentation with referring text is used?
When the segmentation mask for a referring-text prompt splashes onto neighboring objects, the usefulness of the model is very limited. In the default segmentation mode, many of the less common objects are simply ignored, so their segments are not available at all. And when the model is then prompted for a particular object (i.e. we know the object should be there), the resulting segment splashes onto the neighboring objects, which again is not useful.
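To make the kind of threshold I have in mind concrete, here is a rough post-processing sketch. It is my own hypothetical snippet, not anything from the repo, and it assumes the model can expose per-pixel probabilities for the referred object (`mask_probs` and the 0.7 cutoff are placeholders):

import numpy as np

# Hypothetical post-processing: keep only pixels whose referring-mask
# probability clears a cutoff, to limit "splash" onto neighboring objects.
# `mask_probs` is assumed to be an HxW float array in [0, 1] taken from the
# model output; 0.7 is only an illustrative value.
def tighten_mask(mask_probs: np.ndarray, threshold: float = 0.7) -> np.ndarray:
    return (mask_probs >= threshold).astype(np.uint8)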
And it'd be great to have that checkpoint from the official demo 🐶
Thanks so much for the comments, could you please provide some example images? Again, for referring segmentation, I highly suggest using X-Decoder instead of SEEM, as SEEM is ONLY trained on COCO.
Thank you!
For SEEM/X-Decoder, I see the checkpoint, but I can't find the config in the repository. Should it be something like xdecoder_focall_lang.yaml?
Sure, if it helps, here's the image on which I see the segmentation spreading onto the nearby pixels. And the original image is here, with the prompt: forklift.
When I try using X-Decoder, I get this error message:
'GeneralizedXdecoder' object has no attribute 'evaluate_demo'
Following the demo code, here's what I'm doing:
import os

import torch

from modeling.BaseModel import BaseModel
from modeling import build_model
from utils.distributed import init_distributed
from utils.arguments import load_opt_from_config_files
from utils.constants import COCO_PANOPTIC_CLASSES
from demo.seem.tasks import interactive_infer_image

# X-Decoder over SEEM for referring segmentation, as suggested in
# https://github.com/UX-Decoder/Segment-Everything-Everywhere-All-At-Once/issues/33#issuecomment-1544714768
opt = load_opt_from_config_files(["configs/xdecoder/focall_unicl_lang.yaml"])
opt = init_distributed(opt)

cur_model = 'Focal-L'
checkpoints_folder = "/path/to/folder"
checkpoint_name = "xdecoder_focall_last.pt"
pretrained_pth = os.path.join(checkpoints_folder, checkpoint_name)

model = BaseModel(opt, build_model(opt)).from_pretrained(pretrained_pth).eval().cuda()
with torch.no_grad():
    model.model.sem_seg_head.predictor.lang_encoder.get_text_embeddings(
        COCO_PANOPTIC_CLASSES + ["background"], is_eval=True)

audio = None

@torch.no_grad()
def inference(image, task, *args, **kwargs):
    with torch.autocast(device_type='cuda', dtype=torch.float16):
        return interactive_infer_image(model, audio, image, task, *args, **kwargs)

# main_image, clothing_image and clothing_image_mask_3d are placeholder inputs
# defined elsewhere in my script.
result_image = interactive_infer_image(
    model=model,  # your trained model object
    image={'image': main_image},
    tasks=["Example"],
    refimg={"image": clothing_image, "mask": clothing_image_mask_3d},
    # crashes
)
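The crash is the same AttributeError as above. A quick check (my own snippet, not code from the repo) suggests the model class built from the X-Decoder config simply does not define the method the SEEM interactive demo helper appears to call:

# Sanity check: look at the class that BaseModel wraps and whether it exposes
# the attribute the SEEM interactive demo path seems to rely on.
inner = model.model
print(type(inner).__name__)             # GeneralizedXdecoder in this setup
print(hasattr(inner, "evaluate_demo"))  # False, matching the AttributeError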
The demo outputs a warning:
And then the performance of the model seems to be worse than the SEEM demo on Hugging Face. In particular, I've noticed that the segmentations with referring text are worse: they "splash" onto the neighboring objects. What are the differences between the official demo on Hugging Face and the published SEEM Focal-L checkpoint and config?
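In case it helps with the comparison, this is roughly how I've been quantifying the "splash": IoU between a mask exported from the Hugging Face demo and one from the local checkpoint. It is a hypothetical helper; the masks are assumed to be same-sized HxW boolean arrays saved by hand from both demos:

import numpy as np

# Hypothetical helper: IoU between two binary masks of the same HxW shape,
# e.g. one saved from the Hugging Face demo and one from the local checkpoint.
def mask_iou(mask_a: np.ndarray, mask_b: np.ndarray) -> float:
    a, b = mask_a.astype(bool), mask_b.astype(bool)
    union = np.logical_or(a, b).sum()
    if union == 0:
        return 1.0  # both masks empty
    return float(np.logical_and(a, b).sum()) / float(union)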