facebookresearch / segment-anything

The repository provides code for running inference with the Segment Anything Model (SAM), links for downloading the trained model checkpoints, and example notebooks that show how to use the model.
Apache License 2.0

Text prompt? #4

Closed · asetsuna closed this 1 year ago

asetsuna commented 1 year ago

Amazing work! However, I didn't find text prompt support. Is there any plan to release it?

ORippler commented 1 year ago

From their FAQ.

[screenshot of the relevant FAQ answer omitted]

ericmintun commented 1 year ago

This is correct, the ability to take text prompts as input is not currently released.

LuoYingzhao commented 1 year ago

> This is correct, the ability to take text prompts as input is not currently released.

So, is there a plan to release the text prompt ability?

Rocsg commented 1 year ago

The paper states that text can (theoretically) be used as a prompt, and briefly describes the procedure (page 22). It seems to involve retraining SAM on (image, text) pairs whose embeddings are computed by a CLIP model, then doing inference with SAM by using the CLIP text encoder to create prompts directly, since CLIP aligns text embeddings with image embeddings. I'm not sure there is an easy way to unlock this functionality, especially if it involves retraining (I guess the provided .pth checkpoints do not include the CLIP training).
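
To make that procedure concrete, here is a minimal, hypothetical sketch. Nothing below is a released feature: the projection layer `proj` is an untrained placeholder standing in for the learned text-to-prompt alignment, and the dummy image tensor stands in for SAM's real preprocessing. The SAM calls (`image_encoder`, `prompt_encoder`, `mask_decoder`) are the ones in this repo; everything else is an assumption.

```python
# Hypothetical sketch only: the public checkpoints contain no weights for
# mapping CLIP text embeddings into SAM's prompt-token space.
import torch
import clip
from segment_anything import sam_model_registry

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
clip_model, _ = clip.load("ViT-L/14", device="cpu")

# CLIP ViT-L/14 text embeddings are 768-d; SAM prompt tokens are 256-d.
proj = torch.nn.Linear(768, 256)  # untrained placeholder projection

with torch.no_grad():
    text_emb = clip_model.encode_text(clip.tokenize(["a dog"])).float()  # (1, 768)
    sparse = proj(text_emb).unsqueeze(1)  # (1, 1, 256): one prompt token

    # Stand-in for a real image after SAM's resize-to-1024 + normalization.
    preprocessed_image = torch.zeros(1, 3, 1024, 1024)
    image_emb = sam.image_encoder(preprocessed_image)  # (1, 256, 64, 64)

    # Empty prompt, used only to obtain the dense "no mask" embedding.
    _, dense = sam.prompt_encoder(points=None, boxes=None, masks=None)

    low_res_masks, iou_pred = sam.mask_decoder(
        image_embeddings=image_emb,
        image_pe=sam.prompt_encoder.get_dense_pe(),
        sparse_prompt_embeddings=sparse,
        dense_prompt_embeddings=dense,
        multimask_output=False,
    )
```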

botcs commented 1 year ago

@Rocsg I can imagine that, using only a few examples and a linear projection layer on both spaces, one could test whether the SAM feature space is aligned with the CLIP feature space. If there is no alignment, the matching error would still be high.
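
A rough sketch of such a probe, assuming paired per-image feature vectors from both models have already been pooled into `(N, d)` matrices (how to pool them is itself a design choice not specified here):

```python
# Sketch of the alignment probe suggested above: fit a linear map from
# SAM features to CLIP features on half of the pairs and measure the
# held-out residual. A high residual suggests the spaces are not
# (linearly) aligned. Assumes N/2 >= d_sam so the fit is well-posed.
import torch

def alignment_error(sam_feats: torch.Tensor, clip_feats: torch.Tensor) -> float:
    """sam_feats: (N, d_sam), clip_feats: (N, d_clip), paired row-wise."""
    split = sam_feats.shape[0] // 2
    # Closed-form least squares: find W with sam_feats[:split] @ W close
    # to clip_feats[:split].
    W = torch.linalg.lstsq(sam_feats[:split], clip_feats[:split]).solution
    pred = sam_feats[split:] @ W  # apply the fitted map to held-out pairs
    return torch.nn.functional.mse_loss(pred, clip_feats[split:]).item()
```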

Eli-YiLi commented 1 year ago

We believe directly using text features from CLIP is not a good approach, because CLIP's explainability is poor: its text features often match semantically opposite regions, which leads to wrong results.

This is our work about CLIP's explainability: https://github.com/xmed-lab/CLIP_Surgery

We can also see that CLIP's self-attention links irrelevant regions, with serious noisy activations across labels. [figures omitted]

We suggest using the corrected heatmap to generate points that replace the manually input points (a sketch follows below). Here are our similarity map from the raw prediction of CLIP and the resulting masks on SAM. [figures omitted]

Besides, it's very simple: just use the original CLIP without any fine-tuning or extra supervision. It's also an alternative to the text->box->mask approach, and it requires the least training and supervision cost.
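
A minimal sketch of the heatmap-to-points step. It assumes `heatmap` is an `(H, W)` similarity map (e.g. from CLIP Surgery) already resized to the image resolution, and `image` is an `(H, W, 3)` uint8 RGB array; producing the corrected heatmap itself is out of scope here. The `SamPredictor` calls are the ones in this repo.

```python
# Sketch only: turn the top-k pixels of a CLIP-derived similarity map
# into foreground point prompts for SamPredictor.
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

def points_from_heatmap(heatmap: np.ndarray, k: int = 5):
    """Pick the k highest-scoring pixels of an (H, W) map as (x, y) points."""
    idx = np.argpartition(heatmap.ravel(), -k)[-k:]
    ys, xs = np.unravel_index(idx, heatmap.shape)
    coords = np.stack([xs, ys], axis=1).astype(np.float32)  # (k, 2) in (x, y)
    labels = np.ones(k, dtype=np.int32)                     # 1 = foreground
    return coords, labels

# `image` and `heatmap` are assumed inputs (see above).
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)
predictor.set_image(image)
coords, labels = points_from_heatmap(heatmap)
masks, scores, _ = predictor.predict(
    point_coords=coords, point_labels=labels, multimask_output=True
)
```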

jmiemirza commented 1 year ago

https://github.com/luca-medeiros/lang-segment-anything

This project looks interesting.
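
For reference, usage per that project's README looked roughly like the snippet below at the time; this is recalled from the README, not verified, and the API may have changed since.

```python
# Approximate lang-segment-anything usage, from memory of its README.
from PIL import Image
from lang_sam import LangSAM

model = LangSAM()
image_pil = Image.open("car.jpeg").convert("RGB")
masks, boxes, phrases, logits = model.predict(image_pil, "wheel")
```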

jashvira commented 1 year ago

Maybe they were waiting for DINOv2?

yassineAlouini commented 1 year ago

It looks like the Grounded-SAM project implements this => https://github.com/IDEA-Research/Grounded-Segment-Anything.

As pointed out by @jashvira, it is based on Grounding-DINO, which itself builds on DINO.
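
A hedged sketch of that text->box->mask pipeline. Here `detect_boxes` is a placeholder for a text-conditioned detector such as Grounding-DINO (whose actual API differs); only the SAM box-prompt calls below are the ones from this repo.

```python
# Sketch of text -> box -> mask: a detector turns the text prompt into
# boxes, then SAM turns each box into a mask.
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

def text_to_masks(image: np.ndarray, prompt: str, detect_boxes):
    """image: (H, W, 3) uint8 RGB. `detect_boxes(image, prompt)` is a
    placeholder returning (N, 4) boxes in XYXY pixel coordinates."""
    sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
    predictor = SamPredictor(sam)
    predictor.set_image(image)
    masks_out = []
    for box in detect_boxes(image, prompt):
        masks, scores, _ = predictor.predict(
            box=np.asarray(box, dtype=np.float32),
            multimask_output=False,  # one mask per detected box
        )
        masks_out.append(masks[0])
    return masks_out
```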