facebookresearch / segment-anything

The repository provides code for running inference with the Segment Anything Model (SAM), links for downloading the trained model checkpoints, and example notebooks that show how to use the model.
Apache License 2.0

Text as prompts #93

Open peiwang062 opened 1 year ago

peiwang062 commented 1 year ago

Thanks for releasing this wonderful work! I saw that the demo shows examples of using points and boxes as input prompts. Does the demo support text as a prompt?

stefanjaspers commented 1 year ago

Following! Text prompting has been mentioned in the research paper but hasn't been released yet. Really looking forward to this feature because I need it for a specific use case.

darvilabtech commented 1 year ago

Exactly, waiting for it to be released.

HaoZhang990127 commented 1 year ago

Thank you for your exciting work!

I also want to use text as a prompt to generate masks in my project. Right now I am using CLIPSeg to generate the masks, but it does not perform well on fine-grained semantics.

When do you plan to open-source the text-prompt code? What is the approximate timeline? Waiting for this amazing work.

jy00161yang commented 1 year ago

following

eware-godaddy commented 1 year ago

following

0xbitches commented 1 year ago

The paper mentioned they used CLIP to handle text prompts:

> We represent points and boxes by positional encodings [95] summed with learned embeddings for each prompt type and free-form text with an off-the-shelf text encoder from CLIP [82].

The demo does not appear to allow textual inputs, though.
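For anyone wondering what that description amounts to in code: the released checkpoints and `SamPredictor` API expose no text input, so the snippet below is only a conceptual sketch of the paper's idea, using the public OpenAI CLIP package. The projection layer is a hypothetical, untrained stand-in, not part of the official model.

```python
# Conceptual sketch only: the public segment-anything release has no text path.
# This just illustrates the paper's description: encode free-form text with an
# off-the-shelf CLIP text encoder and map it into the same space as the other
# prompt tokens. The projection layer here is hypothetical and untrained.
import torch
import clip  # pip install git+https://github.com/openai/CLIP.git

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, _ = clip.load("ViT-L/14", device=device)

tokens = clip.tokenize(["a photo of a dog"]).to(device)
with torch.no_grad():
    text_emb = clip_model.encode_text(tokens).float()   # (1, 768)

# The released SAM prompt encoder uses 256-dim prompt tokens; a trained model
# would learn this projection jointly with the mask decoder.
project = torch.nn.Linear(text_emb.shape[-1], 256).to(device)
sparse_text_token = project(text_emb).unsqueeze(1)       # (1, 1, 256)
```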

darvilabtech commented 1 year ago

@peiwang062 @stefanjaspers @HaoZhang990127 @eware-godaddy @jy00161yang https://github.com/IDEA-Research/Grounded-Segment-Anything does what we are all looking for.

peiwang062 commented 1 year ago

> @peiwang062 @stefanjaspers @HaoZhang990127 @eware-godaddy @jy00161yang https://github.com/IDEA-Research/Grounded-Segment-Anything does what we are all looking for.

Yes, we could simply combine these two, but if SAM can do it better, why do we need two models? We don't know whether Grounding DINO becomes the bottleneck if we just feed its output to SAM.
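For concreteness, the two-stage combination under discussion looks roughly like the sketch below: a text-conditioned detector turns the prompt into boxes, and SAM turns the boxes into masks. The detector call is left as a placeholder for GroundingDINO (or any other text-to-box model); only the SAM side uses the actual segment-anything API, and the checkpoint path, image, and prompt are illustrative.

```python
import cv2
import numpy as np
import torch
from segment_anything import sam_model_registry, SamPredictor

def text_to_boxes(image_rgb, text_prompt):
    """Placeholder for a text->box detector such as GroundingDINO.
    Expected to return an (N, 4) array of [x0, y0, x1, y1] pixel boxes."""
    raise NotImplementedError

device = "cuda" if torch.cuda.is_available() else "cpu"
# Hypothetical paths; use whichever checkpoint/image you have locally.
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth").to(device)
predictor = SamPredictor(sam)

image = cv2.cvtColor(cv2.imread("example.jpg"), cv2.COLOR_BGR2RGB)
predictor.set_image(image)

for box in text_to_boxes(image, "a dog"):
    # One low-ambiguity mask per text-grounded box.
    masks, scores, _ = predictor.predict(box=box[None, :], multimask_output=False)
```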

alexw994 commented 1 year ago

> @peiwang062 @stefanjaspers @HaoZhang990127 @eware-godaddy @jy00161yang https://github.com/IDEA-Research/Grounded-Segment-Anything does what we are all looking for.

Why not use the output of SAM as the bounding box?

narbhar commented 1 year ago

> @peiwang062 @stefanjaspers @HaoZhang990127 @eware-godaddy @jy00161yang https://github.com/IDEA-Research/Grounded-Segment-Anything does what we are all looking for.
>
> Why not use the output of SAM as the bounding box?

The current version of SAM, without the CLIP text encoder, only produces instances from image points or bounding boxes as prompts. Thus SAM's output is just instances, with no semantic information attached to the segmentation. With a text encoder you could correlate SAM's output with text, such as an object of interest in an image. If you have a promptable, text-based object detector you can bridge this link for the time being, so SAM's output is no longer generic instance segmentation; that is what Grounded Segment Anything helps to do.
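One way to attach the missing semantics after the fact, sketched under the assumption that ranking masked crops with CLIP is good enough for the use case: generate class-agnostic masks with `SamAutomaticMaskGenerator`, then score each masked crop against the text with CLIP. The checkpoint, image, prompt, and cropping strategy are illustrative, not tuned.

```python
import numpy as np
import torch
import clip
from PIL import Image
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator

device = "cuda" if torch.cuda.is_available() else "cpu"
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth").to(device)
mask_generator = SamAutomaticMaskGenerator(sam)
clip_model, preprocess = clip.load("ViT-B/32", device=device)

image = np.array(Image.open("example.jpg").convert("RGB"))
masks = mask_generator.generate(image)  # list of dicts with 'segmentation', 'bbox', ...

text = clip.tokenize(["a dog"]).to(device)
with torch.no_grad():
    text_feat = clip_model.encode_text(text)
    text_feat /= text_feat.norm(dim=-1, keepdim=True)

scores = []
for m in masks:
    x, y, w, h = m["bbox"]                     # box around the mask, XYWH in pixels
    crop = image[y:y + h, x:x + w].copy()
    crop[~m["segmentation"][y:y + h, x:x + w]] = 0   # blank out background pixels
    crop_in = preprocess(Image.fromarray(crop)).unsqueeze(0).to(device)
    with torch.no_grad():
        img_feat = clip_model.encode_image(crop_in)
        img_feat /= img_feat.norm(dim=-1, keepdim=True)
    scores.append((img_feat @ text_feat.T).item())

best_mask = masks[int(np.argmax(scores))]["segmentation"]  # mask best matching the text
```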

luca-medeiros commented 1 year ago

Put together a demo of grounded-segment-anything with Gradio for better testing. I tested using CLIP, OpenCLIP, and GroundingDINO. GroundingDINO performs much better. Less than 1 second on an A100 for DINO+SAM. Maybe I'll add the CLIP versions as well: https://github.com/luca-medeiros/lang-segment-anything
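If the README example from around that time still applies (the API may have changed since), usage looks roughly like this; the image path and prompt are placeholders:

```python
from PIL import Image
from lang_sam import LangSAM  # installed from the lang-segment-anything repo

model = LangSAM()  # loads GroundingDINO + SAM under the hood
image_pil = Image.open("example.jpg").convert("RGB")
masks, boxes, phrases, logits = model.predict(image_pil, "a dog")
```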

alexw994 commented 1 year ago

> @peiwang062 @stefanjaspers @HaoZhang990127 @eware-godaddy @jy00161yang https://github.com/IDEA-Research/Grounded-Segment-Anything does what we are all looking for.
>
> Why not use the output of SAM as the bounding box?
>
> The current version of SAM, without the CLIP text encoder, only produces instances from image points or bounding boxes as prompts. Thus SAM's output is just instances, with no semantic information attached to the segmentation. With a text encoder you could correlate SAM's output with text, such as an object of interest in an image. If you have a promptable, text-based object detector you can bridge this link for the time being, so SAM's output is no longer generic instance segmentation; that is what Grounded Segment Anything helps to do.

If SAM just segments inside the bounding box, I think many other methods could be used for this as well, like BoxInstSeg: https://github.com/LiWentomng/BoxInstSeg

peiwang062 commented 1 year ago

> @peiwang062 @stefanjaspers @HaoZhang990127 @eware-godaddy @jy00161yang https://github.com/IDEA-Research/Grounded-Segment-Anything does what we are all looking for.
>
> Why not use the output of SAM as the bounding box?
>
> The current version of SAM, without the CLIP text encoder, only produces instances from image points or bounding boxes as prompts. Thus SAM's output is just instances, with no semantic information attached to the segmentation. With a text encoder you could correlate SAM's output with text, such as an object of interest in an image. If you have a promptable, text-based object detector you can bridge this link for the time being, so SAM's output is no longer generic instance segmentation; that is what Grounded Segment Anything helps to do.
>
> If SAM just segments inside the bounding box, I think many other methods could be used for this as well, like BoxInstSeg: https://github.com/LiWentomng/BoxInstSeg

It should be able to support boxes, points, masks, and text as prompts, as the paper mentions, no?

nikolausWest commented 1 year ago

following

yash0307 commented 1 year ago

following

9p15p commented 1 year ago

Following

fyuf commented 1 year ago

Following

Zhangwenyao1 commented 1 year ago

following

Eli-YiLi commented 1 year ago

Our work can achieve text-to-mask with SAM: https://github.com/xmed-lab/CLIP_Surgery

This is our work on CLIP's explainability. It can guide SAM to achieve text-to-mask without manual points.

Besides, it is very simple: there is no fine-tuning at all, and only the CLIP model itself is used.

Furthermore, it enhances many open-vocabulary tasks, such as segmentation, multi-label classification, and multimodal visualization.

Here is the Jupyter demo: https://github.com/xmed-lab/CLIP_Surgery/blob/master/demo.ipynb
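The rough shape of that pipeline, with CLIP Surgery's text-to-point step left as a placeholder (its repo provides the actual similarity-map utilities) and only the SAM calls using the real segment-anything API; checkpoint, image, and prompt are illustrative:

```python
import numpy as np
import torch
from PIL import Image
from segment_anything import sam_model_registry, SamPredictor

def text_to_points(image_rgb, text):
    """Placeholder for CLIP Surgery's text-to-point step: it builds a CLIP
    similarity map for the text and picks high/low-response pixels as
    foreground/background points. Should return (coords (N, 2), labels (N,))."""
    raise NotImplementedError

device = "cuda" if torch.cuda.is_available() else "cpu"
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth").to(device)
predictor = SamPredictor(sam)

image_rgb = np.array(Image.open("example.jpg").convert("RGB"))
predictor.set_image(image_rgb)

coords, labels = text_to_points(image_rgb, "a dog")  # labels: 1 = foreground, 0 = background
masks, scores, _ = predictor.predict(
    point_coords=coords.astype(np.float32),
    point_labels=labels,
    multimask_output=True,
)
```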


zaojiahua commented 1 year ago

following

FrancisDacian commented 1 year ago

You can try it out using this browser extension: https://chrome.google.com/webstore/detail/text-prompts-for-segment/jndfmkiclniflknfifngodjnmlibhjdo/related

ignoHH commented 1 year ago

following

bjccdsrlcr commented 1 year ago

following

mydcxiao commented 1 year ago

following

xuxiaoxxxx commented 1 year ago

+1

daminnock commented 1 year ago

following

Alice1820 commented 1 year ago

following

freshman97 commented 1 year ago

following

zhangjingxian1998 commented 1 year ago

waiting for it

N-one commented 1 year ago

following

moktsuiqin commented 1 year ago

following