facebookresearch / segment-anything

The repository provides code for running inference with the Segment Anything Model (SAM), links for downloading the trained model checkpoints, and example notebooks that show how to use the model.
Apache License 2.0

predictor with text prompt using CLIP #6

Open Usama3059 opened 1 year ago

Usama3059 commented 1 year ago

Hi,

I have implemented text prompt-controlled segmentation using selective search and CLIP. Can you suggest any additional techniques I could include? I am considering trying CLIP-GradCAM (#4).

https://github.com/Usama3059/SAMtext/blob/main/notebooks/SAM%20_predictor_with_text-prompt_using_CLIP_v1.ipynb

Text prompt: the car side mirror

CLIP suggestion: sam1

SAM segmentation sam2
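A minimal sketch of that pipeline, for readers who want to follow along without opening the notebook: selective search proposes regions, CLIP scores each crop against the prompt, and the best box becomes a SAM prompt. The checkpoint path, prompt string, and proposal count below are placeholders, not the notebook's actual values.

```python
import cv2                     # requires opencv-contrib-python for ximgproc
import numpy as np
import torch
import clip                    # https://github.com/openai/CLIP
from PIL import Image
from segment_anything import sam_model_registry, SamPredictor

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, preprocess = clip.load("ViT-B/32", device=device)
image = cv2.cvtColor(cv2.imread("car.jpg"), cv2.COLOR_BGR2RGB)

# 1. Region proposals via selective search.
ss = cv2.ximgproc.segmentation.createSelectiveSearchSegmentation()
ss.setBaseImage(image)
ss.switchToSelectiveSearchFast()
boxes = ss.process()[:200]                     # (x, y, w, h) proposals

# 2. Score every proposal crop against the text prompt with CLIP.
text = clip.tokenize(["the car side mirror"]).to(device)
with torch.no_grad():
    text_feat = clip_model.encode_text(text)
    text_feat /= text_feat.norm(dim=-1, keepdim=True)
    scores = []
    for (x, y, w, h) in boxes:
        crop = preprocess(Image.fromarray(image[y:y + h, x:x + w])).unsqueeze(0).to(device)
        img_feat = clip_model.encode_image(crop)
        img_feat /= img_feat.norm(dim=-1, keepdim=True)
        scores.append((img_feat @ text_feat.T).item())
x, y, w, h = boxes[int(np.argmax(scores))]

# 3. Feed the best-matching box to SAM as a prompt.
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth").to(device)
predictor = SamPredictor(sam)
predictor.set_image(image)
masks, _, _ = predictor.predict(box=np.array([x, y, x + w, y + h]), multimask_output=False)
```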

getorca commented 1 year ago

Interesting, I ran your notebook and it picked out the passenger window.

helblazer811 commented 1 year ago

CRIS maybe https://arxiv.org/abs/2111.15174.

I was considering fine-tuning a model on RefCOCO. I was kind of surprised they didn't release the text-to-mask model. I suspect they have follow-up work coming out soon that does this.

maxi-w commented 1 year ago

I built a small demo where SAM first extracts all objects and then CLIP is used to retrieve the best-matching ones: https://github.com/maxi-w/CLIP-SAM

Example for "kiwi": example-segmented
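For anyone skimming, the core of this approach is roughly the following (function and variable names here are illustrative, not taken from the linked repo): generate every mask with SAM, blank out the background of each masked crop, and let CLIP rank the crops against the text query.

```python
import numpy as np
import torch
import clip
from PIL import Image
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator

device = "cuda" if torch.cuda.is_available() else "cpu"
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth").to(device)
mask_generator = SamAutomaticMaskGenerator(sam)
clip_model, preprocess = clip.load("ViT-B/32", device=device)

def best_mask_for_prompt(image, prompt):
    """Return the SAM mask whose (background-blanked) crop CLIP scores highest."""
    masks = mask_generator.generate(image)          # dicts with 'segmentation', 'bbox', ...
    text = clip.tokenize([prompt]).to(device)
    with torch.no_grad():
        text_feat = clip_model.encode_text(text)
        text_feat /= text_feat.norm(dim=-1, keepdim=True)
        best, best_score = None, -1.0
        for m in masks:
            x, y, w, h = (int(v) for v in m["bbox"])           # bbox is XYWH
            crop = image[y:y + h, x:x + w].copy()
            crop[~m["segmentation"][y:y + h, x:x + w]] = 255   # hide background pixels
            feat = clip_model.encode_image(preprocess(Image.fromarray(crop)).unsqueeze(0).to(device))
            feat /= feat.norm(dim=-1, keepdim=True)
            score = (feat @ text_feat.T).item()
            if score > best_score:
                best, best_score = m, score
    return best, best_score
```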

helblazer811 commented 1 year ago

Did you try to merge adjacent segmentation masks into larger objects? I just played around with the web demo and it seems like contiguous objects can sometimes be segmented into many parts. It would be worth testing something like this on a dataset like RefCOCO.
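One simple heuristic for the merging idea above (not something tested in this thread): grow each binary mask slightly and union any pair that then overlaps, repeating until nothing changes.

```python
from scipy.ndimage import binary_dilation

def merge_adjacent_masks(masks, dilation_iters=3):
    """masks: list of HxW boolean numpy arrays. Union any masks that touch or overlap."""
    merged = [m.copy() for m in masks]
    changed = True
    while changed:
        changed = False
        for i in range(len(merged)):
            grown = binary_dilation(merged[i], iterations=dilation_iters)
            for j in range(i + 1, len(merged)):
                if (grown & merged[j]).any():        # masks touch or overlap
                    merged[i] |= merged[j]
                    del merged[j]
                    changed = True
                    break
            if changed:
                break
    return merged
```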

helblazer811 commented 1 year ago

I'm wondering what the memory footprint would be to train a prompt encoder based on CLIP embeddings. It said in the FAQ that they used 256 A100 GPUs for 3-5 days lol!

maxi-w commented 1 year ago

Did you try to merge adjacent segmentation masks into larger objects? I just played around with the web demo and it seems like contiguous objects can sometimes be segmented into many parts. It would be worth testing something like this on a dataset like RefCOCO.

It seems that sometimes it already segments both the individual parts and the whole thing.

Usama3059 commented 1 year ago

I have been working on segmenting small objects (based on patch size) with low-quality image descriptions, such as car blinkers. I used a sliding window technique to extract image patches, which were then passed through CLIP retrieval and CLIP Grad-CAM, followed by point extraction and finally SAM to get the segmentation results.

Text-prompt: car blinkers

Sample Image: s3

gradcam: s1

notebook: https://github.com/Usama3059/SAMtext/blob/main/notebooks/SAM__predictor_with_text_prompt_using_CLIP_v2.ipynb

There are still some pending tasks: I need to integrate Grad-CAM with SAM, test this technique on more images to assess its effectiveness across different scenarios, and add batch processing for dealing with a large number of patches.
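A hedged sketch of the sliding-window step, leaving Grad-CAM out since it isn't integrated yet; the patch/stride values are placeholders and `clip_score` is a hypothetical helper returning a CLIP image-text similarity.

```python
import numpy as np

def best_window_point(image, score_fn, patch=224, stride=112):
    """Slide a fixed-size window over the image, score each crop with score_fn
    (e.g. CLIP similarity to the text prompt) and return the centre of the best
    window as an (x, y) point prompt for SAM."""
    h, w = image.shape[:2]
    best_score, best_xy = -1.0, (w // 2, h // 2)
    for y in range(0, max(h - patch, 1), stride):
        for x in range(0, max(w - patch, 1), stride):
            s = score_fn(image[y:y + patch, x:x + patch])
            if s > best_score:
                best_score, best_xy = s, (x + patch // 2, y + patch // 2)
    return np.array([best_xy])

# With a SamPredictor already set up on the image:
# point = best_window_point(image, lambda crop: clip_score(crop, "car blinkers"))
# masks, _, _ = predictor.predict(point_coords=point, point_labels=np.array([1]),
#                                 multimask_output=True)
```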

helblazer811 commented 1 year ago

I put together some code for testing the model on RefCOCO.

https://github.com/helblazer811/RefSAM

helblazer811 commented 1 year ago

Agreed, it would be way less. I think CLIP alone would not work that well, though. CLIP doesn't do cross-attention between text and image features, so it really struggles to contrast images with different relative spatial orientations of objects. Most SOTA approaches for Referring Image Segmentation do some sort of vision-language feature fusion.

On Fri, Apr 7, 2023 at 7:19 AM Denis Hadjivelichkov wrote:

I'm wondering what the memory footprint would be to train a prompt encoder based on CLIP embeddings. It said in the FAQ that they used 256 A100 GPUs for 3-5 days lol!

I believe this is for the full model. Presumably training only a CLIP-based text prompt encoder to the same latent space as the bboxes/points (in a self-supervised manner, keeping the existing bbox/point encoder fixed) would be much easier because the rest of the model can be fully ignored? Would definitely be better in the long term than checking all found objects, and reverse engineering with CLIP?

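To make the quoted idea a bit more concrete, here is a very rough sketch of what such a text prompt encoder might look like: a small trainable head mapping a 512-d CLIP ViT-B/32 text embedding into SAM's 256-d sparse prompt token space, with all SAM weights frozen. Nothing below is released SAM code; the shapes and the mask_decoder wiring are assumptions to be checked against the repo.

```python
import torch
import torch.nn as nn

class TextPromptHead(nn.Module):
    """Maps a CLIP text embedding to a few pseudo prompt tokens for SAM's mask decoder."""
    def __init__(self, clip_dim=512, sam_dim=256, n_tokens=2):
        super().__init__()
        self.n_tokens, self.sam_dim = n_tokens, sam_dim
        self.proj = nn.Sequential(
            nn.Linear(clip_dim, sam_dim),
            nn.ReLU(),
            nn.Linear(sam_dim, sam_dim * n_tokens),
        )

    def forward(self, text_emb):                      # (B, clip_dim)
        return self.proj(text_emb).view(-1, self.n_tokens, self.sam_dim)

# Decoding with the learned tokens (SAM frozen; predictor.features is the image
# embedding cached by predictor.set_image):
# sparse = head(clip_text_embedding)                              # (1, n_tokens, 256)
# dense = sam.prompt_encoder.no_mask_embed.weight.reshape(1, -1, 1, 1).expand(
#     1, -1, *sam.prompt_encoder.image_embedding_size)            # "no mask" dense embedding
# low_res_masks, iou_pred = sam.mask_decoder(
#     image_embeddings=predictor.features,
#     image_pe=sam.prompt_encoder.get_dense_pe(),
#     sparse_prompt_embeddings=sparse,
#     dense_prompt_embeddings=dense,
#     multimask_output=False,
# )
# Training signal: masks produced by SAM itself from ground-truth boxes/points
# (or a referring dataset like RefCOCO), updating only the head's parameters.
```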

stefanjaspers commented 1 year ago

I built a small demo where SAM first extracts all objects and then CLIP is used to retrieve the best-matching ones: https://github.com/maxi-w/CLIP-SAM

Example for "kiwi": example-segmented

The example code provided seems to work with the vit_h model only; is that correct? I'm getting errors when trying the vit_l model.
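For what it's worth, switching architectures in segment-anything is just a matter of matching the registry key to the corresponding released checkpoint, so if only vit_h works, the mismatch is likely in the checkpoint/key pairing rather than the rest of the demo code:

```python
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator

# The registry key must match the checkpoint file that was downloaded.
sam = sam_model_registry["vit_l"](checkpoint="sam_vit_l_0b3195.pth")
# sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth")
# sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
mask_generator = SamAutomaticMaskGenerator(sam)
```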

YangGangZhiQi commented 1 year ago

Hi guys, could you provide the FAQ link for me? I just could not find it.

helblazer811 commented 1 year ago

https://segment-anything.com/?fbclid=IwAR21Y_KQ-NPW4eU6Wid6Tzp8Z_EuuXzZ7a7O7WhDofJTt-Dd7Lt7qcYym5U

Bottom of this page.

YangGangZhiQi commented 1 year ago

@helblazer811 Thank you very much!

cheerss commented 1 year ago

@Usama3059

You may wish to reference our new work: https://arxiv.org/abs/2212.09506, CLIP is Also an Efficient Segmenter: A Text-Driven Approach for Weakly Supervised Semantic Segmentation

songxinkuan commented 1 year ago

Two examples of CLIP ViT-B/32 text embeddings, reduced from 512 to 256 dimensions using LAION-400M image and text embeddings. The first image uses 'dog' as the text prompt, and the second uses 'cat'.

20230413-102319

20230413-102325
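The comment doesn't say how the 512-to-256 reduction was done; one plausible way (purely an assumption) is to fit a PCA on a sample of LAION-400M CLIP embeddings and apply it to the text embedding. The file name below is hypothetical.

```python
import numpy as np
from sklearn.decomposition import PCA

# (N, 512) CLIP ViT-B/32 embeddings sampled from LAION-400M; hypothetical file name.
laion_feats = np.load("laion400m_clip_vitb32_sample.npy")
pca = PCA(n_components=256).fit(laion_feats)

def reduce_text_embedding(text_feat_512):
    """Project a single 512-d CLIP text embedding down to 256 dims."""
    return pca.transform(np.asarray(text_feat_512).reshape(1, -1))   # (1, 256)
```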

Eli-YiLi commented 1 year ago

I have been working on segmenting small objects (based on patch size) with low-quality image descriptions, such as car blinkers. I used a sliding window technique to extract image patches, which were then passed through CLIP retrieval and CLIP Grad-CAM, followed by point extraction and finally SAM to get the segmentation results.

Text-prompt: car blinkers

Sample Image: s3

gradcam: s1

notebook: https://github.com/Usama3059/SAMtext/blob/main/SAM__predictor_with_text_prompt_using_CLIP_v2.ipynb

There are still some pending tasks: I need to integrate Grad-CAM with SAM, test this technique on more images to assess its effectiveness across different scenarios, and add batch processing for dealing with a large number of patches.

Grad-CAM can only be applied to ResNets, after ignoring self-attention in the middle layers. We propose a solution that enables ViTs and works well at high resolution.

Our work can achieve text-to-mask with SAM: https://github.com/xmed-lab/CLIP_Surgery

This is our work on CLIP's explainability. It can guide SAM to achieve text-to-mask without manual points.

Besides, it's very simple: no fine-tuning, using only the CLIP model itself.

Furthermore, it enhances many open-vocabulary tasks, like segmentation, multi-label classification, and multimodal visualization.

This is the jupyter demo: https://github.com/xmed-lab/CLIP_Surgery/blob/master/demo.ipynb

fig4

This is our heatmap: fig3
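For readers who want to wire this into SamPredictor, a minimal sketch of the heatmap-to-points step (thresholds illustrative; the demo notebook above has the authors' actual code):

```python
import numpy as np

def heatmap_to_points(heatmap, threshold=0.8, max_points=5):
    """heatmap: HxW array. Returns (point_coords, point_labels) for SamPredictor.predict,
    with coords in (x, y) order and labels all set to 1 (foreground)."""
    ys, xs = np.where(heatmap >= threshold * heatmap.max())
    order = np.argsort(heatmap[ys, xs])[::-1][:max_points]   # strongest activations first
    coords = np.stack([xs[order], ys[order]], axis=1)
    return coords, np.ones(len(coords), dtype=int)

# point_coords, point_labels = heatmap_to_points(similarity_map)   # similarity_map: assumed input
# masks, scores, _ = predictor.predict(point_coords=point_coords,
#                                      point_labels=point_labels,
#                                      multimask_output=False)
```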

mydcxiao commented 1 year ago

following

jvpassarelli commented 1 year ago

Thanks for sharing. Inspired me to write a simple example, similarly adding a text prompt to give the segmentation classes. Code here.