Usama3059 opened 1 year ago
interesting, I ran your notebook and it picked out the passenger window.
CRIS maybe https://arxiv.org/abs/2111.15174.
I was considering fine-tuning a model on RefCOCO. I was kind of surprised they didn't release the text-to-mask model. I suspect they have follow-up work coming out soon that does this.
I built a small demo where SAM first extracts all objects and then CLIP is used to retrieve the best matching ones: https://github.com/maxi-w/CLIP-SAM
Example for "kiwi":
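The SAM-then-CLIP retrieval step in the demo above boils down to ranking the mask crops by cosine similarity against the text embedding. Here is a minimal sketch with synthetic embeddings standing in for real CLIP outputs (`best_matching_masks` is a hypothetical helper, not from the linked repo):

```python
import numpy as np

def best_matching_masks(mask_embeds, text_embed, top_k=1):
    """Rank mask-crop embeddings by cosine similarity to a text embedding."""
    m = mask_embeds / np.linalg.norm(mask_embeds, axis=1, keepdims=True)
    t = text_embed / np.linalg.norm(text_embed)
    scores = m @ t
    order = np.argsort(-scores)[:top_k]
    return order, scores[order]

# Toy check: crop 3's embedding is (almost) the text embedding, so it should win.
rng = np.random.default_rng(0)
crop_embeds = rng.normal(size=(5, 32))
text_embed = crop_embeds[3] + 0.01 * rng.normal(size=32)
idx, sims = best_matching_masks(crop_embeds, text_embed)
print(int(idx[0]))  # 3
```

In the real demo the `crop_embeds` would come from CLIP's image encoder applied to each masked-out crop, and `text_embed` from its text encoder.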
Did you try to merge adjacent segmentation masks into larger objects? I just played around with the web demo and it seems like contiguous objects can sometimes be segmented into many parts. It would be worth testing something like this on a dataset like RefCOCO.
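One way the merging idea could be tried: greedily union masks that share pixels, repeating until stable so chains of parts collapse into one object. A toy sketch (real SAM parts may only touch rather than overlap, so a dilation step before comparing might be needed):

```python
import numpy as np

def merge_touching_masks(masks, min_overlap=1):
    """Union boolean masks that share at least `min_overlap` pixels.

    Repeats until stable, so transitively overlapping fragments
    collapse into a single object mask.
    """
    merged = [m.astype(bool) for m in masks]
    changed = True
    while changed:
        changed = False
        out = []
        for m in merged:
            for i, g in enumerate(out):
                if np.logical_and(m, g).sum() >= min_overlap:
                    out[i] = np.logical_or(m, g)
                    changed = True
                    break
            else:
                out.append(m)
        merged = out
    return merged

# Three fragments: a and b overlap at one pixel, c is separate -> 2 objects.
a = np.zeros((8, 8), bool); a[0:4, 0:4] = True
b = np.zeros((8, 8), bool); b[3:6, 3:6] = True
c = np.zeros((8, 8), bool); c[7:, 7:] = True
print(len(merge_touching_masks([a, b, c])))  # 2
```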
I'm wondering what the memory footprint would be to train a prompt encoder based on CLIP embeddings. It said in the FAQ that they used 256 A100 GPUs for 3-5 days lol!
> Did you try to merge adjacent segmentation masks into larger objects? I just played around with the web demo and it seems like contiguous objects can sometimes be segmented into many parts. It would be worth testing something like this on a dataset like RefCOCO.
It seems that sometimes it already segments both the individual parts and the whole thing
I put together some code for testing the model on RefCOCO.
Agreed it would be way less. I think CLIP alone would not work that well though. CLIP doesn’t do cross attention between text and image features. It really struggles to contrast images with different relative spatial orientations of objects. Most SOTA approaches for Referring Image Segmentation do some sort of Vision Language feature fusion.
> I'm wondering what the memory footprint would be to train a prompt encoder based on CLIP embeddings. It said in the FAQ that they used 256 A100 GPUs for 3-5 days lol!
>
> I believe this is for the full model. Presumably training only a CLIP-based text prompt encoder into the same latent space as the bboxes/points (in a self-supervised manner, keeping the existing bbox/point encoder fixed) would be much easier, because the rest of the model can be fully ignored. That would definitely be better in the long term than checking all found objects and reverse-engineering with CLIP.
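To make the "train only a tiny projector" idea concrete: with synthetic stand-ins for the embeddings, even a plain linear map from CLIP's 512-d text space into a 256-d prompt space can be fit in closed form. A real version would supervise against the frozen box/point encoder's outputs and likely use a small MLP, but the point is that only the projector's parameters are trained:

```python
import numpy as np

# Synthetic stand-ins: 512-d "CLIP text embeddings" and the 256-d prompt
# embeddings a frozen box/point encoder would produce for matching regions.
rng = np.random.default_rng(0)
clip_dim, prompt_dim, n = 512, 256, 1024
X = rng.normal(size=(n, clip_dim))
W_true = rng.normal(size=(clip_dim, prompt_dim)) / np.sqrt(clip_dim)
Y = X @ W_true  # targets from the (frozen) prompt encoder

# Fit only the projector; the rest of SAM is never touched or loaded.
W, *_ = np.linalg.lstsq(X, Y, rcond=None)
err = float(np.abs(X @ W - Y).max())
print(err < 1e-6)  # True
```

The projector here has roughly 512 x 256 ≈ 131k parameters, vanishingly small next to SAM itself, which is why the memory footprint of this approach should be nowhere near 256 A100s.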
> I built a small demo where SAM first extracts all objects and then CLIP is used to retrieve the best matching ones: https://github.com/maxi-w/CLIP-SAM
> Example for "kiwi":
The example code provided seems to work on the vit_h model only, is that correct? I'm getting errors when trying the vit_l model.
Hi guys, could you provide the FAQ link for me? I just could not find it.
@helblazer811 Thank you very much!
@Usama3059 You may wish to reference our new work: https://arxiv.org/abs/2212.09506, CLIP is Also an Efficient Segmenter: A Text-Driven Approach for Weakly Supervised Semantic Segmentation
Two examples of CLIP ViT-B/32 text embeddings, reduced from 512 to 256 dimensions using LAION-400M image and text embeddings. The first image uses 'dog' as the text prompt, and the second uses 'cat'.
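The comment doesn't say how the 512-to-256 reduction was done; one plausible approach is PCA fit on a bank of reference embeddings. A sketch with synthetic data standing in for the LAION-400M samples:

```python
import numpy as np

def fit_pca(embed_bank, k=256):
    """Fit a k-dim PCA projection (mean + top-k components) on an embedding bank."""
    mean = embed_bank.mean(axis=0)
    _, _, vt = np.linalg.svd(embed_bank - mean, full_matrices=False)
    return mean, vt[:k]

rng = np.random.default_rng(1)
bank = rng.normal(size=(1024, 512))   # stand-in for LAION-400M CLIP embeddings
mean, comps = fit_pca(bank, k=256)

text_embeds = rng.normal(size=(4, 512))  # stand-in for CLIP ViT-B/32 text embeddings
reduced = (text_embeds - mean) @ comps.T
print(reduced.shape)  # (4, 256)
```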
I have been working on segmenting small objects (based on patch size) with low-quality image descriptions, such as car blinkers. I used a sliding window technique to extract image patches, which were then run through CLIP retrieval, then CLIP Grad-CAM, then point extraction, and finally SAM to get the segmentation results.
Text prompt: car blinkers
Sample Image:
Grad-CAM:
notebook: https://github.com/Usama3059/SAMtext/blob/main/SAM__predictor_with_text_prompt_using_CLIP_v2.ipynb
There are still some pending tasks: I need to integrate Grad-CAM with SAM, and I plan to test this technique on more images to assess its effectiveness across different scenarios. Lastly, I need batch processing when dealing with a large number of patches.
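The sliding-window patch extraction step can be sketched in a few lines; each yielded crop would then be scored with CLIP against the text prompt (the patch and stride sizes below are arbitrary, not from the notebook):

```python
import numpy as np

def sliding_windows(image, patch, stride):
    """Yield (y, x, crop) for every patch-sized window at the given stride.

    Each crop would be embedded with CLIP's image encoder and ranked
    against the text prompt to find patches likely to contain the object.
    """
    h, w = image.shape[:2]
    for y in range(0, h - patch + 1, stride):
        for x in range(0, w - patch + 1, stride):
            yield y, x, image[y:y + patch, x:x + patch]

img = np.zeros((64, 64, 3), dtype=np.uint8)
coords = [(y, x) for y, x, _ in sliding_windows(img, patch=32, stride=16)]
print(len(coords))  # 9: a 3x3 grid of overlapping 32px windows
```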
Grad-CAM can only be applied to ResNets, after ignoring self-attention in the middle layers. We propose a solution that enables ViTs and works well at high resolution.
Our work can achieve text to mask with SAM: https://github.com/xmed-lab/CLIP_Surgery
This is our work about CLIP's explainability. It's able to guide SAM to achieve text to mask without manual points.
Besides, it's very simple without any fine-tuning, using the CLIP model itself only.
Furthermore, it enhances many open-vocabulary tasks, such as segmentation, multi-label classification, and multimodal visualization.
This is the jupyter demo: https://github.com/xmed-lab/CLIP_Surgery/blob/master/demo.ipynb
This is our heatmap:
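Turning a heatmap like this into SAM point prompts can be as simple as thresholding and keeping the strongest activations. A sketch (the normalization and threshold are assumptions; note SAM's predictor expects points as (x, y), not (row, col)):

```python
import numpy as np

def points_from_heatmap(heat, thresh=0.8, max_points=5):
    """Pick high-activation pixels as (x, y) point prompts for SAM."""
    # Normalize to [0, 1] so the threshold is scale-independent.
    heat = (heat - heat.min()) / (heat.max() - heat.min() + 1e-8)
    ys, xs = np.where(heat >= thresh)
    order = np.argsort(-heat[ys, xs])[:max_points]
    return np.stack([xs[order], ys[order]], axis=1)  # (x, y) pairs

heat = np.zeros((32, 32))
heat[10, 20] = 1.0  # single hot spot at row 10, col 20
pts = points_from_heatmap(heat)
print(pts.tolist())  # [[20, 10]]
```

These points would go into something like `predictor.predict(point_coords=pts, point_labels=np.ones(len(pts)))`, which is the "text to mask without manual points" step.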
following
Thanks for sharing. It inspired me to write a simple example, similarly adding a text prompt to give the segmentation classes. Code here.
Hi,
I have implemented text prompt-controlled segmentation using selective search and CLIP. Can you suggest any additional techniques I could include? I am considering trying CLIP Grad-CAM (#4)
https://github.com/Usama3059/SAMtext/blob/main/notebooks/SAM%20_predictor_with_text-prompt_using_CLIP_v1.ipynb
Text prompt: the car side mirror
CLIP suggestion: ![sam1](https://user-images.githubusercontent.com/43669817/230163776-1798e693-a4e8-4ecc-be51-9a7823a780b1.png)
SAM segmentation: ![sam2](https://user-images.githubusercontent.com/43669817/230163870-d8fbddbb-4be5-434d-a923-58754f482df8.png)
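The selective-search-plus-CLIP pipeline reduces to scoring each proposal crop and handing the winner to SAM as a box prompt. A sketch with a stand-in scoring function (`pick_best_box` and `score_fn` are hypothetical, not from the notebook; in the real pipeline `score_fn` would be CLIP similarity between the crop and the text prompt):

```python
import numpy as np

def pick_best_box(image, boxes, score_fn):
    """Score each proposal crop and return the best box (x0, y0, x1, y1).

    The winning box can then be passed to SAM's predictor as a box prompt.
    """
    best, best_score = None, -np.inf
    for x0, y0, x1, y1 in boxes:
        s = score_fn(image[y0:y1, x0:x1])
        if s > best_score:
            best, best_score = (x0, y0, x1, y1), s
    return best

# Toy check: with mean brightness as the score, the brighter region wins.
img = np.arange(64 * 64, dtype=float).reshape(64, 64)
boxes = [(0, 0, 16, 16), (32, 32, 64, 64)]
best = pick_best_box(img, boxes, score_fn=lambda crop: crop.mean())
print(best)  # (32, 32, 64, 64)
```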