Segment Anything WebUI

This project is based on the Segment Anything Model (SAM) by Meta. The UI is built with Gradio.

Change Logs

[2023-4-11]
- Support video segmentation. A short video can be automatically segmented by SAM.
- Support text-prompt segmentation using the OWL-ViT (Vision Transformer for Open-World Localization) model. Text prompts are not yet released in the current SAM version, so they are implemented indirectly using OWL-ViT.
[2023-4-15]
- Support point-prompt segmentation. Due to this issue, using text and point prompts together may result in an error.
- Box prompts: it does not seem possible to draw a box directly in Gradio. One idea is to use two points to represent the box, but this is neither accurate nor elegant. Also, the text prompt already uses a box prompt indirectly, so I won't implement box prompts directly for now. If you have any ideas about drawing boxes in Gradio, please tell me.
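The two-point idea mentioned above can be sketched as follows. This is only an illustration, not code from this project: `points_to_box` is a hypothetical helper name, and it assumes the two clicks are opposite corners given as `(x, y)` pixel coordinates. The result uses the XYXY layout that `SamPredictor.predict` accepts for its `box` argument.

```python
def points_to_box(p1, p2):
    """Turn two opposite-corner clicks into an XYXY box.

    Hypothetical helper (not part of this project). Each point is an
    (x, y) pair; the clicks may be given in any order.
    """
    (x1, y1), (x2, y2) = p1, p2
    # Sort the coordinates so the box is always (x_min, y_min, x_max, y_max).
    return [min(x1, x2), min(y1, y2), max(x1, x2), max(y1, y2)]


# Example: bottom-right corner clicked first, top-left second.
box = points_to_box((30, 40), (10, 5))
# Wrap the result with numpy.array(box) before passing it as `box=` to SAM.
```

The helper does not address the accuracy concern raised above: the box is only as good as the two clicks.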

The following instructions are for running the demo on your own computer.

pip install git+https://github.com/facebookresearch/segment-anything.git

git clone https://github.com/5663015/segment_anything_webui.git

Make a new folder named checkpoints under this project, and put the downloaded weight files in checkpoints. You can download the weights from the following links:
- vit_h: ViT-H SAM model
- vit_l: ViT-L SAM model
- vit_b: ViT-B SAM model
Under checkpoints, make a new folder named models--google--owlvit-base-patch32, and put the downloaded OWL-ViT weight files in models--google--owlvit-base-patch32.
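The folder layout described above can also be created with a few lines of Python. This is a minimal sketch, not part of the project: `prepare_checkpoint_dirs` is a hypothetical helper, and it assumes you run it from the project root (or pass the root explicitly). It only creates the folders; the weight files still have to be downloaded and copied in by hand.

```python
from pathlib import Path

# Folder name taken from the instructions above.
OWLVIT_DIR = "models--google--owlvit-base-patch32"


def prepare_checkpoint_dirs(project_root="."):
    """Create checkpoints/ and the OWL-ViT subfolder if they are missing.

    Hypothetical helper. Returns both paths so the caller can see where
    the SAM and OWL-ViT weight files belong.
    """
    checkpoints = Path(project_root) / "checkpoints"
    owlvit = checkpoints / OWLVIT_DIR
    # parents=True also creates checkpoints/; exist_ok=True makes reruns safe.
    owlvit.mkdir(parents=True, exist_ok=True)
    return checkpoints, owlvit


if __name__ == "__main__":
    sam_dir, owlvit_dir = prepare_checkpoint_dirs()
    print(f"Put SAM weights in:     {sam_dir}")
    print(f"Put OWL-ViT weights in: {owlvit_dir}")
```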
Run:

python app.py

Note: The default model is vit_b and the default device is cpu, so the demo can run without a GPU.
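For reference, loading SAM with these defaults outside the web UI might look like the sketch below. It is not taken from app.py. It assumes the weight files keep the filenames used in the official segment-anything repository, which may differ from what app.py expects, and `checkpoint_path` and `load_predictor` are hypothetical helper names. `sam_model_registry` and `SamPredictor` come from the segment-anything package installed earlier.

```python
from pathlib import Path

# Filenames as published in the segment-anything repository.
# Assumption: the files in checkpoints/ were not renamed after download.
SAM_CHECKPOINTS = {
    "vit_h": "sam_vit_h_4b8939.pth",
    "vit_l": "sam_vit_l_0b3195.pth",
    "vit_b": "sam_vit_b_01ec64.pth",
}


def checkpoint_path(model_type="vit_b", checkpoint_dir="checkpoints"):
    """Return the expected weight file path for a SAM model type."""
    if model_type not in SAM_CHECKPOINTS:
        raise ValueError(f"unknown model type: {model_type!r}")
    return Path(checkpoint_dir) / SAM_CHECKPOINTS[model_type]


def load_predictor(model_type="vit_b", device="cpu", checkpoint_dir="checkpoints"):
    """Build a SamPredictor for the given model type and device."""
    # Imported here so checkpoint_path() works without the package installed.
    from segment_anything import SamPredictor, sam_model_registry

    weights = checkpoint_path(model_type, checkpoint_dir)
    sam = sam_model_registry[model_type](checkpoint=str(weights))
    sam.to(device=device)
    return SamPredictor(sam)
```

Passing `device="cuda"` and `model_type="vit_h"` should trade speed and memory for the larger model, provided a GPU and the matching weights are available.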

[x] Video segmentation
[x] Add text prompt
[x] Add points prompt
[ ] ~~Add boxes prompt~~
[ ] Try to combine with ControlNet and Stable Diffusion (SD): use SAM to generate a dataset for fine-tuning ControlNet, then generate new images with SD.

Thanks to the wonderful work of Segment Anything and OWL-ViT.
Some of the video processing code is adapted from kadirnar/segment-anything-video, and some of the OWL-ViT code is adapted from ngthanhtin/owlvit_segment_anything.