[SAM2 Paper] [Grounding DINO Paper] [Grounding DINO 1.5 Paper] [BibTeX]
🔥 Project Highlight
Grounded SAM 2 is a foundation model pipeline for grounding and tracking anything in videos with Grounding DINO, Grounding DINO 1.5, Florence-2 and SAM 2.
In this repo, we support the following demos with simple implementations:
Grounded SAM 2 does not introduce significant methodological changes compared to Grounded SAM: Assembling Open-World Models for Diverse Visual Tasks. Both approaches leverage the capabilities of open-world models to address complex visual tasks. Consequently, we try to simplify the code implementation in this repository, aiming to enhance user convenience.
- 2024/10/24: Support SAHI (Slicing Aided Hyper Inference) on Grounded SAM 2 (with Grounding DINO 1.5), which may be helpful for inference on high-resolution images with dense small objects (e.g. 4K images).
- 2024/10/10: Support SAM 2.1 models. To use a SAM 2.1 model, update to the latest code and reinstall SAM 2 following SAM 2.1 Installation.
- 2024/08/31: Support dumping JSON results in Grounded SAM 2 Image Demos (with Grounding DINO).
- 2024/08/20: Support Florence-2 SAM 2 Image Demo, which includes dense region caption, object detection, phrase grounding, and the cascaded auto-label pipeline caption + phrase grounding.
- 2024/08/09: Support Ground and Track New Object throughout the whole video. This feature is still under development. Credits to Shuo Shen.
- 2024/08/07: Support Custom Video Inputs. Users need only submit a video file (e.g. an .mp4 file) with specific text prompts to get a demo video.

Download the pretrained SAM 2 checkpoints:
cd checkpoints
bash download_ckpts.sh
Download the pretrained Grounding DINO checkpoints:
cd gdino_checkpoints
bash download_ckpts.sh
Install the PyTorch environment first. We use python=3.10, as well as torch>=2.3.1, torchvision>=0.18.1 and cuda-12.1 in our environment to run this demo. Please follow the instructions here to install both PyTorch and TorchVision dependencies. Installing both PyTorch and TorchVision with CUDA support is strongly recommended. You can easily install the latest version of PyTorch as follows:
pip3 install torch torchvision torchaudio
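To confirm that an existing environment meets the minimums listed above, a check along these lines may help. This is only a sketch: `parse_version`, `meets_minimum` and `report` are hypothetical helper names, and the comparison looks only at the leading numeric components of a version string, so a local build tag such as `+cu121` is ignored.

```python
import re
import sys
from importlib import metadata


def parse_version(version):
    """Return the leading numeric components, e.g. '2.3.1+cu121' -> (2, 3, 1)."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"no numeric version found in {version!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def meets_minimum(installed, minimum):
    """True if the installed version is at least the minimum version."""
    a, b = parse_version(installed), parse_version(minimum)
    width = max(len(a), len(b))
    # Pad with zeros so that '2.3' and '2.3.0' compare as equal.
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def report(minimums):
    """Map each package to True/False, or None if it is not installed."""
    status = {}
    for name, minimum in minimums.items():
        try:
            status[name] = meets_minimum(metadata.version(name), minimum)
        except metadata.PackageNotFoundError:
            status[name] = None
    return status


if __name__ == "__main__":
    print("python:", ".".join(map(str, sys.version_info[:3])))
    # Minimum versions taken from the paragraph above.
    print(report({"torch": "2.3.1", "torchvision": "0.18.1"}))
```

The check reads installed package metadata rather than importing `torch`, so it stays fast and also works when the packages are missing.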
Since the Deformable Attention operator used in Grounding DINO must be compiled with CUDA, check that the CUDA environment variables are set correctly (refer to Grounding DINO Installation for more details). If you want to build a local GPU environment for Grounding DINO to run Grounded SAM 2, you can set the environment variable manually as follows:
export CUDA_HOME=/path/to/cuda-12.1/
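Before building, it can be useful to verify that `CUDA_HOME` actually points at a toolkit containing `nvcc`. The following is a minimal sketch, assuming a Linux-style layout where the compiler lives at `$CUDA_HOME/bin/nvcc`; the function name `locate_nvcc` is hypothetical.

```python
import os
import shutil


def locate_nvcc(environ=None):
    """Return the path to nvcc under CUDA_HOME, falling back to PATH, else None."""
    environ = os.environ if environ is None else environ
    cuda_home = environ.get("CUDA_HOME")
    if cuda_home:
        candidate = os.path.join(cuda_home, "bin", "nvcc")
        if os.path.isfile(candidate):
            return candidate
    # Fall back to the first nvcc on PATH (this may be a different CUDA version).
    return shutil.which("nvcc", path=environ.get("PATH"))


if __name__ == "__main__":
    nvcc = locate_nvcc()
    print("nvcc found at:" if nvcc else "nvcc not found", nvcc or "")
```

If this reports that `nvcc` was not found, the Grounding DINO build is unlikely to compile the CUDA operator, so it is worth fixing `CUDA_HOME` first.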
Install Segment Anything 2:
pip install -e .
Install Grounding DINO:
pip install --no-build-isolation -e grounding_dino
Build the Docker image and run the Docker container:
cd Grounded-SAM-2
make build-image
make run
After executing these commands, you will be inside the Docker environment. The working directory within the container is set to: /home/appuser/Grounded-SAM-2
Once inside the Docker environment, you can start the demo by running:
python grounded_sam2_tracking_demo.py
Note that Grounding DINO is already supported in Hugging Face, so we provide two choices for running the Grounded SAM 2 model:
python grounded_sam2_hf_model_demo.py
[!NOTE] 🚨 If you encounter network issues while using the HuggingFace model, you can resolve them by setting the appropriate mirror source as export HF_ENDPOINT=https://hf-mirror.com
python grounded_sam2_local_demo.py
We've already released our most capable open-set detection models, Grounding DINO 1.5 & 1.6, which can be combined with SAM 2 for stronger open-set detection and segmentation capability. You can apply for an API token first and then run Grounded SAM 2 with Grounding DINO 1.5 as follows:
Install the latest DDS cloudapi:
pip install dds-cloudapi-sdk --upgrade
Apply for your API token on our official website here: request API token.
python grounded_sam2_gd1.5_demo.py
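If you prefer not to hard-code the token in a script, one option is to read it from an environment variable. The sketch below is an assumption about workflow, not part of the demo scripts: the variable name `GD15_API_TOKEN` and the helper `load_api_token` are both hypothetical.

```python
import os


def load_api_token(environ=None, var_name="GD15_API_TOKEN"):
    """Read the API token from an environment variable (hypothetical name)."""
    environ = os.environ if environ is None else environ
    token = environ.get(var_name, "").strip()
    if not token:
        raise RuntimeError(f"Set {var_name} to your Grounding DINO 1.5 API token")
    return token
```

The returned string can then be assigned to whichever token variable the demo script you are running expects.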
If your images are high resolution with dense objects, running Grounding DINO 1.5 directly on the original image may not be the best choice. We support SAHI (Slicing Aided Hyper Inference), which first divides the original image into smaller overlapping patches, then runs inference separately on each patch, and finally merges the detection results. This method is effective and accurate for detecting dense and small objects in high-resolution images.
You can run SAHI inference by setting the following param in grounded_sam2_gd1.5_demo.py:
WITH_SLICE_INFERENCE = True
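To make the slicing idea concrete, here is a small pure-Python sketch of the three steps described above: computing overlapping windows, shifting per-window boxes back to full-image coordinates, and merging. It is not the SAHI library's implementation; the slice size, the overlap ratio and the use of plain greedy NMS for merging are assumptions chosen for illustration, and all function names are hypothetical.

```python
def _starts(length, size, stride):
    """Start offsets so that windows of `size` cover [0, length) with overlap."""
    if length <= size:
        return [0]
    starts = list(range(0, length - size, stride))
    starts.append(length - size)  # last window sits flush with the border
    return starts


def slice_windows(width, height, slice_size=640, overlap=0.2):
    """Return overlapping (x1, y1, x2, y2) windows covering the image."""
    stride = max(1, int(slice_size * (1.0 - overlap)))
    return [
        (x, y, min(x + slice_size, width), min(y + slice_size, height))
        for y in _starts(height, slice_size, stride)
        for x in _starts(width, slice_size, stride)
    ]


def to_full_image(box, window):
    """Shift a box predicted inside a window back to full-image coordinates."""
    ox, oy = window[0], window[1]
    x1, y1, x2, y2 = box
    return (x1 + ox, y1 + oy, x2 + ox, y2 + oy)


def iou(a, b):
    """Intersection over union of two xyxy boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def merge_detections(detections, iou_thr=0.5):
    """Greedy NMS over (box, score) pairs gathered from all windows."""
    kept = []
    for box, score in sorted(detections, key=lambda d: d[1], reverse=True):
        if all(iou(box, kept_box) < iou_thr for kept_box, _ in kept):
            kept.append((box, score))
    return kept
```

In use, a detector would be run on each window crop, every resulting box passed through `to_full_image`, and the combined list handed to `merge_detections` so that objects seen in two overlapping windows are reported once.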
The visualization is shown as follows:
Text Prompt | Input Image | Grounded SAM 2 | Grounded SAM 2 with SAHI |
---|---|---|---|
Person |
After setting DUMP_JSON_RESULTS=True in the following Grounded SAM 2 Image Demos:
The grounding and segmentation results will be automatically saved in the outputs dir with the following format:
{
    "image_path": "path/to/image.jpg",
    "annotations": [
        {
            "class_name": "class_name",
            "bbox": [x1, y1, x2, y2],
            "segmentation": {
                "size": [h, w],
                "counts": "rle_encoded_mask"
            },
            "score": confidence score
        }
    ],
    "box_format": "xyxy",
    "img_width": w,
    "img_height": h
}
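A dumped result in this layout can be read back with the standard `json` module. The sketch below only uses the field names shown above; the helper name `load_annotations` is hypothetical, and the conversion to `[x, y, w, h]` is included purely as an example of post-processing. The `segmentation` field appears to follow the COCO-style RLE layout, in which case `pycocotools` should be able to decode it, but that step is an assumption and is not exercised here.

```python
import json


def load_annotations(json_text):
    """Parse a dumped result into (class_name, [x, y, w, h], score) tuples."""
    result = json.loads(json_text)
    if result.get("box_format") != "xyxy":
        raise ValueError("this sketch only handles the 'xyxy' box format")
    rows = []
    for ann in result["annotations"]:
        x1, y1, x2, y2 = ann["bbox"]
        # Convert corner format (xyxy) to origin plus size (xywh).
        rows.append((ann["class_name"], [x1, y1, x2 - x1, y2 - y1], ann["score"]))
    return rows
```

For a file on disk, pass the file contents, for example `load_annotations(open(path).read())`, where `path` is wherever your run wrote its JSON.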
Building on the strong tracking capability of SAM 2, we can combine it with Grounding DINO for open-set object segmentation and tracking. You can run the following script to get tracking results with Grounded SAM 2:
python grounded_sam2_tracking_demo.py
The tracking results of each frame are saved in ./tracking_results, and the video is saved as children_tracking_demo_video.mp4.
We support different types of prompts for the Grounded SAM 2 tracking demo:
We also support a video object tracking demo based on our stronger Grounding DINO 1.5 model and SAM 2. You can try the following demo after applying for the API key needed to run Grounding DINO 1.5:
python grounded_sam2_tracking_demo_with_gd1.5.py
Users can upload their own video file (e.g. assets/hippopotamus.mp4) and specify their custom text prompts for grounding and tracking with Grounding DINO and SAM 2 by using the following script:
python grounded_sam2_tracking_demo_custom_video_input_gd1.0_hf_model.py
If it is not convenient for you to use the Hugging Face demo, you can also run the tracking demo with a local Grounding DINO model using the following script:
python grounded_sam2_tracking_demo_custom_video_input_gd1.0_local_model.py
Users can upload their own video file (e.g. assets/hippopotamus.mp4) and specify their custom text prompts for grounding and tracking with Grounding DINO 1.5 and SAM 2 by using the following script:
python grounded_sam2_tracking_demo_custom_video_input_gd1.5.py
You can specify the params in this file:
VIDEO_PATH = "./assets/hippopotamus.mp4"
TEXT_PROMPT = "hippopotamus."
OUTPUT_VIDEO_PATH = "./hippopotamus_tracking_demo.mp4"
API_TOKEN_FOR_GD1_5 = "Your API token" # api token for G-DINO 1.5
PROMPT_TYPE_FOR_VIDEO = "mask" # using SAM 2 mask prediction as prompt for video predictor
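Note that the `TEXT_PROMPT` value above is lowercase and ends with a dot, which matches the prompt convention documented for Grounding DINO. When the class names come from elsewhere (a config file, user input), a small helper can normalize them. This is a sketch under that assumption; `build_text_prompt` is a hypothetical name, and whether the Grounding DINO 1.5 API strictly requires this format is not stated here.

```python
def build_text_prompt(class_names):
    """Join class names into one lowercase prompt, each phrase ending with a dot."""
    phrases = [name.strip().lower().rstrip(".").strip() for name in class_names]
    phrases = [phrase for phrase in phrases if phrase]
    if not phrases:
        raise ValueError("at least one class name is required")
    return " ".join(phrase + "." for phrase in phrases)


TEXT_PROMPT = build_text_prompt(["Hippopotamus"])
```

With a single class this reproduces the value used above; with several classes each phrase gets its own trailing dot.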
After running our demo code, you can get the tracking results as follows:
The tracking visualization results are automatically saved to OUTPUT_VIDEO_PATH.
[!WARNING] We initialize the box prompts on the first frame of the input video. If you want to start from a different frame, you can change ann_frame_idx yourself in our code.
In the demos above, we only prompt Grounded SAM 2 on one specific frame, which makes it hard to pick up new objects that appear later in the video. In this demo, we try to find new objects and assign them new IDs across the whole video. This feature is still under development and is not yet stable.
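The core bookkeeping problem here is deciding whether a fresh detection is an object that is already being tracked or a new one that needs a new ID. One simple way to sketch it is to match by overlap: a detection that overlaps a tracked object strongly enough inherits its ID, and anything unmatched gets the next free ID. The code below is an illustration only, not the repository's implementation; it matches on box IoU for brevity (the actual scripts may well compare masks instead), and the names and the 0.5 threshold are assumptions.

```python
def box_iou(a, b):
    """Intersection over union of two xyxy boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def assign_ids(tracked, detections, next_id, iou_thr=0.5):
    """Match detections to tracked objects; unmatched detections get new IDs.

    tracked: dict mapping object ID -> last known box.
    detections: list of boxes found on the current frame.
    Returns (dict of ID -> box for this frame, updated next_id).
    """
    assigned = {}
    free = dict(tracked)  # tracked objects not yet claimed by a detection
    for det in detections:
        best_id, best_iou = None, iou_thr
        for obj_id, box in free.items():
            score = box_iou(det, box)
            if score >= best_iou:
                best_id, best_iou = obj_id, score
        if best_id is None:
            best_id, next_id = next_id, next_id + 1  # new object
        else:
            del free[best_id]  # each tracked ID is reused at most once
        assigned[best_id] = det
    return assigned, next_id
```

Calling `assign_ids` each time the detector is rerun keeps IDs stable for objects that persist, while objects entering the scene later receive IDs that were never used before.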
Users can upload their own video files and specify custom text prompts for grounding and tracking using the Grounding DINO and SAM 2 frameworks. To do this, execute the script:
python grounded_sam2_tracking_demo_with_continuous_id.py
You can customize various parameters including:
- text: The grounding text prompt.
- video_dir: Directory containing the video files.
- output_dir: Directory to save the processed output.
- output_video_path: Path for the output video.
- step: Frame stepping for processing.
- box_threshold: Box threshold for the Grounding DINO model.
- text_threshold: Text threshold for the Grounding DINO model.
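As a rough illustration of what a frame step implies, the sketch below splits a video into chunks of `step` frames. It assumes that the detector is rerun on the first frame of each chunk and the tracker then propagates through the rest of the chunk; that reading of `step` is an assumption, and `grounding_schedule` is a hypothetical helper rather than a function from the script.

```python
def grounding_schedule(num_frames, step):
    """Yield (start, end_exclusive) chunks; grounding is assumed to run at each start."""
    if step <= 0:
        raise ValueError("step must be a positive integer")
    for start in range(0, num_frames, step):
        yield start, min(start + step, num_frames)


if __name__ == "__main__":
    for start, end in grounding_schedule(num_frames=45, step=20):
        print(f"detect on frame {start}, track frames {start}..{end - 1}")
```

Under this assumption, a smaller `step` picks up new objects sooner at the cost of more detector calls, while a larger `step` is cheaper but reacts more slowly.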
Note: This method supports only the mask type of text prompt.

After running our demo code, you can get the tracking results as follows:
If you want to try the Grounding DINO 1.5 model, you can run the following script after setting your API token:
python grounded_sam2_tracking_demo_with_continuous_id_gd1.5.py
This method can cover the whole lifetime of the object:
python grounded_sam2_tracking_demo_with_continuous_id_plus.py
In this section, we will explore how to integrate the feature-rich and robust open-source models Florence-2 and SAM 2 to develop practical applications.
Florence-2 is a powerful vision foundation model by Microsoft that supports a series of vision tasks, each selected by prompting with a special task_prompt. These include, but are not limited to:
Task | Task Prompt | Text Input | Task Introduction |
---|---|---|---|
Object Detection | <OD> | ✘ | Detect main objects with single category name |
Dense Region Caption | <DENSE_REGION_CAPTION> | ✘ | Detect main objects with short description |
Region Proposal | <REGION_PROPOSAL> | ✘ | Generate proposals without category name |
Phrase Grounding | <CAPTION_TO_PHRASE_GROUNDING> | ✔ | Ground main objects in image mentioned in caption |
Referring Expression Segmentation | <REFERRING_EXPRESSION_SEGMENTATION> | ✔ | Ground the object which is most related to the text input |
Open Vocabulary Detection and Segmentation | <OPEN_VOCABULARY_DETECTION> | ✔ | Ground any object with text input |
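The table above can be captured as a small lookup that also enforces the Text Input column. In Florence-2's published usage the final prompt is the task prompt followed directly by the text input (if any), which is what this sketch assumes; the dictionary keys and the function name `build_florence2_prompt` are hypothetical and chosen only for illustration.

```python
# task name -> (task prompt, whether a text input is required)
TASK_PROMPTS = {
    "object_detection": ("<OD>", False),
    "dense_region_caption": ("<DENSE_REGION_CAPTION>", False),
    "region_proposal": ("<REGION_PROPOSAL>", False),
    "phrase_grounding": ("<CAPTION_TO_PHRASE_GROUNDING>", True),
    "referring_expression_segmentation": ("<REFERRING_EXPRESSION_SEGMENTATION>", True),
    "open_vocabulary_detection": ("<OPEN_VOCABULARY_DETECTION>", True),
}


def build_florence2_prompt(task, text_input=None):
    """Return the prompt string for a task, checking the text-input requirement."""
    task_prompt, needs_text = TASK_PROMPTS[task]
    if needs_text:
        if not text_input:
            raise ValueError(f"task {task!r} requires a text input")
        return task_prompt + text_input
    if text_input:
        raise ValueError(f"task {task!r} does not take a text input")
    return task_prompt
```

The resulting string is what would be tokenized together with the image and passed to the model.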
By integrating Florence-2 with SAM 2, we can build a strong vision pipeline to solve complex vision tasks. You can try the following scripts to run the demos:
[!NOTE] 🚨 If you encounter network issues while using the HuggingFace model, you can resolve them by setting the appropriate mirror source as export HF_ENDPOINT=https://hf-mirror.com
Object Detection and Segmentation
python grounded_sam2_florence2_image_demo.py \
--pipeline object_detection_segmentation \
--image_path ./notebooks/images/cars.jpg
Dense Region Caption and Segmentation
python grounded_sam2_florence2_image_demo.py \
--pipeline dense_region_caption_segmentation \
--image_path ./notebooks/images/cars.jpg
Region Proposal and Segmentation
python grounded_sam2_florence2_image_demo.py \
--pipeline region_proposal_segmentation \
--image_path ./notebooks/images/cars.jpg
Phrase Grounding and Segmentation
python grounded_sam2_florence2_image_demo.py \
--pipeline phrase_grounding_segmentation \
--image_path ./notebooks/images/cars.jpg \
--text_input "The image shows two vintage Chevrolet cars parked side by side, with one being a red convertible and the other a pink sedan, \
set against the backdrop of an urban area with a multi-story building and trees. \
The cars have Cuban license plates, indicating a location likely in Cuba."
Referring Expression Segmentation
python grounded_sam2_florence2_image_demo.py \
--pipeline referring_expression_segmentation \
--image_path ./notebooks/images/cars.jpg \
--text_input "The left red car."
Open-Vocabulary Detection and Segmentation
python grounded_sam2_florence2_image_demo.py \
--pipeline open_vocabulary_detection_segmentation \
--image_path ./notebooks/images/cars.jpg \
--text_input "car <and> building"
To detect multiple classes, separate them with <and> in your input text.

Florence-2 can be used as an automatic image annotator by cascading its caption capability with its grounding capability.
Task | Task Prompt | Text Input |
---|---|---|
Caption + Phrase Grounding | <CAPTION> + <CAPTION_TO_PHRASE_GROUNDING> | ✘ |
Detailed Caption + Phrase Grounding | <DETAILED_CAPTION> + <CAPTION_TO_PHRASE_GROUNDING> | ✘ |
More Detailed Caption + Phrase Grounding | <MORE_DETAILED_CAPTION> + <CAPTION_TO_PHRASE_GROUNDING> | ✘ |
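The cascade in this table is two model calls in sequence: generate a caption, then feed that caption back in as the text input for phrase grounding. The sketch below shows only that control flow. `run_task` is a hypothetical callable that you would implement around Florence-2 (it is not an API from this repository), and the dictionary keys are illustrative names for the three caption granularities.

```python
CAPTION_PROMPTS = {
    "caption": "<CAPTION>",
    "detailed_caption": "<DETAILED_CAPTION>",
    "more_detailed_caption": "<MORE_DETAILED_CAPTION>",
}


def autolabel(image, run_task, caption_type="caption"):
    """Caption the image, then ground the phrases of that caption.

    run_task(image, task_prompt, text_input) is a hypothetical wrapper that
    runs one Florence-2 task and returns its parsed result.
    """
    caption = run_task(image, CAPTION_PROMPTS[caption_type], None)
    grounding = run_task(image, "<CAPTION_TO_PHRASE_GROUNDING>", caption)
    return caption, grounding
```

No text input is needed from the user because the first stage produces it. The grounded boxes from the second stage are what would then be handed to SAM 2 as prompts for segmentation.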
You can try the following scripts to run these demos:
Caption to Phrase Grounding
python grounded_sam2_florence2_autolabel_pipeline.py \
--image_path ./notebooks/images/groceries.jpg \
--pipeline caption_to_phrase_grounding \
--caption_type caption
You can specify caption_type to control the granularity of the caption. If you want a more detailed caption, try --caption_type detailed_caption or --caption_type more_detailed_caption.

If you find this project helpful for your research, please consider citing the following BibTeX entries.
@misc{ravi2024sam2segmentimages,
title={SAM 2: Segment Anything in Images and Videos},
author={Nikhila Ravi and Valentin Gabeur and Yuan-Ting Hu and Ronghang Hu and Chaitanya Ryali and Tengyu Ma and Haitham Khedr and Roman Rädle and Chloe Rolland and Laura Gustafson and Eric Mintun and Junting Pan and Kalyan Vasudev Alwala and Nicolas Carion and Chao-Yuan Wu and Ross Girshick and Piotr Dollár and Christoph Feichtenhofer},
year={2024},
eprint={2408.00714},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2408.00714},
}
@article{liu2023grounding,
title={Grounding dino: Marrying dino with grounded pre-training for open-set object detection},
author={Liu, Shilong and Zeng, Zhaoyang and Ren, Tianhe and Li, Feng and Zhang, Hao and Yang, Jie and Li, Chunyuan and Yang, Jianwei and Su, Hang and Zhu, Jun and others},
journal={arXiv preprint arXiv:2303.05499},
year={2023}
}
@misc{ren2024grounding,
title={Grounding DINO 1.5: Advance the "Edge" of Open-Set Object Detection},
author={Tianhe Ren and Qing Jiang and Shilong Liu and Zhaoyang Zeng and Wenlong Liu and Han Gao and Hongjie Huang and Zhengyu Ma and Xiaoke Jiang and Yihao Chen and Yuda Xiong and Hao Zhang and Feng Li and Peijun Tang and Kent Yu and Lei Zhang},
year={2024},
eprint={2405.10300},
archivePrefix={arXiv},
primaryClass={cs.CV}
}
@misc{ren2024grounded,
title={Grounded SAM: Assembling Open-World Models for Diverse Visual Tasks},
author={Tianhe Ren and Shilong Liu and Ailing Zeng and Jing Lin and Kunchang Li and He Cao and Jiayu Chen and Xinyu Huang and Yukang Chen and Feng Yan and Zhaoyang Zeng and Hao Zhang and Feng Li and Jie Yang and Hongyang Li and Qing Jiang and Lei Zhang},
year={2024},
eprint={2401.14159},
archivePrefix={arXiv},
primaryClass={cs.CV}
}
@article{kirillov2023segany,
title={Segment Anything},
author={Kirillov, Alexander and Mintun, Eric and Ravi, Nikhila and Mao, Hanzi and Rolland, Chloe and Gustafson, Laura and Xiao, Tete and Whitehead, Spencer and Berg, Alexander C. and Lo, Wan-Yen and Doll{\'a}r, Piotr and Girshick, Ross},
journal={arXiv:2304.02643},
year={2023}
}
@misc{jiang2024trex2,
title={T-Rex2: Towards Generic Object Detection via Text-Visual Prompt Synergy},
author={Qing Jiang and Feng Li and Zhaoyang Zeng and Tianhe Ren and Shilong Liu and Lei Zhang},
year={2024},
eprint={2403.14610},
archivePrefix={arXiv},
primaryClass={cs.CV}
}