IDEA-Research / Grounded-Segment-Anything

Grounded SAM: Marrying Grounding DINO with Segment Anything & Stable Diffusion & Recognize Anything - Automatically Detect, Segment and Generate Anything
https://arxiv.org/abs/2401.14159
Apache License 2.0

Weird inference time for grounding_dino with vit_h and vit_tiny #470

Open stupidyoh opened 3 months ago

stupidyoh commented 3 months ago

Hello! Thank you for your great work.

Recently, I tested several of the provided demo scripts, such as "grounded_light_hqsam" and "grounded_sam_simple_demo", and I got some odd results for the following code.

(First part)

```python
detections = grounding_dino_model.predict_with_classes(
    image=image,
    classes=CLASSES,
    box_threshold=BOX_THRESHOLD,
    text_threshold=BOX_THRESHOLD
)
```

(Second part)

```python
detections.mask = segment(
    sam_predictor=sam_predictor,
    image=cv2.cvtColor(image, cv2.COLOR_BGR2RGB),
    xyxy=detections.xyxy
)
```

For grounded_light_hqsam, which uses "vit_h" for the SAM encoder, the first part takes 1.574 seconds and the second part takes 0.611 seconds. For grounded_sam_simple_demo, which uses "vit_tiny", the first part takes 2.177 seconds and the second part takes 0.136 seconds.
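For reference, this is roughly how I am timing the two parts (a minimal sketch; `grounding_dino_model`, `sam_predictor`, `image`, `CLASSES`, and the thresholds are set up as in the demo scripts):

```python
import time

import cv2

# Time the Grounding DINO detection step (first part).
t0 = time.perf_counter()
detections = grounding_dino_model.predict_with_classes(
    image=image,
    classes=CLASSES,
    box_threshold=BOX_THRESHOLD,
    text_threshold=BOX_THRESHOLD
)
t1 = time.perf_counter()

# Time the SAM segmentation step (second part).
detections.mask = segment(
    sam_predictor=sam_predictor,
    image=cv2.cvtColor(image, cv2.COLOR_BGR2RGB),
    xyxy=detections.xyxy
)
t2 = time.perf_counter()

print(f"grounding_dino: {t1 - t0:.3f} s, sam: {t2 - t1:.3f} s")
```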

In my opinion, the shorter time for the second part makes sense because vit_tiny is a lighter model. But I have no idea why the first part takes longer in the vit_tiny setup, since that part only runs Grounding DINO and should not depend on the SAM encoder at all.

I want to use these models in real time, so I need the inference to be as fast as possible. I would appreciate any advice on why this result comes out and how to reduce the time.

Thank you!

stupidyoh commented 3 months ago

Sorry, I should add that the timings differ on every run, and the variation is larger than I expected.
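One thing that may matter here: GPU timings are only reliable after a warm-up pass and with explicit synchronization before reading the clock, otherwise the first call pays for CUDA initialization and kernel compilation. A minimal sketch of what I mean, assuming the models run on a CUDA device with PyTorch (the demo variables are assumed to be set up as above):

```python
import time

import torch

def timed(fn, warmup=3, runs=10):
    """Time a GPU-backed callable: warm up first, then average over several
    runs, calling torch.cuda.synchronize() so pending kernels are counted."""
    for _ in range(warmup):
        fn()
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(runs):
        fn()
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / runs

dino_time = timed(lambda: grounding_dino_model.predict_with_classes(
    image=image,
    classes=CLASSES,
    box_threshold=BOX_THRESHOLD,
    text_threshold=BOX_THRESHOLD
))
print(f"grounding_dino: {dino_time:.3f} s per call")
```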