IDEA-Research / Grounded-Segment-Anything

Grounded SAM: Marrying Grounding DINO with Segment Anything & Stable Diffusion & Recognize Anything - Automatically Detect, Segment and Generate Anything
https://arxiv.org/abs/2401.14159
Apache License 2.0

Weird inference time for grounding_dino with vit_h and vit_tiny #470

Open stupidyoh opened 3 months ago

stupidyoh commented 3 months ago

Hello! Thank you for your great work.

Recently, I tested several of the provided demo scripts, such as "grounded_light_hqsam" and "grounded_sam_simple_demo", and I got some odd results for the following code.

(First part)

```python
detections = grounding_dino_model.predict_with_classes(
    image=image,
    classes=CLASSES,
    box_threshold=BOX_THRESHOLD,
    text_threshold=BOX_THRESHOLD
)
```

(Second part)

```python
detections.mask = segment(
    sam_predictor=sam_predictor,
    image=cv2.cvtColor(image, cv2.COLOR_BGR2RGB),
    xyxy=detections.xyxy
)
```

For grounded_light_hqsam, which uses "vit_h" for the SAM encoder, the first part takes 1.574 seconds and the second part takes 0.611 seconds. For grounded_sam_simple_demo, which uses "vit_tiny", the first part takes 2.177 seconds and the second part takes 0.136 seconds.
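For reference, this is roughly how I am timing the two parts (a minimal sketch; `grounding_dino_model`, `sam_predictor`, `image`, `CLASSES`, and the thresholds are set up as in the demo scripts):

```python
import time

import cv2

# Time the Grounding DINO detection step (first part).
t0 = time.perf_counter()
detections = grounding_dino_model.predict_with_classes(
    image=image,
    classes=CLASSES,
    box_threshold=BOX_THRESHOLD,
    text_threshold=BOX_THRESHOLD
)
t1 = time.perf_counter()

# Time the SAM segmentation step (second part).
detections.mask = segment(
    sam_predictor=sam_predictor,
    image=cv2.cvtColor(image, cv2.COLOR_BGR2RGB),
    xyxy=detections.xyxy
)
t2 = time.perf_counter()

print(f"grounding_dino: {t1 - t0:.3f} s, sam: {t2 - t1:.3f} s")
```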

In my opinion, the shorter time for the second part makes sense because vit_tiny is a lighter model. But I have no idea why the first part takes longer in the vit_tiny setup, since that part only runs Grounding DINO and should not depend on the SAM encoder at all.

I want to use these models in real time, so I need the inference to be as fast as possible. I would appreciate any advice on why this result comes out and how to reduce the time.

Thank you!

stupidyoh commented 3 months ago

Sorry, I should add that the timings differ on every run, and the variation is larger than I expected.
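One thing that may matter here: GPU timings are only reliable after a warm-up pass and with explicit synchronization before reading the clock, otherwise the first call pays for CUDA initialization and kernel compilation. A minimal sketch of what I mean, assuming the models run on a CUDA device with PyTorch (the demo variables are assumed to be set up as above):

```python
import time

import torch

def timed(fn, warmup=3, runs=10):
    """Time a GPU-backed callable: warm up first, then average over several
    runs, calling torch.cuda.synchronize() so pending kernels are counted."""
    for _ in range(warmup):
        fn()
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(runs):
        fn()
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / runs

dino_time = timed(lambda: grounding_dino_model.predict_with_classes(
    image=image,
    classes=CLASSES,
    box_threshold=BOX_THRESHOLD,
    text_threshold=BOX_THRESHOLD
))
print(f"grounding_dino: {dino_time:.3f} s per call")
```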