Hello! Thank you for your great work.
Recently, I tested several of the provided demos, such as "grounded_light_hqsam" and "grounded_sam_simple_demo", and I got some strange timing results for the following code.
(First part) detections = grounding_dino_model.predict_with_classes( image=image, classes=CLASSES, box_threshold=BOX_THRESHOLD, text_threshold=BOX_THRESHOLD )
(Second part) detections.mask = segment( sam_predictor=sam_predictor, image=cv2.cvtColor(image, cv2.COLOR_BGR2RGB), xyxy=detections.xyxy )
For grounded_light_hqsam using "vit_h" as the SAM encoder, the first part takes 1.574 seconds and the second part takes 0.611 seconds. For grounded_sam_simple_demo using "vit_tiny", the first part takes 2.177 seconds and the second part takes 0.136 seconds.
The shorter time for the second part makes sense to me, because vit_tiny is a lighter model. But I have no idea why the first part takes longer in the vit_tiny run, even though that call only uses grounding_dino_model and not the SAM encoder.
I want to use these models in real time, so I need them to run faster. I would appreciate any advice on why these results came out this way and how to reduce the time.
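In case the measurement method matters, here is a minimal sketch of how each part could be timed with warm-up runs excluded, so that one-time costs on the first call do not count against it. The helper `time_stage` and its `sync` parameter are hypothetical names, not part of either demo, and `fake_stage` is only a stand-in so the sketch runs without the models. On a GPU, `sync` would presumably be something like `torch.cuda.synchronize`, so that pending GPU work finishes before the clock is read.

```python
import time
from statistics import mean


def time_stage(fn, warmup=2, repeats=5, sync=None):
    """Return the mean wall-clock seconds of fn() over `repeats` timed runs.

    The first `warmup` calls are executed but not timed.
    `sync` is an optional callable invoked right before each clock read
    (hypothetical hook; e.g. torch.cuda.synchronize when running on a GPU).
    """
    def _sync():
        if sync is not None:
            sync()

    # Untimed warm-up calls absorb one-time setup costs.
    for _ in range(warmup):
        fn()

    durations = []
    for _ in range(repeats):
        _sync()
        start = time.perf_counter()
        fn()
        _sync()
        durations.append(time.perf_counter() - start)
    return mean(durations)


# Stand-in workload (assumption): replace with a lambda wrapping the
# first-part or second-part call from the post.
calls = []


def fake_stage():
    calls.append(1)
    time.sleep(0.01)


avg = time_stage(fake_stage, warmup=2, repeats=3)
print(f"mean: {avg:.4f} s over 3 timed runs ({len(calls)} calls total)")
```

With the real models, the first part could be wrapped as `lambda: grounding_dino_model.predict_with_classes(...)` using the same arguments as above, and likewise for the `segment(...)` call.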
Thank you!