abrichr opened this issue 1 year ago
Grounding DINO is trained on the COCO dataset, which consists largely of open-world natural images. From what I tried when I used the Space, object classification for GUI screenshots was poor.
Grounding DINO on GUI data

Result: (annotated screenshot omitted)

Steps to reproduce: I used the default box and text threshold values (0.25 each) and "browser tab" as the detection prompt.
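For context, here is a minimal sketch of that run, assuming the inference helpers from the GroundingDINO repo's README; the file paths are placeholders:

```python
# Minimal sketch of the run above, using the helpers from
# https://github.com/IDEA-Research/GroundingDINO. Paths are placeholders.
import cv2
from groundingdino.util.inference import load_model, load_image, predict, annotate

model = load_model(
    "groundingdino/config/GroundingDINO_SwinT_OGC.py",  # config shipped with the repo
    "weights/groundingdino_swint_ogc.pth",              # downloaded checkpoint
)
image_source, image = load_image("gui_screenshot.png")  # placeholder path

boxes, logits, phrases = predict(
    model=model,
    image=image,
    caption="browser tab",  # detection prompt from the comment above
    box_threshold=0.25,     # default thresholds mentioned above
    text_threshold=0.25,
)

annotated = annotate(image_source=image_source, boxes=boxes, logits=logits, phrases=phrases)
cv2.imwrite("annotated_screenshot.png", annotated)  # annotate() returns a BGR array
```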
@FFFiend that's not bad for a first try out of the box!
Can you please run this through a list of user-interface component names? e.g. https://chat.openai.com/share/bcecf257-500e-446f-90ba-2ca5713de34d
Prompt: "window, tab, panel, menu, submenu, button, icon, text field, text area, search bar, logo". Here are a couple of runs on the same image with their respective box and text threshold values:
Box: 0.25, Text: 0.25
Box: 0.125, Text: 0.125
Promising results! Although I imagine the misclassification rate is higher since the threshold values are halved.
Box: 0.0625, Text: 0.0625
Box: 0.092, Text: 0.092
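For reproducibility, a sketch of sweeping the thresholds over the values tried above; it reuses `model` and `image` from the earlier snippet, and the loop itself is my assumption, not something from the thread:

```python
# Sketch of a threshold sweep over the values tried above; `model` and `image`
# come from the earlier snippet. Lower thresholds keep more (noisier) detections.
PROMPT = ("window, tab, panel, menu, submenu, button, icon, "
          "text field, text area, search bar, logo")

for threshold in (0.25, 0.125, 0.092, 0.0625):
    boxes, logits, phrases = predict(
        model=model,
        image=image,
        caption=PROMPT,
        box_threshold=threshold,
        text_threshold=threshold,
    )
    print(f"threshold={threshold}: {len(phrases)} detections: {phrases}")
```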
@FFFiend is it possible to modify the temperature, so that we can run inference multiple times and expect different results?
I searched around a few repos and couldn't find anything that enables modifying the temperature, no. It seems box and text threshold are the only parameters they let us modify.
I did find this, however: https://huggingface.co/spaces/yizhangliu/Grounded-Segment-Anything, a combination of Grounding DINO and SAM on a Hugging Face Space.
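My understanding is that pipelines like that Space chain the two models: Grounding DINO's text-prompted boxes become SAM's box prompts. A rough sketch under that assumption (the checkpoint path is a placeholder; `image_source` and `boxes` come from the snippets above):

```python
# Rough sketch of the Grounding DINO -> SAM hand-off used by Grounded-SAM-style
# pipelines: text-prompted boxes become SAM box prompts.
import torch
from groundingdino.util import box_ops
from segment_anything import sam_model_registry, SamPredictor

sam = sam_model_registry["vit_h"](checkpoint="weights/sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)
predictor.set_image(image_source)  # RGB numpy array from load_image()

# Grounding DINO returns normalized cxcywh boxes; SAM wants absolute xyxy.
h, w, _ = image_source.shape
boxes_xyxy = box_ops.box_cxcywh_to_xyxy(boxes) * torch.tensor([w, h, w, h])

# Segment the first detected element (loop over boxes_xyxy for all of them).
masks, scores, _ = predictor.predict(box=boxes_xyxy[0].numpy(), multimask_output=False)
```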
Another relevant repo: https://github.com/UX-Decoder/Segment-Everything-Everywhere-All-At-Once
Feature request
How can we take advantage of https://github.com/IDEA-Research/GroundingDINO? How does it compare with Segment Anything (https://github.com/MLDSAI/OpenAdapt/issues/15 / https://github.com/MLDSAI/OpenAdapt/blob/main/openadapt/strategies/mixins/sam.py)? A hypothetical sketch of a Grounding DINO mixin follows the links below.
https://arxiv.org/abs/2303.05499
https://huggingface.co/spaces/ShilongLiu/Grounding_DINO_demo
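To make the comparison with the SAM mixin concrete, here is a purely hypothetical sketch of what a Grounding DINO mixin could look like; the class and method names are invented for illustration and do not reflect OpenAdapt's actual API:

```python
# Hypothetical sketch only: class/method names are invented for illustration
# and are not OpenAdapt's actual API. Mirrors the idea of the SAM mixin, but
# takes text prompts instead of point/box prompts.
from groundingdino.util.inference import load_model, load_image, predict


class GroundingDINOReplayStrategyMixin:
    """Detect UI elements in screenshots from a natural-language prompt."""

    def __init__(self, config_path: str, checkpoint_path: str):
        self.model = load_model(config_path, checkpoint_path)

    def get_ui_element_bboxes(self, screenshot_path: str, prompt: str, threshold: float = 0.25):
        """Return (normalized cxcywh boxes, matched phrases) for the prompt."""
        _, image = load_image(screenshot_path)
        boxes, logits, phrases = predict(
            model=self.model,
            image=image,
            caption=prompt,
            box_threshold=threshold,
            text_threshold=threshold,
        )
        return boxes, phrases
```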
Motivation
https://github.com/MLDSAI/OpenAdapt/pull/174#issuecomment-1595852156