OpenAdaptAI / OpenAdapt

AI-First Process Automation with Large ([Language (LLMs) / Action (LAMs) / Multimodal (LMMs)] / Visual Language (VLMs)) Models
https://www.OpenAdapt.AI
MIT License

Explore Grounding Dino #288

abrichr opened this issue 1 year ago

abrichr commented 1 year ago

Feature request

How can we take advantage of https://github.com/IDEA-Research/GroundingDINO? How does it compare with SegmentAnything (https://github.com/MLDSAI/OpenAdapt/issues/15 / https://github.com/MLDSAI/OpenAdapt/blob/main/openadapt/strategies/mixins/sam.py)?

https://arxiv.org/abs/2303.05499

https://huggingface.co/spaces/ShilongLiu/Grounding_DINO_demo

Motivation

https://github.com/MLDSAI/OpenAdapt/pull/174#issuecomment-1595852156

FFFiend commented 1 year ago

Grounding DINO is trained on the COCO dataset, which consists largely of open-world images. From what I observed when I tried the demo Space, object classification on GUI screenshots was poor.

FFFiend commented 1 year ago

Grounding DINO on GUI data

[Input screenshot: Screen Shot 2023-06-19 at 12 11 15 AM]

Result:

[Output screenshot: Screen Shot 2023-06-19 at 12 13 00 AM]

Steps to reproduce: I used the default box and text threshold values (0.25 each) and "browser tab" as the detection prompt.
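For reference, a minimal sketch of reproducing this locally with the `load_model` / `load_image` / `predict` / `annotate` helpers from the GroundingDINO README; the config, checkpoint, and screenshot paths are placeholders and should point at whatever you have downloaded locally:

```python
import cv2
from groundingdino.util.inference import load_model, load_image, predict, annotate

# Placeholder paths: adjust to where the config and weights live locally.
CONFIG_PATH = "groundingdino/config/GroundingDINO_SwinT_OGC.py"
WEIGHTS_PATH = "weights/groundingdino_swint_ogc.pth"

model = load_model(CONFIG_PATH, WEIGHTS_PATH)

# GUI screenshot to test on (placeholder filename).
image_source, image = load_image("screenshot.png")

# Same settings as above: default thresholds (0.25) and "browser tab" as the prompt.
boxes, logits, phrases = predict(
    model=model,
    image=image,
    caption="browser tab",
    box_threshold=0.25,
    text_threshold=0.25,
)

# Draw the detections and save the annotated screenshot.
annotated = annotate(image_source=image_source, boxes=boxes, logits=logits, phrases=phrases)
cv2.imwrite("annotated_screenshot.png", annotated)
print(list(zip(phrases, logits.tolist())))
```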

abrichr commented 1 year ago

@FFFiend that's not bad for a first try out of the box!

Can you please run this with a list of user interface component names? e.g. https://chat.openai.com/share/bcecf257-500e-446f-90ba-2ca5713de34d

FFFiend commented 1 year ago

Prompt: "window, tab, panel, menu, submenu, button, icon, text field, text area, search bar, logo". Here are a couple of runs on the same image with their respective box and text threshold values (see the sweep sketch after these results):

[image(2)] Box: 0.25, Text: 0.25

[image(3)] Box: 0.125, Text: 0.125

Promising results! Although I imagine the rate of misclassification is higher since the threshold values are halved.

[image(4)] Box: 0.0625, Text: 0.0625

[image(5)] Box: 0.092, Text: 0.092
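To make these runs easier to rerun, here is a rough sketch of sweeping the box/text threshold pairs with the UI-component prompt. It reuses the same GroundingDINO helpers as the earlier sketch; the threshold pairs come from the runs above, while the config, weights, and screenshot paths are again placeholders:

```python
import cv2
from groundingdino.util.inference import load_model, load_image, predict, annotate

UI_PROMPT = (
    "window, tab, panel, menu, submenu, button, icon, "
    "text field, text area, search bar, logo"
)
# (box_threshold, text_threshold) pairs from the runs above.
THRESHOLDS = [(0.25, 0.25), (0.125, 0.125), (0.0625, 0.0625), (0.092, 0.092)]

model = load_model(
    "groundingdino/config/GroundingDINO_SwinT_OGC.py",  # placeholder paths
    "weights/groundingdino_swint_ogc.pth",
)
image_source, image = load_image("screenshot.png")  # placeholder screenshot

for box_t, text_t in THRESHOLDS:
    boxes, logits, phrases = predict(
        model=model,
        image=image,
        caption=UI_PROMPT,
        box_threshold=box_t,
        text_threshold=text_t,
    )
    # Save one annotated image per threshold pair for side-by-side comparison.
    annotated = annotate(image_source=image_source, boxes=boxes, logits=logits, phrases=phrases)
    cv2.imwrite(f"annotated_box{box_t}_text{text_t}.png", annotated)
    print(f"box={box_t} text={text_t}: {len(boxes)} detections -> {sorted(set(phrases))}")
```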

abrichr commented 1 year ago

@FFFiend is it possible to modify the temperature, so that we can run inference multiple times and expect different results?

FFFiend commented 1 year ago

I searched around a few repos and couldn't find anything that lets us modify the temperature, no. It seems like the text and box thresholds are the only parameters we can adjust.

I did find this, however: https://huggingface.co/spaces/yizhangliu/Grounded-Segment-Anything, a combination of Grounding DINO and SAM on a Hugging Face Space.

Another relevant repo: https://github.com/UX-Decoder/Segment-Everything-Everywhere-All-At-Once
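For anyone exploring the Grounded-Segment-Anything direction mentioned above, here is a hedged sketch of the usual chaining pattern: Grounding DINO proposes boxes from a text prompt, and those boxes are passed to SAM's `SamPredictor` as box prompts. The checkpoint paths and the box-format conversion follow the Grounded-SAM demo code; treat this as an assumption-laden sketch rather than a tested OpenAdapt integration:

```python
import torch
from groundingdino.util import box_ops
from groundingdino.util.inference import load_model, load_image, predict
from segment_anything import SamPredictor, sam_model_registry

# Placeholder paths; adjust to local config/checkpoint locations.
gd_model = load_model(
    "groundingdino/config/GroundingDINO_SwinT_OGC.py",
    "weights/groundingdino_swint_ogc.pth",
)
sam = sam_model_registry["vit_h"](checkpoint="weights/sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)

image_source, image = load_image("screenshot.png")  # placeholder screenshot
boxes, logits, phrases = predict(
    model=gd_model,
    image=image,
    caption="browser tab",
    box_threshold=0.25,
    text_threshold=0.25,
)

# Grounding DINO returns normalized cxcywh boxes; SAM expects absolute xyxy pixels.
h, w, _ = image_source.shape
boxes_xyxy = box_ops.box_cxcywh_to_xyxy(boxes) * torch.tensor([w, h, w, h], dtype=torch.float32)

predictor.set_image(image_source)
masks = []
for box in boxes_xyxy.numpy():
    # One segmentation mask per detected box.
    mask, _, _ = predictor.predict(box=box, multimask_output=False)
    masks.append(mask[0])  # (H, W) boolean mask for this detection

print(f"{len(masks)} masks for phrases: {phrases}")
```

This is essentially what the Grounded-Segment-Anything Space does under the hood, and it would give us segmentation masks for GUI elements driven by a text prompt rather than SAM's automatic mask generation alone.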