OthersideAI / self-operating-computer

A framework to enable multimodal models to operate a computer.
https://www.hyperwriteai.com/self-operating-computer
MIT License

Integrate Set-of-Mark Visual Prompting for GPT-4V #3

Closed 0xdevalias closed 8 months ago

0xdevalias commented 9 months ago

I noticed that you currently seem to apply a grid to the images to assist the vision model.
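Roughly, I mean something like this (a minimal sketch with Pillow; not the repo's actual code, which may differ):

```python
# Minimal sketch of a labeled coordinate grid overlaid on a screenshot
# (assumes Pillow; the project's actual drawing code may differ).
from PIL import Image, ImageDraw

def draw_grid(path: str, step: int = 100) -> Image.Image:
    """Overlay a labeled grid so the model can reference rough XY regions."""
    img = Image.open(path).convert("RGB")
    draw = ImageDraw.Draw(img)
    w, h = img.size
    for x in range(0, w, step):
        draw.line([(x, 0), (x, h)], fill="red", width=1)
        draw.text((x + 2, 2), str(x), fill="red")
    for y in range(0, h, step):
        draw.line([(0, y), (w, y)], fill="red", width=1)
        draw.text((2, y + 2), str(y), fill="red")
    return img
```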

You also mention this in the README:

> Current Challenges
>
> Note: GPT-4V's error rate in estimating XY mouse click locations is currently quite high. This framework aims to track the progress of multimodal models over time, aspiring to achieve human-level performance in computer operation.

I was wondering, have you looked at using Set-of-Mark Visual Prompting for GPT-4V / similar techniques?

See Also

A bit of a link dump from one of my references:

0xdevalias commented 9 months ago

See also:

Daisuke134 commented 9 months ago

I am trying to implement SoM, since it seems to have the best accuracy.
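For reference, the core of SoM is to draw a numbered mark on each detected UI element and have the model pick a mark ID rather than guess raw pixel coordinates. A minimal sketch (names are illustrative; assumes bounding boxes already come from some detector):

```python
# Set-of-Mark sketch: number each detected element and remember its
# center pixel so a chosen mark maps back to a click target.
from PIL import Image, ImageDraw

def overlay_marks(img: Image.Image, boxes: list[tuple[int, int, int, int]]) -> dict[int, tuple[int, int]]:
    draw = ImageDraw.Draw(img)
    centers = {}
    for i, (x1, y1, x2, y2) in enumerate(boxes, start=1):
        draw.rectangle((x1, y1, x2, y2), outline="red", width=2)
        draw.text((x1 + 2, y1 + 2), str(i), fill="red")
        centers[i] = ((x1 + x2) // 2, (y1 + y2) // 2)
    return centers
```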

joshbickett commented 9 months ago

@Daisuke134 interested to see what you find. I'm going to go learn more about SoM

joshbickett commented 9 months ago

@0xdevalias read up more on SoM. It looks like a very promising approach, thank you for opening this issue!

https://github.com/microsoft/SoM

0xdevalias commented 9 months ago

> read up more on SoM. It looks like a very promising approach, thank you for opening this issue!

@joshbickett No worries :)

Daisuke134 commented 9 months ago

I have been testing out SoM and it seems pretty good. Here is a screenshot. I will try adding this today, test it, and make a PR.

[screenshot]

Daisuke134 commented 9 months ago

I am implementing SoM now, and it seems the best approach is to add another mode (e.g. som-mode) with a new prompt for that mode.
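Roughly, the mode-specific prompt would ask for a mark number instead of XY coordinates, and the framework would map the chosen mark back to a pixel. A rough sketch (the prompt wording and helper names are illustrative, not the final implementation):

```python
# Illustrative som-mode prompt plus mark-to-pixel mapping; the actual
# wording and helpers in the PR may differ.
SOM_PROMPT = (
    "You are operating a computer. The screenshot shows numbered marks on "
    "each interactive element. Respond with an action and a mark number, "
    "e.g. CLICK 12, instead of pixel coordinates."
)

def click_target(mark_id: int, centers: dict[int, tuple[int, int]]) -> tuple[int, int]:
    """Map the model's chosen mark ID back to the pixel to click."""
    return centers[mark_id]
```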

joshbickett commented 8 months ago

@Daisuke134 @0xdevalias Set-of-Mark prompting is now available. Swap in your best.pt from a YOLOv8 model and see how it performs!
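For anyone trying it out, loading a custom YOLOv8 checkpoint with the ultralytics package typically looks like this (a sketch; the path where the framework expects best.pt may differ):

```python
# Sketch: load custom YOLOv8 weights and detect UI elements in a screenshot
# (assumes the ultralytics package; the framework's own wiring may differ).
from ultralytics import YOLO

model = YOLO("best.pt")            # your custom-trained checkpoint
results = model("screenshot.png")  # run detection on a screenshot
boxes = results[0].boxes.xyxy.tolist()  # one [x1, y1, x2, y2] per element
```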