OthersideAI / self-operating-computer

A framework to enable multimodal models to operate a computer.
https://www.hyperwriteai.com/self-operating-computer
MIT License
8.56k stars 1.13k forks

[FEATURE] Mouse position calibration #146

Open osama-salah opened 7 months ago

osama-salah commented 7 months ago

A known issue is that the detected position of the mouse is not accurate. As a workaround, could it be calibrated? A screenshot could be captured, the mouse pointer detected in it, and its position calculated. The mouse would then be moved to a different position, and the process repeated until the position accuracy improves.
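The calibration loop proposed above could be sketched roughly as follows. This is a minimal illustration, not code from the project: `calibrate`, `fit_axis`, `move_to`, and `detect_pointer` are hypothetical names, and it assumes the error between the commanded and detected positions is a per-axis scale plus offset, which may not hold in practice.

```python
from typing import Callable, Iterable, List, Tuple

Point = Tuple[float, float]


def fit_axis(commanded: List[float], detected: List[float]) -> Tuple[float, float]:
    """Least-squares fit of detected ~= scale * commanded + offset for one axis."""
    n = len(commanded)
    mean_c = sum(commanded) / n
    mean_d = sum(detected) / n
    var_c = sum((c - mean_c) ** 2 for c in commanded)
    if var_c == 0:
        raise ValueError("need at least two distinct commanded positions per axis")
    cov = sum((c - mean_c) * (d - mean_d) for c, d in zip(commanded, detected))
    scale = cov / var_c
    if scale == 0:
        raise ValueError("detected positions do not vary with commanded positions")
    return scale, mean_d - scale * mean_c


def calibrate(
    move_to: Callable[[float, float], None],   # hypothetical: moves the real pointer
    detect_pointer: Callable[[], Point],       # hypothetical: finds it in a screenshot
    sample_points: Iterable[Point],
) -> Callable[[Point], Point]:
    """Move to each sample point, record where the pointer is detected,
    and return a function mapping detected coordinates to real ones."""
    commanded: List[Point] = []
    detected: List[Point] = []
    for x, y in sample_points:
        move_to(x, y)
        commanded.append((x, y))
        detected.append(detect_pointer())

    sx, ox = fit_axis([p[0] for p in commanded], [p[0] for p in detected])
    sy, oy = fit_axis([p[1] for p in commanded], [p[1] for p in detected])

    def correct(point: Point) -> Point:
        # Invert the fitted mapping to recover the real screen coordinate.
        return ((point[0] - ox) / sx, (point[1] - oy) / sy)

    return correct
```

In a real setup, `move_to` might be something like `pyautogui.moveTo`, while `detect_pointer` would have to capture a screenshot and ask the model (or an image-matching routine) where the pointer is; that part is left open here.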

joshbickett commented 7 months ago

@osama-salah have you tried operate -m gpt-4-with-ocr? With the OCR approach, the click X and Y coordinates are now spot on for whatever GPT-4-v decides to click.

osama-salah commented 7 months ago

@joshbickett I use Gemini-pro-vision, as I don't have a ChatGPT Plus subscription.

mrkhalil6 commented 7 months ago

@joshbickett I am on Windows 10. I have tried both "operate -m gpt-4-with-ocr" and plain "operate", but in neither case could it click on the exact spot. Is there a specific screen resolution I should use?

joshbickett commented 7 months ago

@osama-salah oh ok. We could add Gemini with OCR, because the OCR component is open source and doesn't require a key. If someone could make a PR for that, it'd be great!

joshbickett commented 7 months ago

@mrkhalil6 if the button or link to click doesn't have text, then it will likely fail. Was it "missing" the button, or did it just not know what to click?
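To illustrate why elements without text are a problem, here is a rough sketch of how a text-based click point might be derived. It is not the project's actual code: `click_point_for_text` is a hypothetical name, and the input is assumed to be a list of (bounding box, text, confidence) tuples, the shape that OCR libraries such as EasyOCR return.

```python
from typing import List, Optional, Sequence, Tuple

Box = Sequence[Sequence[float]]        # four (x, y) corner points
OcrResult = Tuple[Box, str, float]     # (bounding box, text, confidence)


def click_point_for_text(
    results: List[OcrResult], target: str
) -> Optional[Tuple[float, float]]:
    """Return the center of the most confident OCR box whose text contains
    `target` (case-insensitive), or None if no box matches."""
    needle = target.strip().lower()
    matches = [r for r in results if needle in r[1].lower()]
    if not matches:
        # An icon-only button produces no OCR text, so there is nothing to match.
        return None
    box, _text, _conf = max(matches, key=lambda r: r[2])
    xs = [p[0] for p in box]
    ys = [p[1] for p in box]
    return ((min(xs) + max(xs)) / 2, (min(ys) + max(ys)) / 2)
```

When the model names a label that OCR found on screen, the click lands in the middle of that label's box; when the target has no text, the lookup returns nothing and the click has to fall back to a less precise estimate.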