OthersideAI / self-operating-computer

A framework to enable multimodal models to operate a computer.
https://www.hyperwriteai.com/self-operating-computer
MIT License
8.68k stars 1.15k forks source link

[Need Help] Issues Encountered with gpt4o When Using the Method of Marking Images #200

Open tears743 opened 3 months ago

tears743 commented 3 months ago

Hi, thank you for your open-source efforts, this repository is fantastic!

I am currently using OCR + Segment Anything along with some simple algorithms to mark screenshots. Here is the marked image label_sam_ocr_20240703-110228, and the marking effect looks quite good.

I have been running the entire operation using the langchain agent + gpt4o method, and I am encountering some issues that you may have faced before. I am not sure if there are any good methods to handle these, or if there are any possible causes that would be helpful to know.

  1. The accuracy of gpt4o in locating the corresponding labels is very low (I am not sure if this is related to the prompt or the invocation of gpt4o). With the same image and prompt, gpt4o often locates the wrong mark, occasionally it is correct. I have modified many versions of the prompt, but there is basically no prompt that can effectively improve this situation, and this issue is almost driving me crazy.

  2. Building on the above issue, in multiple rounds of dialogue, even though the prompt emphasizes the content of "check whether the last operation was successful based on the screenshot," and the image passed is also the latest screenshot, gpt4o seems to be delusional and does not really check, but often hallucinates that the operation has been successful.

The above two issues are almost driving me crazy, I don't know what to do next to improve the above two issues. I don't know if you have encountered the above issues, and if there are any ideas?