OpenAdaptAI / OpenAdapt

Open Source Generative Process Automation (i.e. Generative RPA). AI-First Process Automation with Large ([Language (LLMs) / Action (LAMs) / Multimodal (LMMs)] / Visual Language (VLMs)) Models
https://www.OpenAdapt.AI
MIT License
883 stars, 115 forks

Implement vanilla baseline #702

Closed: abrichr closed 4 months ago

abrichr commented 4 months ago

Addresses https://github.com/OpenAdaptAI/OpenAdapt/issues/700:

Send in a series of screenshots to GPT-4 and then ask GPT-4 to describe what happened. Then give it the sequence of actions (in concrete coordinates and keyboard inputs), as well as your proposed modification in natural language. Ask it to output the new action sequence. ... If AGI or GPT6 happens, this script should be able to suddenly do the work.

@LunjunZhang thank you for your suggestions! 🙏

abrichr commented 4 months ago

Send in a series of screenshots to GPT-4 and then ask GPT-4 to describe what happened. Then give it the sequence of actions (in concrete coordinates and keyboard inputs), as well as your proposed modification in natural language.

According to this description, the model would have no knowledge of the current state. In effect it would be predicting what the future state will be, and we would then replay its predictions verbatim.

If you already have state representation, maybe add [the current state to the prompt at every time step] --LunjunZhang
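
A minimal sketch of that suggestion, assuming the prompt is rebuilt before every step from the recording description, the requested modification, the actions replayed so far, and a text representation of the current state. The function `build_step_prompt` and all of its parameter names are hypothetical and are not taken from the OpenAdapt code base; the history is serialized with `json.dumps` here, which may differ from what the real prompt does.

```python
import json


def build_step_prompt(
    replay_instructions: str,
    recording_description: str,
    action_history: list[dict],
    current_state: str,
) -> str:
    """Assemble one prompt per time step (hypothetical helper, not OpenAdapt API)."""
    # Serialize the actions that have already been replayed.
    history_json = json.dumps(action_history, indent=2)
    sections = [
        f"Recording description:\n{recording_description}",
        f"Requested modification:\n{replay_instructions}",
        "Consider the actions you've produced (and we have replayed) so far:\n"
        + history_json,
        # Including the current state means the model does not have to
        # predict what the screen looks like after its own actions.
        f"Current state:\n{current_state}",
        "Respond with the next action as JSON.",
    ]
    return "\n\n".join(sections)


if __name__ == "__main__":
    history = [{"name": "type", "text": "<cmd>-<space>"}]
    print(build_step_prompt("calculate 9-8", "(description)", history, "(state)"))
```

Calling this once per step, with a freshly captured state each time, is what separates it from predicting the whole action sequence up front.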

abrichr commented 4 months ago

Interesting failure mode:

"Consider the actions you've produced (and "
'we have replayed) so far:\n'
'\n'
'```json\n'
"[{'name': 'type', 'text': "
"'<cmd>-<space>', 'canonical_text': ''}, "
"{'name': 'type', 'text': "
"'c-a-l-c-u-l-a-t-o-r', 'canonical_text': "
"''}, {'name': 'type', 'text': '<enter>', "
"'canonical_text': ''}, {'name': 'type', "
"'text': '9-8', 'canonical_text': ''}, "
"{'name': 'type', 'text': '-', "
"'canonical_text': ''}, {'name': 'type', "
"'text': '<enter>', 'canonical_text': ''}, "
"{'name': 'type', 'text': '8', "
"'canonical_text': ''}, {'name': 'type', "
"'text': '=', 'canonical_text': ''}, "
"{'name': 'type', 'text': '8', "
"'canonical_text': ''}, {'name': 'type', "
"'text': '<-m->-a', 'canonical_text': "
"''}]\n"
'```\n'

  File "/Users/abrichr/oa/OpenAdapt/openadapt/playback.py", line 69, in play_key_event
    keyboard_controller.press(key)
    |                   |     -> None
    |                   -> <function Controller.press at 0x106023490>
    -> <oa_pynput.keyboard._darwin.Controller object at 0x2d196b880>

  File "/Users/abrichr/Library/Caches/pypoetry/virtualenvs/openadapt-VBXg4jpm-py3.10/lib/python3.10/site-packages/oa_pynput/keyboard/_base.py", line 370, in press
    raise self.InvalidKeyException(key)
          |    |                   -> None
          |    -> <class 'oa_pynput.keyboard._base.Controller.InvalidKeyException'>
          -> <oa_pynput.keyboard._darwin.Controller object at 0x2d196b880>

oa_pynput.keyboard._base.Controller.InvalidKeyException: None

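
One way to surface this earlier would be to validate the model's `text` field before any key is pressed. The sketch below assumes the convention visible in the logs above: keys are joined by `-`, and named keys are wrapped in angle brackets (`<cmd>`, `<enter>`). The function `parse_key_text`, the separator constant, and the allowed name pattern are all hypothetical; the actual parsing in `openadapt/playback.py` may work differently.

```python
import re

SEPARATOR = "-"  # assumed key separator, as in 'c-a-l-c-u-l-a-t-o-r'
NAMED_KEY = re.compile(r"<([a-z0-9_]+)>")  # assumed form of named keys, e.g. <cmd>


def parse_key_text(text: str) -> list[str]:
    """Split a key string into tokens, raising ValueError if it is malformed."""
    if not text:
        raise ValueError("empty key text")
    tokens = []
    pos = 0
    while pos < len(text):
        match = NAMED_KEY.match(text, pos)
        if match:
            # A well-formed named key such as <enter>.
            tokens.append(match.group(0))
            pos = match.end()
        elif text[pos] == "<" and ">" in text[pos:]:
            # Looks like a named key but the name is invalid, e.g. '<-m->'.
            raise ValueError(f"malformed named key in {text!r} at index {pos}")
        else:
            # Any other single character is taken as a literal key.
            tokens.append(text[pos])
            pos += 1
        if pos < len(text):
            if text[pos] != SEPARATOR:
                raise ValueError(f"expected {SEPARATOR!r} in {text!r} at index {pos}")
            pos += 1
            if pos == len(text):
                raise ValueError(f"trailing separator in {text!r}")
    return tokens


if __name__ == "__main__":
    for candidate in ["<cmd>-<space>", "9-8", "-", "<-m->-a"]:
        try:
            print(candidate, "->", parse_key_text(candidate))
        except ValueError as exc:
            print(candidate, "-> rejected:", exc)
```

Under this assumed convention, `'<-m->-a'` is rejected with a descriptive error instead of reaching the keyboard controller as `None`. Note also that `'9-8'` parses as the two keys `9` and `8`, not as "nine minus eight", which may explain why the model followed it with a separate `'-'` action.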
abrichr commented 4 months ago

In general, prompting the model with natural language instructions to modify the recording fails.

Recording (with non-keyframes discarded by default to improve write performance; download and open with VLC to watch):

https://github.com/OpenAdaptAI/OpenAdapt/assets/774615/c3101fe6-143d-47cc-bd49-2504db1d4b38

Recording description:

('The user initiates the sequence by activating Spotlight Search using the '
 'shortcut command (Command+Space). They then typed "calculator" to search for '
 'the Calculator application on a MacOS system, and confirmed the selection by '
 'pressing the Enter key, which launched the application.\n'
 '\n'
 'Subsequent actions involve a series of mouse movements and clicks within the '
 'Calculator app:\n'
 '\n'
 '1. The user moved the mouse cursor to coordinates (927.671875, 188.546875) '
 "and clicked there. The first mouse click doesn't specify any operation in "
 'the calculator, so its purpose seems like setting focus or a misclick.\n'
 '\n'
 '2. The next movement was to coordinates (974.6875, 338.75), followed by a '
 'mouse click. These actions correspond to clicking the number "6" on the '
 'calculator.\n'
 '\n'
 '3. The cursor was then moved to coordinates (1086.08984375, 236.22265625) '
 'and clicked, corresponding to the operator "+" in the calculator.\n'
 '\n'
 '4. Next, the mouse moved to coordinates (1046.96484375, 320.12109375) and '
 'clicked, likely pressing the number "3" on the calculator.\n'
 '\n'
 '5. Finally, the cursor was relocated to (1078.22265625, 379.83203125) and '
 'clicked, which triggers the "=" operation, producing the output "9" on the '
 'calculator display as shown in the final screenshots.\n'
 '\n'
 'Each action is verified by corresponding screenshots that also include '
 'temporal stamps, proving a structured and deliberate sequence of tasks, '
 'resulting in performing a basic addition operation of "6 + 3" to get a '
 'result "9" on the calculator.\n')

VanillaReplayStrategy:

python -m openadapt.replay VanillaReplayStrategy --replay_instructions "calculate 9-8" --record
...
2024-06-04 17:07:28.342 | INFO     | openadapt.strategies.vanilla:__del__:104 - action_history=
[{'canonical_text': '', 'name': 'type', 'text': '<cmd>-<space>'},
 {'canonical_text': '', 'name': 'type', 'text': 'c-a-l-c-u-l-a-t-o-r'},
 {'canonical_text': '', 'name': 'type', 'text': '<enter>'},
 {'canonical_text': '', 'name': 'type', 'text': '<enter>'},
 {'canonical_text': '', 'name': 'type', 'text': '9-8'}]

https://github.com/OpenAdaptAI/OpenAdapt/assets/774615/cd8ab50c-2a90-4c0f-98fe-080b58bc3271