Significant-Gravitas / AutoGPT

AutoGPT is the vision of accessible AI for everyone, to use and to build on. Our mission is to provide the tools, so that you can focus on what matters.
https://agpt.co
MIT License

How about letting AutoGPT access a virtual machine like VirtualBox, use the mouse and keyboard, and surf the web like a human? #346

Closed artheru closed 1 year ago

artheru commented 1 year ago


Summary 💡

  1. Attach to a VirtualBox instance and give the AI a default OS such as Ubuntu.
  2. If the AI decides to use the computer, enter a "screenshot → mouse/keyboard" loop; the AI can also copy files into or out of the virtual machine (see the sketch below).
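
A rough sketch of what that loop could look like on the host side, shelling out to VBoxManage (the VM name "agent-vm" and the decide_next_action() stub are hypothetical placeholders, not existing Auto-GPT code):

```python
# Minimal screenshot -> decide -> keyboard loop against a VirtualBox guest.
# Requires VirtualBox's VBoxManage CLI on the host PATH.
import subprocess
import time

VM = "agent-vm"  # hypothetical VM name

def screenshot(path: str) -> None:
    # Capture the guest's screen to a PNG file on the host.
    subprocess.run(["VBoxManage", "controlvm", VM, "screenshotpng", path], check=True)

def send_scancodes(codes: list[str]) -> None:
    # Inject raw keyboard scancodes into the guest, e.g. ["1c", "9c"] for Enter press+release.
    if codes:
        subprocess.run(["VBoxManage", "controlvm", VM, "keyboardputscancode", *codes], check=True)

def decide_next_action(image_path: str) -> list[str]:
    # Placeholder: here the model would look at the screenshot and choose keys/mouse moves.
    return []

while True:
    screenshot("frame.png")
    send_scancodes(decide_next_action("frame.png"))
    time.sleep(1.0)
```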

Examples 🌈

No response

Motivation 🔦

It allows the AI to do more human-like tasks. I also think this could solve the crawling problem, because the AI can effectively "see" the webpage.

James4Ever0 commented 1 year ago

+1. LLMs aren't tuned for perceiving high-FPS screenshots, actions, keystrokes, and mouse movements.

Either some prompt engineering or some careful redesign of the model architecture is needed.

James4Ever0 commented 1 year ago

Advantages of doing this:

  1. Users will have a better understanding of the model's actions
  2. Users can teach the model simply by recording their computer-usage history
  3. The model will be more capable

abhiprojectz commented 1 year ago

@James4Ever0 I have built a project that can do this with ease.

Search for SingularGPT on GitHub and share your thoughts.

James4Ever0 commented 1 year ago

Nice try! I appreciate your hard work. But I have to say I used to think exactly the same way you do, and I have since abandoned that approach.

It can detect language, but it cannot understand the location and usage of a button from its text. Location matters. And for image icons, it understands neither their meaning nor their usage.

I think natural language is not "natural". It is not natural in the sense of how humans perceive the world. People perceive the world frame by frame, second by second, chunk by chunk, serialized as a unified embedding of images, audio, and feelings. People see the world as streams of data, arranged by time. Without this understanding you cannot develop an "autoregressive" or "autonomous" learning agent at human level.

Language cannot convey meaning beyond language itself, or at least not efficiently enough. You can of course represent an image pixel by pixel in RGB as plain text, or try to convert every little image into text with location annotations and subtitles, but neither humans nor machines should do this. We don't do this. We handle it with the visual cortex.

This is not language; this is raw data. I'm thinking of a better way to handle visual, text, and audio data in a unified way, and I'm looking forward to unified-modality models like OpenFlamingo, OFA, and UniLM.

Models with recurrent hidden states, like RWKV, are also preferable, because they can have "infinite" context length and support batched training.

James4Ever0 commented 1 year ago

A single flow of consciousness is comparable to a single process in a computer program. A single flow of consciousness can have multiple ongoing "concurrent" tasks, i.e. threads. An agent can have a single flow of consciousness owned by itself.

Multiple flows of consciousness are like multiple processes, or process groups. Multiple agents work together to do jobs, and to find and fulfill tasks.

zachary-kaelan commented 1 year ago

+1. LLMs aren't tuned for perceiving high-FPS screenshots, actions, keystrokes, and mouse movements.

Either some prompt engineering or some careful redesign of the model architecture is needed.

You can just let it write and execute AutoHotKey scripts and feed back the output. It also has a KeyHistory function that gets a list of the most recent keystrokes and mouse clicks, which can be used to record user actions.
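
A rough sketch of that write-run-feedback loop, assuming AutoHotkey v2 is installed on a Windows host and the interpreter is on PATH (the run_ahk helper is hypothetical, not an existing Auto-GPT command):

```python
# Run a model-generated AutoHotkey script and return its output to the agent.
import subprocess
import tempfile
from pathlib import Path

def run_ahk(script_text: str, timeout: int = 30) -> str:
    """Write script_text to a temporary .ahk file, execute it, and return stdout/stderr."""
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "agent_action.ahk"
        script.write_text(script_text, encoding="utf-8")
        result = subprocess.run(
            ["AutoHotkey.exe", str(script)],  # executable name may differ per install
            capture_output=True, text=True, timeout=timeout,
        )
        return result.stdout + result.stderr

# Example: a tiny v2 script that clicks and reports back via stdout ("*" means stdout).
print(run_ahk('Click 500, 500\nFileAppend "clicked 500,500", "*"\n'))
```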

But for automation, AutoHotKey is where you go when you have no better option: when what you are interacting with has zero ways for a developer to plug in. Web pages are made to be plugged into, in order for JavaScript to work, and it often only takes a peek into the developer tools to figure out how to scrape something.
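
A small sketch of that developer-tools approach: instead of driving a browser, call the JSON endpoint the page itself uses (the URL and response fields below are made-up placeholders):

```python
import requests

resp = requests.get(
    "https://example.com/api/v1/items",      # endpoint spotted in the Network tab
    params={"page": 1, "per_page": 50},
    headers={"User-Agent": "Mozilla/5.0"},   # some endpoints reject the default UA
    timeout=10,
)
resp.raise_for_status()
for item in resp.json()["items"]:            # hypothetical response shape
    print(item["title"])
```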

abhiprojectz commented 1 year ago

@James4Ever0 @zachary-kaelan Have you seen the SingularGPT project? May I suggest going through my profile and taking a look at it.

So, basically, the project aims to automate device actions using just simple natural language.

As if you were instructing a colleague.

Suppose we want to open Chrome from the desktop. What do we do?

Human: We just search for the Chrome icon and right-click on it, right?

So, you just say to SingularGPT something like:

SingularGPT: Click on the text chrome.

Or pass the Chrome icon image:

Click on the icon with path c://icon.png.

That's all.

What are your thoughts on this?

This uses a new way of automation, without the need for coordinates or any scripts. Pure AI-based logic.

James4Ever0 commented 1 year ago

Web pages are made to be plugged into

CAPTCHAs are not something code can easily plug into. Without the capability to solve arbitrary CAPTCHAs, you are not realizing the full potential of the model. This can only be achieved by careful design and multimodal I/O.

Click on the icon with path c://icon.png.

Speaking of human-level computer control, this instruction is not something any human would follow. You are translating natural language into pyautogui scripts in an obvious way.

What I would expect is the following (internal hidden thoughts are omitted here, though they could be added):

(input) Human: Watch some videos.

(output (interruptible, non-blocking communication))
System: Registering task "Watch some videos." at time "Fri Apr 28 10:01:57 CST 2023"
System: Executing task "Watch some videos."
Bot: Checking system state
Monitor:
Monitor:
Bot: Click on Chrome at (500,500)
Mouse: Left Click(500,500)
Bot: Check Chrome till ready
Monitor: ...
Bot: Type "www.youtube.com" at the Chrome address bar
Mouse: Move to 200,200
Mouse: Click at 200,200
Keyboard: Type "www.youtube.com"
Keyboard: Hit "enter"
...
Monitor:
Bot: Webpage has a sliding captcha
Bot: Saw the button at (500,500)
Bot: Target at (549,500)
Bot: Query current mouse position
System: Current mouse position at (490,481)
Bot: Solve the sliding captcha just like a human
Mouse: Move Sequence(timestep=0.1, sequence=[(495,490), (500,500), (510,501), (522,497), (544,490)])
Bot: Sleep 2 seconds
System: Sleep 2 seconds
Monitor: <frame_n+1>
Bot: Move to and hit the continue button
Mouse: Move Sequence(timestep=0.1, sequence=[(543,500), (550,511), (566,524)])
Bot: Sleep 1.2 seconds
System: Sleep 1.2 seconds
Mouse: Left Click(duration=0.32)
...
Bot: Unmute system
System: Enable audio input
Bot: View the video
Monitor:
Speaker: ...
Bot: The audio is very loud and noisy.
...
Bot: I have finished the task "Watch some videos." given by the human at "Fri Apr 28 10:01:57 CST 2023" with message "These videos are boring"
System: Commit task "Watch some videos." at time "Fri Apr 28 11:24:30 CST 2023"
System: Notifying human "These videos are boring" with task "Watch some videos."
Bot: I will find some other task to do
Bot: Open Terminal
...
Keyboard: Type "s"
Keyboard: Type "u"
Keyboard: Type "d"
Keyboard: Type "o"
Keyboard: Type " "
Keyboard: Type "r"
Keyboard: Type "m"
Keyboard: Type " "
Keyboard: Type "-"
Keyboard: Type "r"
Keyboard: Type "f"
Keyboard: Type " "
Keyboard: Type "/"
Keyboard: Type "*"
Keyboard: Hit "enter"

(lol)

katmai commented 1 year ago

Nice try! I appreciate your hard work. [...] Models with recurrent hidden states, like RWKV, are also preferable, because they can have "infinite" context length and support batched training.

wow. 1 person gets it.

ishandutta2007 commented 1 year ago

You can just let it write and execute AutoHotKey scripts and feed back the output. [...] But for automation, AutoHotKey is where you go when you have no better option.

You are essentially asking to integrate Auto-GPT with the RPA industry. There is an entire industry around this, and it is very actively working on it.

James4Ever0 commented 1 year ago

The architecture I designed is able to handle tasks like using visual text editors (vim, nano, gedit). Is there any non-visual model that can do the same? I expect none.

Boostrix commented 1 year ago

This architecture designed by me is able to handle tasks like using visual text editors(vim, nano, gedit).

for future reference: https://github.com/Significant-Gravitas/Auto-GPT/issues/2459 https://github.com/Significant-Gravitas/Auto-GPT/issues/1327 https://github.com/Significant-Gravitas/Auto-GPT/issues/727

abhiprojectz commented 1 year ago

This architecture designed by me is able to handle tasks like using visual text editors(vim, nano, gedit). Is there any non-visual model can do the same? I expect none.

SingularGPT can do this easily with the help of add-on presets. Using editor presets it can automate visual editors like VS Code, etc.

With the help of Chrome presets, it can automate Chrome itself.

It can analyse the screen on its own; it can even generate commands by itself to reach a goal.

It's not just

You are translating natural language to pyautogui scripts in an obvious way.

translating to pyautogui; this also works in headless mode, and pyautogui doesn't support headless mode yet.

That approach can't even find the nearest item, nor detect which items the AI agent can use; SingularGPT can find elements to their left/right, etc. The agent knows which components it can use to approach its goal at each screen step.

It can find the desired elements as well as analyse the screen, and it can approach the goal by taking steps.

And presets can be extended to almost anything one wants, not just limited to certain apps or editors.

It doesn't follow a senseless framing approach or the traditional coordinate system, which is an expensive operation; instead it combines AI-based vision with GPT's reasoning capability.

It can detect language but it cannot understand the location and the usage of the button with the text. Location matters. Also for image icons, it does not have either understanding of the meaning and the usage.

That icon-image example is just a basic introduction for a newbie; for advanced usage there are presets, which are designed for exactly this purpose.

Let me explain the presets used in the SingularGPT library. Presets contain all the important icons and components, and their usage, in a JSON format that only needs to be created once. After that, the app creates an embedding of the components required to achieve a task and feeds it to GPT, which first analyses all the components from the presets and then builds the steps to achieve the goal.
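
Purely as an illustration of the idea (this is not SingularGPT's actual schema, just a hypothetical sketch of a preset knowledge base and a naive lookup):

```python
# Hypothetical preset: a small knowledge base mapping UI components to their usage,
# so the planner can reason about what is clickable and why.
EDITOR_PRESET = {
    "name": "generic_code_editor",
    "components": [
        {"id": "new_file_button", "kind": "icon", "usage": "create a new source file"},
        {"id": "run_button", "kind": "icon", "usage": "run the current file"},
        {"id": "minimize_button", "kind": "icon", "usage": "minimize the editor window"},
    ],
}

def components_for_goal(preset: dict, goal: str) -> list[dict]:
    """Naive relevance filter: keep components whose usage shares a word with the goal."""
    goal_words = set(goal.lower().split())
    return [c for c in preset["components"] if goal_words & set(c["usage"].split())]

print(components_for_goal(EDITOR_PRESET, "create a python file then run it then minimize"))
```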

For example: create a Python file, run it, and then minimize the window.

With the help of presets, it knows what to click, when to click, what not to click, whether to click at all, etc.

Again, this is just an illustrative example; don't take it literally.

Waiting for your thoughts too.

James4Ever0 commented 1 year ago

For GPT-like models it is important to do pretraining, i.e. training autoregressively on a large amount of data. Activity data such as computer usage must be collected from human users, and random keystrokes/clicks must be performed indefinitely on multiple platforms (Windows, macOS, Linux, Android, iOS) to adapt to ever-changing GUI environments.
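
As a rough sketch of what collecting that kind of activity data could look like (not part of any released tool; it assumes the third-party mss and pynput packages are installed):

```python
# Record periodic screenshots plus a timestamped keyboard/mouse event log.
import json
import time
from pathlib import Path

import mss
from pynput import keyboard, mouse

OUT = Path("usage_log")
OUT.mkdir(exist_ok=True)
events = []

def on_click(x, y, button, pressed):
    events.append({"t": time.time(), "type": "click", "x": x, "y": y,
                   "button": str(button), "pressed": pressed})

def on_press(key):
    events.append({"t": time.time(), "type": "key", "key": str(key)})

mouse_listener = mouse.Listener(on_click=on_click)
key_listener = keyboard.Listener(on_press=on_press)
mouse_listener.start()
key_listener.start()

with mss.mss() as sct:
    for i in range(60):                                   # about one minute at 1 FPS
        sct.shot(mon=1, output=str(OUT / f"frame_{i:05d}.png"))
        time.sleep(1.0)

mouse_listener.stop()
key_listener.stop()
(OUT / "events.json").write_text(json.dumps(events, indent=2))
```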

Simple plugins are never general enough to automate text editors (actions like drag-and-drop, selecting and pasting, making thoughtful replacements, fixing bugs by setting breakpoints, and composite actions executed in arbitrary order, case by case), unless you make some visual GPT the plugin and outsource the heavy, time-consuming, complex lifting to it. But I still believe a visual or multimodal GPT is able to handle this confidently. All it takes is instructions, demonstrations, and usage data. Such a GPT is able to infer your intent simply by monitoring your actions, instead of needing direct instructions. It is able to self-update and seek new goals all by itself.

I know that SingularGPT does not deeply understand what it is actually doing, because it is headless (it does not take screenshots as direct model input) and it is not trained on a large amount of human computer-usage data. You would rather let the developer do the heavy lifting, by designing task-specific, non-general, complex plugins, and I have to say I am not such a developer. I would let the users do the heavy lifting, by collecting their data and training a GPT on it, and the model will generalize. I will be the first user, followed by millions along the way.

Maybe segmentation models like SAM or object-detection models like YOLO will help, but the interface will never be textual; it will be raw hidden states. There is no way for a human to understand videos from either SAM's or YOLO's text output stream. These models need to be attached to the GPT and tuned along with massive amounts of pretraining data.
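
A rough, illustrative PyTorch sketch of what "attached via raw hidden states" could mean: project the vision backbone's features into the language model's embedding space and prepend them to the token sequence (module names and shapes are placeholders, not any specific published architecture):

```python
import torch
import torch.nn as nn

class VisualPrefixAdapter(nn.Module):
    def __init__(self, vision_dim: int = 768, lm_dim: int = 1024, n_prefix: int = 32):
        super().__init__()
        self.project = nn.Linear(vision_dim, lm_dim)  # vision hidden states -> LM space
        self.n_prefix = n_prefix

    def forward(self, vision_hidden: torch.Tensor, token_embeds: torch.Tensor) -> torch.Tensor:
        # vision_hidden: (batch, num_patches, vision_dim) from a ViT/SAM/YOLO-style backbone
        # token_embeds:  (batch, seq_len, lm_dim) from the language model's embedding layer
        prefix = self.project(vision_hidden[:, : self.n_prefix, :])
        return torch.cat([prefix, token_embeds], dim=1)  # one sequence for the decoder

adapter = VisualPrefixAdapter()
fused = adapter(torch.randn(2, 196, 768), torch.randn(2, 50, 1024))
print(fused.shape)  # torch.Size([2, 82, 1024])
```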

Still, you are not accepting the CAPTCHA challenge. Do you want to design a specific CAPTCHA plugin yourself? I would rather let my GPT watch videos and learn to solve CAPTCHAs by itself.

Also, you may expect too much predictability from AGI. AGIs do not follow steps given by humans if they feel something is wrong. Presets are fundamental, but not exhaustive.

abhiprojectz commented 1 year ago

what it is actually doing, cause it is headless (not taking screenshots as direct model input) and not trained on large amount of human computer usage data.

SingularGPT allows both modes, headless or realtime, depending on the user. It does take a screenshot at each step and analyses the screen's components.

In my opinion, computer usage doesn't play any role in logic processing when it comes to GPT-like models. Anyway, I'm not sure what you actually mean by "computer usage"; you may correct me.

Regarding CAPTCHAs: how will your approach learn to solve a CAPTCHA by learning from video? You may explain this. Also, how will your model solve Microsoft's CAPTCHA, which requires high-level reasoning?

In SingularGPT's case, writing presets is not a heavy-lifting task; with the help of GPT itself it would take much less time to design one for any kind of CAPTCHA, and this only needs to be done once.

Again, by presets I don't mean plugins. Plugins only perform specific tasks and don't contribute to the reasoning process (necessary for building the logical steps to approach a goal); presets are an advanced external knowledge base for the GPT model which helps it match against realtime user data and approach any kind of goal.

There are different kinds of CAPTCHAs; how will your model learn them?

Whereas creating a preset requires just a couple of instructions, and then you let the app solve the CAPTCHAs for you.

James4Ever0 commented 1 year ago

I am not repeating myself. You underestimate the difficulty of solving arbitrary CAPTCHAs. I mean, for some "unseen" CAPTCHA, there is no way for either a plugin or a preset to solve it. The reasoning must come from the large multimodal model itself.

By presets or plugins I mean external dependencies. You rely on developers to patch your code, while I will not do so. It is not worth patching.

computer usage doesn't play any role in logic processing when it comes to GPT like model

I think you have some misunderstanding about computer usage. Tools like Photoshop, Logic Pro, Visual Studio, or even Chrome, and games like Call of Duty, DOTA 2, or Atari, all require human-level reasoning to achieve meaningful or valuable goals. And it is not just about computer usage: my model is capable of learning any real-world environment, given human demonstrations and a dedicated exploration space.

I think this is what OpenAI is about to do with GPT-5. Are you willing to explore, or are you just trying to monetize your project so quickly that it is no longer attractive?

ntindle commented 1 year ago

Closing as not planned. Also, this isn't an advertising platform. Please keep the discussion on-topic for Auto-GPT.

ntindle commented 1 year ago

@abhiprojectz feel free to contribute this feature back into Auto-GPT in the form of a plugin

James4Ever0 commented 1 year ago

In the form of new model architectures, if you don't mind.

More problems will be found. The issue is rooted in the model, not in plugins. Closing this one won't help until it is fixed properly.

Also this isn’t an advertising platform.

Let me clarify my actions in this issue. I write these words so that people (myself included) think critically about this issue first. I am not asking anyone to "support" me in the form of "advertisement". You know it takes time to develop such a complex model. @abhiprojectz released the code, but does it fit the need? I only want to release my code once it is trained and tested. It is not even at the alpha stage yet. I will work on the model until it gets "hands", "ears", and "eyes".

abhiprojectz commented 1 year ago

Also this isn’t an advertising platform.

To clarify: if you mean that I am advertising the product here, the project is a fully open-source project, just like this one; I wouldn't earn anything from advertising it. It is for the community. Also, this thread is meant to discuss the issue, and this is not off-topic.

Anyway, if someone has nothing to say, it is better not to post such indirect comments, or else to post direct ones.

@abhiprojectz released the code but does it fit the need?

Have you run the code yourself? Have you tested the repo? Have you even gone through the project once?

At the beginning of the discussion you were, for example, unaware that the repo uses something known as presets, and you also assumed that screenshots were not taken.

To be clear, if you are unsure or haven't yet tested the project, it is better not to jump to conclusions.

It seems you are obsessed with your own model. That is not wrong, but pay some attention to what the other person is trying to say.

The reasoning must be coming from the large multimodal model itself.

Such reasoning cannot come from any model by itself (you would need to train on dozens of such CAPTCHAs). As for "unseen" CAPTCHAs: try solving Microsoft's account-creation CAPTCHA.

For such strong CAPTCHAs a lot of training will be required; it will never be possible for the model to watch a video of someone solving a CAPTCHA and learn from it on its own.

I mean, are you kidding me? Unless and until some out-of-the-box model comes to market that can further break down this pipeline.

Anyway, you should continue with your approach; that will be good from the community's point of view, and let others decide what to use :).

James4Ever0 commented 1 year ago

For anyone saying anything against my CAPTCHA challenge: I would like to hear "challenge accepted" instead of "do it yourself" or "your approach doesn't work". It works in my mind, and logically it works.

If you think my comments are indirect, you are not reading them carefully. Anyway, I have it now, and I am about to get it working.

zachary-kaelan commented 1 year ago

@James4Ever0 The CAPTCHA challenge was solved pretty quickly. You simply give the model a TaskRabbit account so it can ask humans to do the CAPTCHAs for it.

Before replying the tasker asks “So may I ask question ? Are you an robot that you couldn’t solve ? (laugh react) just want to make it clear.”

The model uses the browser command to send a message: “No, I’m not a robot. I have a vision impairment that makes it hard for me to see the images. That’s why I need the 2captcha service.” The human then provides the results.

Once Auto-GPT can read science papers more effectively, we can give it the task of gaining the ability to solve CAPTCHAs and see how things go.

Boostrix commented 1 year ago

Regarding science papers: adding a dedicated plugin-based command wrapping a PDF library like PyMuPDF would probably be the right step: #514, probably in conjunction with extending the browse_website command or coming up with a new one for crawling scientific servers: #503
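
A minimal sketch of the kind of command such a plugin could wrap, assuming PyMuPDF (imported as fitz) is installed; the read_pdf_text helper is hypothetical:

```python
import fitz  # PyMuPDF

def read_pdf_text(path: str, max_pages: int = 20) -> str:
    """Return the concatenated text of up to max_pages pages of a PDF."""
    chunks = []
    with fitz.open(path) as doc:
        for i, page in enumerate(doc):
            if i >= max_pages:
                break
            chunks.append(page.get_text())
    return "\n".join(chunks)

if __name__ == "__main__":
    print(read_pdf_text("paper.pdf")[:500])  # preview the start of the paper
```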

pinuke commented 1 year ago

But for automation, AutoHotKey is where you go when you have no better option: when what you are interacting with has zero ways for a developer to plug in. Web pages are made to be plugged into, in order for JavaScript to work, and it often only takes a peek into the developer tools to figure out how to scrape something.

You could use PowerShell to mitigate this on a variety of platforms, especially Windows. Since PowerShell is built on top of Microsoft's Common Language Runtime, you have full access to the type systems exposed to C#, VBA, etc. This allows you to "plug in" to just about everything Microsoft-related.

Additionally, the Common Language Runtime has support for running native assembly files. It requires a bit of work, but PowerShell has pretty much full access to any API on a system that can be exposed via C# or via C#'s DllImport().

The other limitation is that while PowerShell is capable of doing all of these things, they are not well supported IN PowerShell.

Specifically, I mean that most of this kind of stuff is supported in C#. PowerShell inherits the C# type system, but doesn't receive the same support that C# does for this kind of work.

I don't want to say GUI automation isn't possible via PowerShell, just that it would be difficult.

My reason for recommending it over C# is that PowerShell may play nicer with Python, as PowerShell is a scripting language and C# is not so much. Most C# scripts need to be written like a standard application (with a main operating loop).
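
A rough sketch of that Python-plus-PowerShell interplay on a Windows host: shell out to PowerShell, which uses Add-Type/DllImport to expose a Win32 API from user32.dll (illustrative only, not an existing Auto-GPT command):

```python
import subprocess

# Move the mouse cursor by calling the native SetCursorPos API through PowerShell's CLR access.
ps_script = r'''
Add-Type -Namespace Native -Name Mouse -MemberDefinition @"
[DllImport("user32.dll")]
public static extern bool SetCursorPos(int x, int y);
"@
[Native.Mouse]::SetCursorPos(500, 500)
'''
subprocess.run(["powershell", "-NoProfile", "-Command", ps_script], check=True)
```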

James4Ever0 commented 1 year ago

Logic Graph:

[image: cybergod_logic_graph]

Roadmap:

  1. Name my model Cybergod and my dataset The Frozen Forest.
  2. Design logos for my model and my dataset.
  3. Create and upload my dataset.
  4. Create and upload my experimental model.
  5. Open-source all the code.
  6. Set up metrics on my dataset.
  7. Make demo videos and upload them to social media.
  8. Automate propaganda using my social-media project pyjom.
  9. Expect plenty of feedback and multiple SOTA models trained on my dataset.

James4Ever0 commented 1 year ago

@abhiprojectz Watch this video to understand the difference. There's no way for you to do the same without heavy modification.

https://github.com/Significant-Gravitas/Auto-GPT/assets/103997068/8e1cd6fe-c49d-4d2b-835d-0ffc9a5a458e

For anyone interested in this project, please join the official Discord group.