waynehamadi opened 10 months ago
So far the solution has been to give secrets to the agent. Once we handle more sensitive information, that's going to become a problem.
Question: what do you consider to be the agent? Is it the entire application, or just the part involving LLMs to perform logic?
Taking Auto-GPT as an example, the secrets are never shared with the agent. The agent proposes an action, and on execution, any necessary secrets are provided to the action from the application's configuration:
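To make the workflow concrete, here is a minimal sketch of that secret-injection pattern. This is not Auto-GPT's actual code; the action registry, the `EMAIL_API_KEY` variable, and the `send_email` helper are all illustrative. The key point is that the proposal dict the agent produces contains no credentials; the application adds them only at execution time.

```python
import os

# Hypothetical action implementation. The api_key is supplied by the
# application at execution time, never by the agent.
def send_email(to: str, body: str, *, api_key: str) -> str:
    return f"sent to {to} using key ending in ...{api_key[-4:]}"

ACTIONS = {"send_email": send_email}

def execute(proposal: dict) -> str:
    """The agent's proposal contains no secrets; we inject them here."""
    action = ACTIONS[proposal["name"]]
    # Secrets come from the application's configuration (env var here).
    secrets = {"api_key": os.environ.get("EMAIL_API_KEY", "dummy-key-1234")}
    return action(**proposal["args"], **secrets)

# The agent proposed this dict; note the absence of any credential.
proposal = {"name": "send_email", "args": {"to": "a@b.com", "body": "hi"}}
print(execute(proposal))
```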
Question: is this workflow problematic? And why?
I think the answer would be:
The proposal above is essentially to take the grey blob + authentication in this picture out of the agent application's domain:
This is an indirect solution for the problem at hand, and this solution introduces its own complications and limitations to developing an agent, so we should explore other solutions first.
The crux of the problem is that at some point the user has to authenticate and/or authorize actions on their behalf. The Agent Protocol does not provide a mechanism for this.
Example: OAuth2 could be helpful, but this would require a hosted service which passes the obtained credentials to the application running the agent. This would make it somewhat more complicated for locally running applications.
If we assume that the agent is hosted as a cloud service, this could be a clean solution:
Let's consider pulling execution of actions out of the domain of the agent, as illustrated above.
I don't think there is a one-size-fits-all solution here, because of the variety of possible actions:
Additional considerations:
Some agents may be able to function without category 3 and/or category 2 actions.
Category 3 actions (internal) do not have to be considered for standardization since they are limited to the internal process of the agent.
Outsourcing execution to a remote service would support only category 1 actions.
Outsourcing execution to a local service would support both category 1 and category 2 actions.
Doing everything locally complicates the use of established authentication mechanisms such as OAuth2. Right? (I'm not an expert on this)
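The "where to execute" split above can be sketched as a simple router. The category definitions are my reading of the points above (category 1: actions against remote third-party services; category 2: actions on the user's local machine; category 3: actions internal to the agent's own process), and the executor names are purely illustrative, not part of any spec.

```python
# Assumed categories (inferred from the considerations above):
#   1 - remote third-party service actions (e.g. send an email)
#   2 - local-machine actions (e.g. write a file)
#   3 - internal to the agent's own process
REMOTE_SERVICE = "remote-executor"   # supports only category 1
LOCAL_SERVICE = "local-executor"     # supports categories 1 and 2
AGENT_INTERNAL = "agent"             # category 3 stays in-process

def route(category: int, local_available: bool) -> str:
    """Pick an executor for an action of the given category."""
    if category == 3:
        return AGENT_INTERNAL
    if category == 2:
        if not local_available:
            raise RuntimeError("category 2 actions need a local executor")
        return LOCAL_SERVICE
    # Category 1 can go either way; prefer local when present.
    return LOCAL_SERVICE if local_available else REMOTE_SERVICE

print(route(1, local_available=False))
```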
A few thoughts:
What if we supported the three ideas outlined (as I read them): (1) optional capability sharing, (2) optionally relying on the client to fulfill (and authenticate) specific tasks, and (3) optionally enabling delegated authentication via well-established auth protocols?
Authentication aside, it would be very powerful for the protocol to allow clients to indicate their ability and willingness to handle specific tasks. Clients may even want to demand that they perform particular jobs themselves (CAN vs. PREFER vs. DEMAND).
Borrowing from HTTP: Accept headers, client fingerprinting, etc. enable a lot of helpful user functionality.
Additionally, allowing servers to advertise their capabilities similarly is low-hanging fruit that would expand the protocol in exciting ways, especially in complex systems with multiple agents and skills, where the line between the server (agent), client, and skill gets blurred.
Real-world agents are servers and clients, and one can imagine a chain of agents with the LLM at the end (which some people suggest is also a client with several LLMs behind it). Similarly, skills might be local or remote, and if remote, who is to say there's no agent on the other end?
Crucially, these "advertisements" should be optional, and a server or client doesn't have to support them (except in the "I DEMAND" scenario)
I think this answers the "Where to execute" question because the client can decide to execute if it has the capability (directly or via plugins or orchestration), but by default, it's the server that decides how to fulfill the task with its available skills, whether those skills are local or via some plugin mechanism, or orchestration, or third party services.
TL;DR: what if the protocol introduced X-Client-Capabilities and X-Server-Capabilities headers?
HTTP supports a lot of the functionality that OpenAI functions exhibit, including bi-directional capability exchange via headers such as Accept, and crucially, it supports custom headers, which are often used for this purpose (e.g. X-My-Custom-Header).
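As a sketch of what such a header could carry, here is a tiny parser for a hypothetical X-Client-Capabilities value using the CAN/PREFER/DEMAND idea. The header name, the `;q=` qualifier syntax, and the task names are all assumptions for illustration, not an existing convention.

```python
def parse_capabilities(header: str) -> dict:
    """Parse e.g. 'send_email;q=demand, read_file' into {task: level}.

    A task with no qualifier defaults to the weakest level, 'can'.
    """
    caps = {}
    for item in header.split(","):
        item = item.strip()
        if not item:
            continue
        name, _, qualifier = item.partition(";")
        level = qualifier.removeprefix("q=").strip() if qualifier else "can"
        caps[name.strip()] = level or "can"
    return caps

header = "send_email;q=demand, read_file;q=prefer, browse_web"
caps = parse_capabilities(header)
print(caps)
```

A server receiving this would know it must delegate `send_email` to the client, should prefer delegating `read_file`, and may delegate `browse_web` if convenient.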
Then we've got:
Separately - OAuth2 and SAML are proven technologies for hiding credentials from applications, with lots of drop-in implementations, and can delegate authorization in local scenarios.
The workflow would be much like GitHub Desktop, the gh command line client, or Google's command line tool in local scenarios.
IMO, intermediaries should support the authentication protocol as proxies between the upstream services and the clients. They should only try to be OAuth2 providers (or SAML providers) if they provide the service in question.
Of course, if an upstream server doesn't support OAuth2 or SAML, an intermediary could act as an authentication server, but it will have to contend with gaining the user's trust.
Agent Function Protocol
Motivation
People want to send emails with agents. But how do you send emails on someone's behalf? So far the solution has been to give secrets to the agent. Once we handle more sensitive information, that's going to become a problem.
Imagine you have an agent that gives a secret to another agent to perform a task. At some point you end up with 10 agents reading your secrets, which is just asking for trouble. There is no way anyone will perform sensitive actions with an agent (think about paying for something on Amazon, for example).
That's a shame, because these sensitive actions are also the core of the agentic space: if the agent can only toy around with a local file system, then what's the point?
So how do we actually give the agent the ability to do things on my behalf in my Gmail account, LinkedIn account, or Amazon account? (Even my bank account, let's be crazy.)
Benefits for Agent Builders
As agent builders, how do we send emails on Gmail, for example? Do we all create a method for that? Then we have to make sure our client knows where to put its API key? And any time we need a new action (for example, archiving an email), do we write this method all over again?
And now imagine you want to do things in an Outlook mailbox? Do you also build it there? It might have different ways to authenticate. You pretty much need to build everything in house. And we're all doing this at the moment.
Design Proposal
OK, so instead of performing the action for the client, let's just tell the client what we want to do. In continuous mode, the client will do it automatically without a human in the loop; in manual mode, it will ask the user's permission to continue.
So in REST (and obviously I know we want to support more web protocols, such as GraphQL and WebSocket), we can literally just copy OpenAI functions:
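For illustration, here is a sketch of what such a step response could look like. The envelope fields (`status`, `function_call`) are hypothetical names, not part of the Agent Protocol; only the inner `name`/`arguments` shape mirrors OpenAI's function-calling format, where `arguments` is a JSON-encoded string.

```python
import json

# A step response that carries a proposed action instead of executing it.
step_output = {
    "status": "awaiting_client_action",   # hypothetical field name
    "function_call": {
        "name": "send_email",
        # OpenAI-style: arguments are a JSON string, not a nested object.
        "arguments": json.dumps({
            "to": "alice@example.com",
            "subject": "Meeting notes",
            "body": "Hi Alice, ...",
        }),
    },
}

# The client decodes the arguments and decides whether to execute.
call = step_output["function_call"]
args = json.loads(call["arguments"])
print(call["name"], args["to"])
```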
And then the client decides whether to perform this sensitive action. This assumes clients that are able to do things. This is an opportunity for us to build a Python or JavaScript client specialized in taking actions, and to make it open source.
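The continuous-vs-manual behavior described above can be sketched as a small client-side loop. The function and mode names here are illustrative, not a real client API; real execution is stubbed out by just recording the action name.

```python
# In "continuous" mode, proposed actions run automatically; in "manual"
# mode, each one needs the user's approval first.
def run_client(proposals, mode="manual", ask_user=input):
    executed = []
    for proposal in proposals:
        if mode == "manual":
            answer = ask_user(f"Run {proposal['name']}? [y/N] ")
            if answer.strip().lower() != "y":
                continue  # user declined; skip this action
        executed.append(proposal["name"])  # stand-in for real execution
    return executed

proposals = [{"name": "send_email"}, {"name": "delete_file"}]

# Continuous mode: everything runs without asking.
print(run_client(proposals, mode="continuous"))

# Manual mode with a stubbed user who approves only send_email.
print(run_client(proposals,
                 ask_user=lambda prompt: "y" if "send_email" in prompt else "n"))
```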
We can then pretty much standardize actions.
I know we're going to have a million actions, but that's better than having 10 million people all working on 10 different versions of the same actions for their agents.
Alternatives Considered
Maybe we can give the secrets to the agent and let it do its thing? We just let each agent creator create and maintain all these actions? I think this is pretty hard to do, and on top of that, if an agent holds secrets it could share them with subagents, and then it's a mess.
Compatibility
It's actually backwards compatible.