dglazkov / polymath


Figure out ways to discourage scraping #94

jkomoros commented 1 year ago

Originally explored a bit in #26, but we might want to do something more.

The point of a Polymath endpoint is not "scrape up all my bits of content"; it's "select relevant bits of content for the purpose of doing a polymath query." There's no way to fully prevent the former, but we can make it very, very clear that that is Not What You Should Be Doing: by documenting clearly that scraping is not what a Polymath host is opting into, by using things like robots.txt, and in general by requiring would-be scrapers to crawl through some broken glass, so they might question whether maybe they just shouldn't.
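For example, a host could serve a robots.txt that tells well-behaved crawlers to stay away entirely (a minimal sketch; a host might choose to scope this more narrowly than the whole site):

```
User-agent: *
Disallow: /
```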

There are a number of things we might do, and I might explore them in this issue at some point.

But a few no-brainers:

jkomoros commented 1 year ago

OK, developing a wild, likely over-complex idea that might not be workable. Building on an idea already sketched out a bit in #26.

Conceptually, the architecture of everything that fits into the Polymath universe looks like this:
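Roughly, with arrows showing who talks to whom:

```
UI --query--> mixer --bits request--> host(s) --> library files in libraries/
                |
                +--full prompt--> OpenAI --completion--> back through mixer to UI
```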

The simplest Polymath is a simple CLI working with a local mixer that talks to a host that just loads up all of the library files in libraries/, but it can get considerably more complex, with each part running on a different computer.

OK, let's layer on top of this a scheme that makes scraping hard. (Remember: a clever user can always extract bits of content via a carefully constructed prompt-injection query. Our goal is just to make accidental or faux-accidental scraping harder.)

First, note that what a mixer does is actually pretty simple. It takes a query from the UI, plus a secret OpenAI API key that it either knows or is provided by the end user via the UI; it reaches out to a number of hosts to fetch bits of content, and then formulates a completion request to pass to OpenAI. It's desirable that the mixer not pass the raw bits of content back up to the UI, since the UI might be a scraper. The bits of content come in from the hosts, get remixed into a full prompt, go to OpenAI, and the completion--but not the raw bits of content--is piped back to the UI.
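A sketch of that flow in Python (the host request shape and the OpenAI call are stand-ins, not the actual Polymath API):

```python
import requests

def complete_with_openai(prompt: str, api_key: str) -> str:
    """Placeholder for an actual OpenAI completion call."""
    raise NotImplementedError

def mix(query: str, openai_api_key: str, host_urls: list[str]) -> str:
    """Fetch relevant bits from each host, assemble one prompt, return only the completion."""
    bits = []
    for url in host_urls:
        # Hypothetical host API: POST the query, get back relevant bits of content.
        resp = requests.post(url, json={"query": query})
        resp.raise_for_status()
        bits.extend(resp.json()["bits"])

    context = "\n".join(bit["text"] for bit in bits)
    prompt = f"Answer using this context:\n{context}\n\nQuestion: {query}"

    # The raw bits never leave this function; only the completion goes back to the UI.
    return complete_with_openai(prompt, openai_api_key)
```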

Typically we imagine that whoever hosts the UI is also hosting the mixer (and sometimes the mixer is literally running clientside in a webapp UI). But imagine a headless mixer being operated by a third-party service as a generic piece of infrastructure at a known location, e.g. https://mixer.polymath.community. It is totally stateless. With each request it takes a user's query, an OpenAI key to use, and a list of endpoints to reach out to (possibly with access tokens). Mixers have to be trusted by both hosts and UIs. Hosts have to trust that the mixer will not pass the unencrypted bits of content to anyone but OpenAI directly, and UIs have to trust that the mixer won't steal or store the user's OpenAI key or access tokens.
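A single request to such a headless mixer might look something like this (the URL path and field names are invented for illustration):

```python
import requests

# Hypothetical payload for a stateless, third-party mixer.
payload = {
    "query": "How do I set up a Polymath host?",
    "openai_api_key": "sk-my-key",  # used for this request only, never stored
    "hosts": [
        {"endpoint": "https://example.com/polymath", "access_token": "abc123"},
        {"endpoint": "https://another.example/polymath"},
    ],
}

resp = requests.post("https://mixer.polymath.community/api/mix", json=payload)
print(resp.json()["completion"])  # the completion only; never the raw bits of content
```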

A mixer would have a public key, published in a known location relative to its endpoint, and keep its private key secret.
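For instance (the well-known path here is purely hypothetical, and this assumes the key is served as raw 32-byte Curve25519 bytes, libsodium-style, via PyNaCl):

```python
import requests
from nacl.public import PublicKey

MIXER = "https://mixer.polymath.community"

# Hypothetical discovery location; the actual path would be part of the spec.
raw = requests.get(f"{MIXER}/.well-known/polymath-mixer-key").content

mixer_public_key = PublicKey(raw)  # assumes a raw 32-byte Curve25519 public key
```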

Let's imagine that certain hosts might insist on encrypting the bits of content they return, using the public key of one of a set of mixers they enumerate, so that untrusted clients reaching out directly (e.g. a UI) will not be able to read the content. Only trusted mixers will be able to decrypt it.

If the UI reaches out directly to a host that insists on encryption, it would get back a response indicating that the host requires using one of its pre-enumerated mixers. The UI would then choose which of those mixers it also trusts (remember that both the host and the UI need to trust the mixer, in different ways), reach out to it, and ask it to contact the host on its behalf. The mixer would then reach out to the host and request that the content the host returns be encrypted to its public key. The host would see that the mixer/public-key pair is one of its pre-enumerated trusted ones and encrypt the content. The mixer can then use its private key to decrypt the content, mix it in with content from other hosts, pass the whole prompt to OpenAI, and return the result to the UI.
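Here's a sketch of the encryption leg using PyNaCl sealed boxes (one possible primitive; nothing about this scheme mandates that particular library):

```python
from nacl.public import PrivateKey, PublicKey, SealedBox

# --- On the mixer: generate a keypair once; publish the public key. ---
mixer_private = PrivateKey.generate()
mixer_public = mixer_private.public_key  # this is what the host fetches and pins

# --- On the host: encrypt the bits of content to the mixer's public key. ---
bits = b'{"bits": [{"text": "some relevant content"}]}'
ciphertext = SealedBox(PublicKey(bytes(mixer_public))).encrypt(bits)
# A UI that intercepts `ciphertext` can't read it.

# --- Back on the mixer: decrypt with the private key, mix, send to OpenAI. ---
plaintext = SealedBox(mixer_private).decrypt(ciphertext)
assert plaintext == bits
```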

This scheme would require UIs to enumerate the mixers they trust and are willing to work with, and hosts to do the same, and there would have to be an overlap of mutually trusted mixers. (Hosts and UIs should never blindly trust a mixer the other party suggests, because the host and the UI each need to trust the mixer for different reasons.) Mixers wouldn't need to do much compute, so they would be cheap to operate; I imagine there would only ever need to be a handful in operation (for redundancy and a bit of decentralization) that everyone tended to trust.
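Finding a usable mixer then reduces to a set intersection; roughly:

```python
# Hypothetical trust lists; in practice the host's list would come back
# in its "encryption required" response.
ui_trusted = {"https://mixer.polymath.community", "https://mixer.other.example"}
host_trusted = {"https://mixer.polymath.community", "https://mixer.third.example"}

mutually_trusted = ui_trusted & host_trusted
if not mutually_trusted:
    raise RuntimeError("No mutually trusted mixer; the query can't proceed.")
mixer = sorted(mutually_trusted)[0]  # pick deterministically, or by preference
```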

In a world where there's some kind of widely trusted remote attestation scheme that could certify that a given mixer is running a VM associated with a certain SHA of a public git commit, hosts and UIs could trust mixers they hadn't pre-enumerated, as long as they trusted the remote attestation scheme (which would reduce down to "I trust the GCP, Azure, and Cloudflare remote attestation schemes").

All of this is just thinking out loud. It's probably wildly overcomplicated, completely unnecessary, totally naive, or some combination of all three.

dglazkov commented 1 year ago

The concept of a mixer is very important. Thank you, Alex.

dalmaer commented 1 year ago

Yah, so much of the magic is:

I'm seeing this with things like: helping me build something that uses Polaris, Remix, and Preact when I have 3 shared brains to work with.